Introduction
The surge in e-commerce has underscored the importance for online retailers to efficiently manage inventory, anticipate demand, and refine sales strategies. For a UK-based online retail company specializing in unique all-occasion giftware, comprehending sales patterns and forecasting future demand is vital to ensure customer satisfaction, avoid stockouts, and minimize surplus inventory. Nevertheless, the company encounters difficulties in precisely predicting sales due to fluctuating customer demand, seasonal variations, and substantial wholesale orders. This project seeks to tackle the challenge of forecasting future sales by examining historical transaction data from the company's online store. The primary objectives of this project are: 1. To analyze historical sales data to identify trends, patterns, and key insights. 2. To forecast future sales using various time series models and machine learning techniques. 3. To provide actionable recommendations
Requirements
1. Python (for data manipulation, analysis, and visualization) 2. Pandas (for data manipulation and analysis) 3. NumPy (for numerical operations) 4. Matplotlib and Seaborn (for data visualization) 5. Scikit-learn (for machine learning models) 6. Statsmodels (for time series analysis) 7. Google Drive (for data storage in google sheet file) 8. Jupyter Notebook (for interactive data analysis and visualization)
Problem Statement
The surge in e-commerce has underscored the importance for online retailers to efficiently manage inventory, anticipate demand, and refine sales strategies. For a UK-based online retail company specializing in unique all-occasion giftware, comprehending sales patterns and forecasting future demand is vital to ensure customer satisfaction, avoid stockouts, and minimize surplus inventory. Nevertheless, the company encounters difficulties in precisely predicting sales due to fluctuating customer demand, seasonal variations, and substantial wholesale orders. This project seeks to tackle the challenge of forecasting future sales by examining historical transaction data from the company's online store.
Achievement
Comprehensive Data Cleaning: Successfully handled missing values, duplicates, and outliers, ensuring the accuracy and reliability of the sales data. Seasonal Sales Insights: Identified clear seasonal patterns in sales, with peaks around the holiday season, aiding in strategic planning. Top Product Analysis: Determined the top-selling products, providing valuable insights for inventory and marketing strategies. Regional Sales Performance: Analyzed sales performance across different countries and regions, highlighting the United Kingdom and Europe as top performers. Accurate Sales Forecasting: Tested multiple forecasting models (ARIMA, SARIMA, Holt-Winters, RandomForest) and identified the Holt-Winters model as the most accurate, with a MAPE of 21.87%. Actionable Recommendations: Provided strategic recommendations for inventory management, marketing, and resource allocation based on the analysis and forecasts. Visualization: Created clear and informative visualizations to illustrate sales trends, top products, and forecasted sales, aiding in data-driven decision-making. Model Performance Evaluation: Conducted a thorough evaluation of model performance, including sensitivity analysis, to ensure the robustness of the forecasting model. Strategic Planning Support: Equipped decision-makers with actionable insights and accurate forecasts to optimize sales strategies and improve overall business performance.
Data Preparation
This code lists the contents of the directory located at '/datasets/robertadrive'. It essentially shows the names of all files and folders present in that directory.
Importing Data
Explatory Data Analyis
DATA SHAPE
This code returns the shape of the DataFrame 'sales_data', which provides the dimensions of the dataset in the form (number of rows, number of columns).
SUMMARY STATISTICS
This code generates summary statistics for all columns in the 'sales_data' DataFrame, including both numeric and non-numeric columns. It then stores these statistics in the 'summary_statistics_all' DataFrame and displays them.
MISSING VALUES
The dataset contains missing values in the `Description` and `Customer ID` columns.
DATA TYPES
The data types for each column in the `sales_data` DataFrame are as follows: - Invoice: object - StockCode: object - Description: object - Quantity: int64 - InvoiceDate: datetime64[ns] - Price: float64 - Customer ID: float64 - Country: object
OUTLIERS
This code reads the 'sales_data' DataFrame and visualizes the distribution of 'Quantity' and 'Price' columns using boxplots to identify outliers. It first creates a figure with two subplots for 'Quantity' and 'Price' boxplots. Then, it calculates the Interquartile Range (IQR) for 'Quantity' and 'Price' using their 25th and 75th percentiles. Using the IQR, it computes the lower and upper bounds for outlier detection. These bounds are stored in the 'outliers_info' dictionary for both 'Quantity' and 'Price'.
The boxplot analysis and the calculation of the interquartile range (IQR) for the `Quantity` and `Price` columns reveal the following: 1. Quantity: - Lower Bound: -12.5 - Upper Bound: 23.5 - Any quantity value below -12.5 or above 23.5 is considered an outlier. 2. Price: - Lower Bound: -3.19 - Upper Bound: 8.65 - Any price value below -3.19 or above 8.65 is considered an outlier. These bounds help identify outliers in the dataset for the `Quantity` and `Price` columns.
NOT OUTLIERS
There are 432,647 records that are not outliers in the dataset.
%AGE OF OUTLIERS
The proportion of outliers to total records is: - Quantity Outliers: 11.01% - Price Outliers: 6.71%
DUPLICATES
There are 6,865 duplicate records in the dataset, which constitutes approximately 1.31% of the total records.
Data Cleaning
CORRECTING DATA TYPES
The data types for `Customer ID` and `InvoiceDate` have been successfully changed.
REPLACING MISSING VALUES
All missing values in the 'Description' and 'Customer ID' columns have been successfully replaced.
REMOVING DUPLICATES
All duplicate records have been successfully removed.
HANDLING OUTLIER
Outliers have been successfully removed. The cleaned dataset now contains 426,820 records.
Sales Data Aggregation & Analysis
MONTLY TOTAL SALES
MONTHLY SALES TREND
The plot of monthly sales trends shows the total sales for each month from December 2009 to December 2010. Here's a summary of the observed trends: December 2009: The sales start at a high point, likely due to the holiday season. January to February 2010: There is a noticeable drop in sales after the holiday season. March to April 2010: Sales begin to recover and show an upward trend. May to August 2010: Sales remain relatively stable with minor fluctuations. September to October 2010: There is a significant increase in sales, possibly due to preparations for the upcoming holiday season. November 2010: Sales peak, likely due to holiday shopping. December 2010: There is a sharp decline in sales, which might be due to the end of the holiday shopping period. Overall, the sales data shows a seasonal pattern with peaks around the holiday season and lower sales in the months following the holidays.
TOP 10 PRODUCT SALES
TOP 10 COUNTRIES SALES
SALES PER REGION
INSIGHTS FROM SALES ANALYSIS
Monthly Sales Trends: - The sales data shows a seasonal pattern with peaks around the holiday season and lower sales in the months following the holidays. - December 2009 and November 2010 have the highest sales, likely due to holiday shopping. - There is a noticeable drop in sales in January and February 2010 after the holiday season. Top 10 Product Sales: - The top-selling product is "WHITE HANGING HEART T-LIGHT HOLDER" with total sales of 52,879.37. - Other popular products include "VINTAGE UNION JACK BUNTING", "EDWARDIAN PARASOL NATURAL", and "HOT WATER BOTTLE KEEP CALM". - The top 10 products contribute significantly to the overall sales. Country Sales: - The United Kingdom is the top-performing country with the highest total sales. - Other top-performing countries include Germany, France, and the Netherlands. - The sales from the top 10 countries contribute the majority of the total sales.
Time Series Analysis For Sales Forecast
SALES TREND
FINDINGS
The plot of monthly sales trends shows the total sales for each month from December 2009 to December 2010. Here's a summary of the observed trends: December 2009: The sales start at a high point, likely due to the holiday season.
January to February 2010: There is a noticeable drop in sales after the holiday season. March to April 2010: Sales begin to recover and show an upward trend. May to August 2010: Sales remain relatively stable with minor fluctuations. September to October 2010: There is a significant increase in sales, possibly due to preparations for the upcoming holiday season. November 2010: Sales peak, likely due to holiday shopping. December 2010: There is a sharp decline in sales, which might be due to the end of the holiday shopping period. Overall, the sales data shows a seasonal pattern with peaks around the holiday season and lower sales in the months following the holidays.
Sales Forecast With ARIMA model
This code fits an ARIMA model to the 'TotalSales' column of the 'monthly_sales' DataFrame, using the order (1, 1, 1) which specifies the parameters for the ARIMA model. It then forecasts the total sales for the next 12 months and stores these predictions in the 'arima_forecast' variable.
The ARIMA model has successfully forecasted the total sales for the next 12 months. Here are the predicted sales: January 2011: 368,761.49 February 2011: 381,223.25 March 2011: 382,111.29 April 2011: 382,174.57 May 2011: 382,179.08 June 2011: 382,179.40 July 2011: 382,179.43 August 2011: 382,179.43 September 2011: 382,179.43 October 2011: 382,179.43 November 2011: 382,179.43 December 2011: 382,179.43
ARIMA model's performance
The ARIMA model's performance analysis is as follows: Model Summary: Dependent Variable: TotalSales Number of Observations: 13 Model: ARIMA(1, 1, 1) Log Likelihood: -159.063 AIC: 324.126
BIC: 325.581 HQIC: 323.587 Covariance Type: opg Performance Metrics: Mean Squared Error: 28,514,153,247.34 Mean Absolute Error: 116,758.40 Root Mean Squared Error: 168,861.34 The model summary and performance metrics indicate the fit and accuracy of the ARIMA model on the sales data.
ARIMA Accuracy Level
The accuracy level of the ARIMA model, measured by the Mean Absolute Percentage Error (MAPE), is approximately 34.36%.
Explanation for Accuracy Level
The Mean Absolute Percentage Error (MAPE) of 34.36% indicates that, on average, the ARIMA model's predictions deviate from the actual values by 34.36%. This level of error suggests that the model's accuracy is not very high. In general, a lower MAPE value indicates a more accurate model. A MAPE of less than 10% is considered highly accurate, 10-20% is good, 20-50% is acceptable, and greater than 50% is poor. Therefore, a MAPE of 34.36% falls into the "acceptable" range but is not considered highly accurate.
Sensitivity Analysis on the ARIMA Model Parameters
To determine how changes in the order of the model affect the forecast accuracy.
The sensitivity analysis on the ARIMA model parameters shows the following Mean Absolute Percentage Error (MAPE) for different orders: ARIMA(1, 1, 1): MAPE = 34.36% ARIMA(2, 1, 2): MAPE = 34.29% ARIMA(0, 1, 1): MAPE = 34.47% ARIMA(1, 1, 0): MAPE = 37.59% ARIMA(2, 1, 0): MAPE = 34.06% The ARIMA(2, 1, 0) model has the lowest MAPE, indicating it provides the most accurate forecast among the tested models.SARIMA and compare their performance with the ARIMA model.
Sales Forecast With SARIMA Model
The SARIMA model's performance analysis is as follows: Performance Metrics: Mean Squared Error: 50,870,034,237.94 Mean Absolute Error: 135,157.76 Root Mean Squared Error: 225,543.86 Mean Absolute Percentage Error: 48.16% The SARIMA model has a higher Mean Absolute Percentage Error (MAPE) compared to the ARIMA model, indicating that the ARIMA model provides more accurate forecasts for this dataset.
Sales Forecast With Holt-Winter Model
The Holt-Winters model's performance analysis is as follows: Performance Metrics: Mean Squared Error: 14,097,225,979.37 Mean Absolute Error: 81,012.46 Root Mean Squared Error: 118,731.74 Mean Absolute Percentage Error: 21.87% Next, let's explore a machine learning model for forecasting.
Sales Forecast With RandomForestRegressor
The RandomForest model's performance analysis is as follows: Performance Metrics: Mean Squared Error: 22,749,084,001.48 Mean Absolute Error: 123,460.31 Root Mean Squared Error: 150,827.99 Mean Absolute Percentage Error: 23.52% Comparing the models: ARIMA: MAPE = 34.36%
SARIMA: MAPE = 48.16% Holt-Winters: MAPE = 21.87% RandomForest: MAPE = 23.52% The Holt-Winters model provides the most accurate forecast among the tested models.
Factors That Made The Holt-Winters Model The Best
Seasonality Handling: The Holt-Winters model is specifically designed to handle seasonality in time series data. Given that the sales data exhibits clear seasonal patterns, the Holt-Winters model can effectively capture and forecast these seasonal trends.
Trend Adjustment: The Holt-Winters model includes components for both trend and seasonality. This allows it to adjust for underlying trends in the data, providing more accurate forecasts compared to models that do not account for trends.
Lower Error Metrics: The Holt-Winters model achieved the lowest Mean Absolute Percentage Error (MAPE) among the tested models. This indicates that its forecasts are closer to the actual values, making it the most accurate model for this dataset.
Based on the analysis, the Holt-Winters model is recommended for forecasting sales data due to its ability to handle seasonality and trends effectively. It provides the most accurate forecasts among the tested models, as evidenced by its lower error metrics. For future forecasting, it is advisable to continue using the Holt-Winters model and periodically re-evaluate its performance as new data becomes available.
Sales Forecast By Holt-Winter Model
The plot shows the actual sales from December 2009 to December 2010 and the forecasted sales for the next 12 months using the Holt-Winters model. The forecasted sales are represented by the dashed line, while the actual sales are represented by the solid line. This visualization is comparing the model's predictions with the actual sales data.
Forecasted Sales
The insights from the predictions using the Holt-Winters model are as follows: 1. The forecasted sales for the next 12 months show a steady increase in total sales. 2. The model predicts a consistent upward trend in sales, indicating potential growth in the upcoming year. 3. The forecasted sales values are relatively stable, with no significant fluctuations, suggesting a period of steady sales performance. 4. The highest forecasted sales are expected towards the end of the year, which aligns with the historical trend of increased sales during the holiday season. These insights can help in planning inventory, marketing strategies, and resource allocation to capitalize on the predicted sales growth.
Recommendations
Based on the analysis and forecasts, here are some recommendations for decision-makers: Inventory Management: - Increase inventory levels in anticipation of the holiday season, as sales are expected to peak towards the end of the year. - Maintain a steady inventory throughout the year to meet the consistent demand predicted by the Holt-Winters model. Marketing Strategies: - Plan marketing campaigns around the holiday season to capitalize on the peak sales period. - Consider promotions and discounts during the months with lower sales to boost revenue. Resource Allocation: - Allocate more resources (e.g., staff, budget) during the peak sales months to handle increased demand. - Optimize resource allocation during the steady sales months to maintain efficiency. Product Focus: - Focus on the top-selling products identified in the analysis to maximize sales.
Summary
The analysis of the UK-based online retail company's sales data revealed several key insights: Monthly Sales Trends: Sales exhibit a seasonal pattern with peaks around the holiday season and lower sales in the months following the holidays. December 2009 and November 2010 had the highest sales. Top 10 Product Sales: The top-selling product was "WHITE HANGING HEART T-LIGHT HOLDER" with total sales of 52,879.37. The top 10 products significantly contributed to overall sales. Country Sales: The United Kingdom was the top-performing country with the highest total sales, followed by Germany, France, and the Netherlands. Region Sales: Europe was the leading region in terms of total sales, contributing significantly more than other regions. Data Cleaning: Missing values in the 'Description' and 'Customer ID' columns were replaced, duplicate records were removed, and outliers in the 'Quantity' and 'Price' columns were handled.
Sales Forecasting: Various models were tested for forecasting sales, including ARIMA, SARIMA, Holt-Winters, and RandomForest. The Holt-Winters model provided the most accurate forecasts with a Mean Absolute Percentage Error (MAPE) of 21.87%. Recommendations: Based on the analysis, it is recommended to increase inventory levels during the holiday season, plan marketing campaigns around peak sales periods, allocate more resources during high-demand months, and focus on top-selling products to maximize sales. .
Conclusion
The analysis of the UK-based online retail company's sales data provided valuable insights into sales trends, top-performing products, and regional sales performance. The data cleaning process ensured the accuracy and reliability of the analysis by addressing missing values, duplicates, and outliers.
Key findings include: 1. Monthly sales exhibit a seasonal pattern with peaks around the holiday season. 2. The top-selling product is "WHITE HANGING HEART T-LIGHT HOLDER," and the top 10 products significantly contribute to overall sales. 3. The United Kingdom is the top-performing country, with Europe being the leading region in terms of total sales.
For sales forecasting, various models were tested, and the Holt-Winters model was identified as the most accurate, with a Mean Absolute Percentage Error (MAPE) of 21.87%. This model effectively captures seasonality and trends, making it suitable for future sales predictions.
Recommendations for decision-makers include optimizing inventory management, planning marketing strategies around peak sales periods, and focusing on top-selling products to maximize revenue.
Overall, the analysis and forecasting provide a comprehensive understanding of sales performance and offer actionable insights for strategic planning and decision-making.