C11BD-CW2
Submitted by Riya Sinha, H00439236
Introduction
This analysis focuses on evaluating sales and profitability trends across various customer segments, product categories, and geographical regions, utilizing a dataset from a superstore. The methodology integrates descriptive statistics, exploratory data analysis, and machine learning models, namely Linear Regression and Random Forest, to predict profitability. Insights derived from this analysis are aimed at informing business strategies for enhancing profitability and identifying key drivers of sales performance.
Methodology
The methodology began with importing and cleaning the dataset, including handling missing values and outliers. Exploratory Data Analysis (EDA) was conducted to understand sales and profit trends, customer segments, product categories, and geographical performance through visualizations like box plots, line charts, and heatmaps. Two predictive models, Linear Regression and Random Forest, were then developed and evaluated based on Mean Squared Error (MSE) and R-squared (R²) to predict profitability. The Random Forest model showed a significant improvement over Linear Regression, indicating its better suitability for capturing the complexity of the dataset.
Importing the dataset
Let's start by loading the dataset and taking a quick look at its first few rows to understand its structure and identify the columns available.
Run to view results
Run to view results
Run to view results
The dataset contains various columns, including Row ID, Order ID, Order Date, Ship Date, Ship Mode, Customer ID, Customer Name, Segment, and several others related to product details, sales, and returns.
Run to view results
Data Preparation and Cleaning
Here's a brief overview of the steps we'll take for data cleaning:
Run to view results
Run to view results
Run to view results
The conversion of Order Date and Ship Date to datetime format was successful, and the dataset's data types for these columns are now datetime64[ns].
Run to view results
At this point, the dataset is cleaned with respect to null values, duplicates, and date formatting.
Explorating Data Analysis
Box Plots
To check for outliers, we'll focus on the numerical columns: Sales, Quantity, Discount, and Profit. We can use box plots to identify outliers in these columns.
Run to view results
The box plots for Sales, Quantity, Discount, and Profit show a significant number of outliers, especially in the Sales and Profit columns. These outliers are points that fall well outside the upper and lower whiskers of the box plots, indicating that they are significantly higher or lower than the majority of the data points.
Interquartile Range (IQR)
After visualising, Interquartile Range (IQR) method is used to identify outliers numerically, we'll calculate the IQR for each of the specified variables: Sales, Quantity, Discount, and Profit (Tukey, J. W., 1977). Outliers are typically defined as data points that fall below Q1−1.5 X IQR or above Q3+1.5 X IQR where Q1 and Q3 are the first and third quartiles, respectively (Hubert & Vandervieren, 2008).
Run to view results
Removing Outliers
We will not remove outliers from 'Sales' and 'Profit' as it may result in the loss of key insights about best-performing categories or most profitable transactions. These outliers can also provide information about what works best for the business or help us identify greatest opportunites or challenges.
By only removing outliers from Discount and Quantity, we can clean data that supports accurate analysis and decision-making for typical business operations, while keeping the insightful data that extreme Sales and Profits values can provide.
Run to view results
Sales and Profit Trends Chart
To create a visualization showing sales and profit trends over time, we will use line charts (Few, S., 2009). Line charts are ideal for this purpose because they clearly depict how values change over a period, allowing us to observe trends, patterns, and fluctuations in sales and profits across the dataset's timeframe (Tufte, E. R., 2001).
Run to view results
It is observed that Profits remain positive throughout the period, which is a good sign of overall business health.
The lack of a direct proportionate increase between sales and profit indicates that higher sales do not always translate into equivalently higher profits. This could be due to various factors such as higher costs, discounting strategies, or a sales mix skewed towards less profitable products.
There are a few sharp spikes in sales that do not have corresponding spikes in profit. For instance, a spike in sales around the start of 2015 and mid-2016 does not result in a noticeable increase in profit. This proposes an investigation into sales activities or promotions that drive volume without enhancing profitability.
Profitablility Analysis
1. Average Profit by Category - To compare the average profit by product category.
Run to view results
Run to view results
As we analyse, Technology is the most profitable category with an average profit of $89.33, suggesting an increase in resources like inventory, marketing, or product expansion could be beneficial.
Office Supplies, with an average profit of $30.27, ranks second, over double Furniture's profit but far less than Technology. Exploring the product mix for higher-margin items could boost profitability.
Furniture has the lowest profit at $13.81, indicating a need to review pricing, costs, sales strategies, and competition.
2. Total Profit by Sub-Category:
Run to view results
Run to view results
The sub-categories 'Binders', 'Copiers', and 'Phones' are the top performers in terms of profit generation. 'Binders' lead with a significant margin, followed by 'Copiers' and 'Phones', which suggests that these areas are the most profitable for the business.
Several sub-categories such as 'Envelopes,' 'Art,' 'Labels,' 'Fasteners,' 'Bookcases,' 'Supplies,' and notably 'Tables' have low profitability. 'Tables' uniquely incur losses, significantly affecting total profit of -$15,822.
The profitability gap among sub-categories suggests differences in pricing, costs, demand, and competition. Addressing the loss in 'Tables' is crucial for enhancing overall profitability.
Customer Segment Analysis
1. To represent the proportion of sales coming from different customer segments with the help of pie chart.
2. To identify which customer is contributing the most to the sales through bar chart.
Run to view results
Run to view results
The Consumer segment accounts for the majority of sales with 50.1%, Corporate comes next with 31.2%, and Home Office has the smallest share at 18.7%. The overall total sales amount to $2,131,828.31. This distribution indicates that marketing and sales efforts might be most effectively targeted at the Consumer segment, given its dominance in sales contribution.
Run to view results
Run to view results
The bar chart showcases the top 10 customers by total sales, with Sean Miller generating the highest sales at $24,943.46, and Todd Sumrall the least within this group at $11,681.77. The sales values are closely grouped, indicating a relatively consistent spending pattern among these top customers.
Geographical Analysis
1. Average Profit by Region: This part of the analysis would provide insights into which regions are the most and least profitable.
2. Average Profit by City: Similar to the regional analysis, examining average profit by city would highlight specific urban markets where the business is performing well versus those where profitability is lagging.
By comparing average profits across regions and cities, the business can identify patterns and potential causes behind regional performance differences.
Run to view results
Run to view results
Profit margins are relatively consistent across regions, with the East leading slightly at an average profit of $40.59. The South has the lowest average profit at $36.85, which suggests room for improvement or different market conditions.
Run to view results
Among cities, Jamestown stands out with a significantly higher average profit of $642.89, which is remarkably greater than the others. Independence and Lafayette also show substantial average profits. These cities could be key strategic areas due to their high profitability. Minneapolis and Appleton follow with strong average profits as well, indicating they are also important markets for the business.
Category-Geography Heatmap - It depicts average sales by product category across different regions.
Run to view results
Discount Impact Analysis
To analyze the impact of discount levels on sales and profitability, we'll use scatter plots (Wilkinson, L., 2005). Scatter plots are ideal for this analysis as they can visually depict the relationship between two variables, in this case, discount levels and either sales or profits. We'll create two scatter plots:
Run to view results
Discount vs. Sales We see a wide distribution of sales values across all discount levels, with no clear upward or downward trend as the discount increases. This suggests that discounts do not have a straightforward effect on increasing sales. High sales values are seen across various discount levels, including lower discounts. Notably, there aren't many instances of very high sales at the highest discount levels (around 0.5), which could imply that larger discounts do not necessarily lead to proportionately larger sales volumes.
Discount vs. Profit The occurrence of negative profit (losses) seems to be more frequent and more severe as the discount approaches 0.5. This indicates that higher discounts are associated with lower, often negative, profits. The presence of many data points at lower discounts with positive profits suggests that moderate discounting can be sustainable, but there is a threshold beyond which discounts are likely damaging profitability.
The business may benefit from finding the optimal discount rate that maximizes sales while maintaining a healthy profit margin.
Modelling Strategy
Linear Regression
We'll use a linear regression model as our starting point to predict Profit based on the identified variables.
Run to view results
The initial regression model, designed to predict profitability based on sales, discount, segment, and product categories, has been evaluated and shows the following performance metrics on the test set:
Mean Squared Error (MSE): 21653.28 R-squared (R²): 0.63
The MSE is a measure of the average squared difference between the actual and predicted values, indicating the prediction error's magnitude (Willmott & Matsuura, 2005). A lower MSE indicates a better fit to the data. In our case, the MSE value suggests that the model's predictions deviate from the actual profits by a relatively large margin on average.
R² provides a measure of how well the variability of the dependent variable is explained by the model. An R² of 0.63 means that approximately 63% of the variability in the dependent variable can be explained by the model. This suggests that the model has a moderate level of predictive power.
While this is a positive indication that some relationship has been captured, a relatively moderate value of R² value also suggests there's a room for improvement in model performance. Thus, we will explore more complex models beyond linear regression, such as random forests which might capture non-linear relationships more effectively (Breiman, L., 2001).
Random Forest Regressor
We will now implement and evaluate a Random Forest model using our dataset.
Run to view results
The Random Forest model shows significantly improved performance metrics on the test set compared to the initial linear regression model:
Mean Squared Error (MSE): 7,999.03 R-squared (R²): 0.86
These results suggest a substantial improvement in model performance. An MSE of approximately 7,999.03 indicates a lower average error in the profit predictions, which means the Random Forest model is quite effective in capturing the relationships between the variables and the profitability of the business.
An R² value of 0.86 is particularly noteworthy. It indicates that approximately 86% of the variance in profitability can be explained by the model's inputs, showcasing a strong predictive power of the Random Forest model on this dataset.
Interpretation
To make informed predictions about the profitability of different products and customer segments using the Random Forest model we've developed, following steps have been followed:
1. Consumer Segment, Office Supplies, Sales: $500, Discount: 10% 2. Consumer Segment, Technology, Sales: $1000, Discount: 5% 3. Consumer Segment, Furniture, Sales: $750, Discount: 15%
In future, this could be for new products you're considering introducing, existing products in new segments, or hypothetical scenarios to understand potential profitability impacts.
Run to view results
Next Steps
Review the Results: Pay attention to any predicted profitability values, especially look for extremes (very high or very low values) that might indicate particularly profitable or unprofitable products or segments.
Analysis and Insights: Derive insights from these predictions. Are there any patterns or trends in the data? Which customer segments or product categories appear most profitable? Are there any surprises or unexpected results that could lead to new questions or investigations?
Business Strategy: Consider how these insights can influence your business strategy. This might involve adjusting product offerings, reevaluating pricing strategies, focusing marketing efforts on certain segments, or even exploring new market opportunities.
Communicate Findings: Share the insights with relevant stakeholders in your organization. Use the data to support your recommendations for strategic changes or further investigations.
Conclusion
The analysis revealed the Random Forest model as a superior predictor of profitability, achieving an R² of 0.86, suggesting a strong ability to explain the variance in profitability. The exploration highlighted the impact of customer segments, product categories, and discounts on sales and profits. For instance, technology products emerged as highly profitable across regions, and discounts were found to have a nuanced impact on profitability. These insights are crucial for strategic decision-making, guiding the superstore in optimizing product offerings, marketing strategies, and discounting practices to enhance overall profitability.
References
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
Few, S. (2009). Now you see it: simple visualization techniques for quantitative analysis. Analytics Press.
Hubert, M., & Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis, 52(12), 5186-5201.
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
Tufte, E. R. (2001). The visual display of quantitative information (Vol. 2, No. 9). Cheshire, CT: Graphics press.
Wilkinson, L. (2005). The grammar of graphics. Springer Science & Business Media.
Willmott, C. J., & Matsuura, K. (2005). Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate research, 30(1), 79-82.