Coursework 2 C11BD
Djordje Ilic
H00452528
Introduction
The following report aims to analyse the operations of Superstore and gain insights on different aspects of the company's operations that have an effect on company's profitability. To cover this topic in a substantial way, Multiple Linear Regression model is applied in addition to Random Forest Regressor, to create a comparative view and choose the approach that is suited better for analysis of the company's profitability and factors that play a significant role in it.
Therefore, the report contains sections that cover the importation of data, data cleaning, visualisation of the elements regarded as important for answering of the profitability question and lastly, the models that are used to determine the key factors and their effects.
Importing the data
The data in question contains 29 columns with a variation of data types, including different types of identifiers (Row ID, Order ID and Customer ID), datetime data related to orders (Order Date and Ship Date), and information related to products which is numerical (Sales, Quantity, Discount, Profit) and nominal categorical (Product Name, Category, Ship Mode and Sub-Category). Furthermore, the data set contains boolean data regarding product returns, marking it as True or False (Ustundag et al.,2022).
Data Cleaning
The data cleaning process started with checking clarity of data and generally describing the data to get an overview. Thereafter the data was checked for duplicates and missing data, both of which resulted with 0. Lastly, the Interquantile method is used to detect and remove outliers, in order to prepare the data for interpretation. It is worth mentioning that a lenient IQR approach has been taken, given that 3 was chosen as multiplier after realisation that 1,5 removed a significant amount of data in this case. This raised concerns that some of the valid and needed data points might be erased from the data set together with outliers. In order to avoid the potential negative effect on predictive models, the more lenient multiplier was used, which resulted in removal of 1317 rows of data.
The figures show distributions in three numerical categories (segments) that directly affect profit, and profit. Profit boxplot shows the median value of profit on the horizontal line in the middle, while the outliers can be detected outside of this line. Furthest from the horizontal line lie the extreme outliers, which are possibly unusually large transactions or losses compared to rest of the data. The quantity boxplot indicates that most of the orders are in small quantities, while the outliers imply that some orders are of very high quantity compared to others. Sales boxplot shows a large number of outliers, with some extremely large transactions, while most sales values are concentrated towards the lower part. The blue part on Discount boxplot indicates that a significant portion of data lies around 0 % discount. However, there are outliers, indicating transactions with unusually high discount which may have affected the profit negatively.
Summary Statistics of Clean Data
Profit related segments:
The average profit in clean data set is 13.42$, with standard deviation of 26.8$. Furthermore, a large variation of values is detected, with minimum being -81$, and maximum being 112.2$ after outlier removal.
The average sales amount to 97.3$, with standard deviation of 134.2$. The sales range from 0,44$ to 783,96$, indicating that although extreme outliers have been removed, there is still significant variation in transactions.
The average quantity per transaction is 3.58 units with standard deviation of 2.10. Quantities sold range from 1 to 14. This means that majority of purchases include only a small number of units in the data set without outliers.
The average discount is 15% while standard deviation is 20%. Discounts range from 0 to 80% at maximum, which indicates that most of the discounts are relatively low, although there are instances where discounts are very high in the dataset even after removal of outliers.
Finally, the 8.1% of the products sold have been returned to the company, while 91.9% were kept by customers in the cleaned dataset.
In conclusion, the summary statistics imply that this is a normalised dataset, as most of the extreme values have been taken out of data, providing a good fundament for development of predictive models.
Plotting the data
The bar chart illustrates product categories' contribution to total profit. It is clear that the biggest profit contributor is the Office Supplies segment, followed by Technology. Furnitures is in the third place with approximately half of Technology value, whereas Technology is slightly above half of the value of Office Supplies. Given that the differences between the three categories are substantial, it shows that there is likely room for improvement in the Furnitures and Technology segments. It is worth mentioning that the spread among Office Supplies might not be that large, as there may be a small number of products that drive most of the profit. That is also something that should be examined and determined in order to support further strategic actions.
The Sales and Profit scatter chart shows the relationship between sales and profit with customer segments included for better understanding of where the profits are generated. There appears to be a positive correlation between sales and profit for most of the data points, indicating that an increase in sales tends to result in increased profits. However, this correlation seems to diminish in the higher sales value in some instances, showing that increased sales do not necessarily give higher profits, possibly because of discounts that are offered for higher quantities where the mark-up is lower in total.
Looking at customer segments, it is observable that all three segments contribute to both profit and loss-making transactions. The most frequent segment is the Consumer segment, which suggests that consumers have the largest number of transactions or that consumers are the largest segment in terms of sales volume. Moreover, Corporate and Home Office segments also show a wide distribution of sales and profit, but there appears to be fewer transactions compared to Consumer segment.
Interestingly, all three segments have sales that are associated with negative profit. This is an indication that company might have not priced their products correctly, or had high costs connected to these sales.
Modelling
Multiple Linear Regression Model
Multiple Linear Regression is an appropriate model to use in cases when it is assumed that the relationship between the independent variables and dependent variable is linear, meaning that changes in the independent variables result in proportionate changes in the dependent variable. In this case, the chosen independent variables are Sales, Quantity and Discount while the independent variable is Profit. In addition, one more variable is created, Sales/Quantity, in attempt to capture the average sale value per unit, which can help account for the bulk purchases' effect on the profit. Purchases in large quantities might be made with discounts, lowering the profit margins and thereby negatively affecting profit. By including this feature, the model will differentiate between high sales from bulk purchases and sales of fewer items at higher price (Kelleher et al., 2020).
The metrics MSE, RMSE and R^2 are selected to measure the average error magnitude, goodness of the fit and the scale of the errors in each of the independent variables units. In this way, the evaluation of the model can be performed based on the results.
The results of the model show that the performance is not on a high level. With MSE measuring the average squares of errors, that means that results are given in squared dollars in case of profit, which is not as practical as the other measures. RMSE being 21.6 means that standard deviation of prediction errors is 21.6$, in turn meaning, that the predictions are about 21.6$ away from actual profit figures, indicating imprecision. On the other hand, R^2 is 33,3%, meaning that approximately one third of variability in the profit can be explained by the model. This implies that the goodness of the fit is not high, and possibly more features should be added to capture more variability.
The figure illustrates the performance of the model, given that the diagonal line represents perfect prediction. Hence, the data points should ideally be aligned towards the line. However, the model shows a decent predictive capacity when it comes to points of profit value around 0 and up to 25$, and less capacity in the higher profit value.
Random Forest Regressor
Random Forest Regressor was chosen for the second model, due to its ability to handle non-linearity and therefore complement the Multiple Linear Regression model and its findings. Due to the lower R^2 score of the linear regression model, there is a reason to suspect that the relationship between variables is not linear (Tan et al., 2019).
For the evaluation, MSE, RMSE and R^2 are selected again, as a comparative analysis can be performed considering both models. Furthermore, the number of estimators is selected to be 100, as it serves as a good balance between having enough trees in the random forest and being computationally manageable number of trees, whereas random state is set at 42 randomly.
The results of the model indicate a better fit when compared to Multiple Linear Regression model. Firstly, MSE amounts to 282,48 compared to MLRs 466,78, meaning that Random Forest Regressor makes better predictions on average. Moreover, model's RMSE is 16.8. This value is relatively low, and means that model's predictions are on average 16.8$ away from actual profit. This is an improvement compared to MLR model, although Random Forest Regressor itself is not optimised. Finally, R^2 is 59.66%, implying that the model covers almost 60% of the variability. Therefore, 59.66% of changes in the profit can be accounted for by the model, indicating a moderate level of prediction.
The visual representation of Random Forest Regressor shows the relationship between predicted and actual profit. The diagonal line represents perfect prediction. Due to the large number of points, alpha = 0.5 is included in the chart for better transparency. There is a visible concentration of points around the line of perfect prediction, especially near the center where the actual profits are lower. This means that the model is more accurate at predicting for transactions with lower profits. Although similar to Multiple Linear Regression model, Random Forest Regressor is clearly more fit and therefore more reliable. However, the model still contains errors, as the points away from the line represent those errors. The distance between the line and the point represents how big the error is. It appears that the spread is wider as the profits increase, meaning that the model is less accurate for higher profit values.
Feature Importance
Lastly, Random Forest Regressor was used to determine feature importance when it comes to profit. As presented on the figure below, Sales has the greatest effect on profit with 46%, followed by Discount with 31%. Sales/Quantity comes in third place with 19.1%, further supporting the argument that bulk purchases affect the profit in the case of Superstore. Quantity comes in fourth place with 3.66%.
Recommendations:
On a strategic level, the company in question should leverage the insights gained from the models to increase its profits. Given that discounts have a large impact on profit, these should be controlled and preferably held low. Moreover, the focus should be on expanding Office Supplies, as well as improving Furniture by investigating why it is less profitable and how they could improve margins. Another possibility is to leverage Technology with other categories to drive sales.
The company should further optimise operational efficiency that drives the costs, and optimise pricing for better margins. Lastly, incorporate customer segmentation to tailor sales and marketing more effectively, thereby targeting the high value customer segment with personalised offers to maximise revenue.
Conclusion
In conclusion, the insights provided by the two models used in this report identify certain aspects in company's operations that have a strong effect on profitability, with sales and discounts being the two most influential. Although both models have room for improvement, the combination of linear and non-linear relationships between the key features helps understand what strategic adjustments the company should perform to increase profits. Namely, focus on most profitable product segment and improvement of the less profitable, systematic reduction of costs and focus on high value customer segments.
References:
Kelleher, J. D., Mac Namee, B., & D'arcy, A. (2020). Fundamentals of machine learning for predictive data analytics: Algorithms, worked examples, and case studies. MIT press. 2nd Edition
Tan, P. N., Steinbach, M., & Kumar, V. (2020). Introduction to data mining. Pearson Education. 2nd Edition, Global Edition
Ustudang, A.,Cevikcan, E., Beyca, O.F. (2019). Business Analytics for Professionals. Springer.