CourseWork 2 - H00444598
Introduction
This study analyzes predictive models developed to understand the factors that influence profit in the dataset. Exploratory data analysis, feature engineering, and Linear Regression and Random Forest Regressor models show how variables such as sales, quantity, and discount affect profit. Preprocessing includes transforming the profit variable to handle negative values and standardizing features to improve model performance. The report also visualizes model performance and the importance of the individual features for profit prediction. The analysis is intended as a foundation for data-driven decisions aimed at increasing profitability.
Importing Data
Data cleaning
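The data-cleaning process for this analysis consisted of four steps, each covered in its own subsection: checking for missing values, checking for duplicates, correcting data types, and removing extreme outliers.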
These cleaning and preprocessing steps are essential for optimizing the dataset for modeling, helping to enhance model performance and ensure more accurate predictions.
Checking for Missing Values
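The original code cell is not preserved in this export, so the following is only a minimal sketch of how such a check might look with pandas. The toy values are made up for illustration; only the column names `Sales`, `Quantity`, `Discount`, and `Profit` come from the report.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
df = pd.DataFrame({
    "Sales": [120.0, 35.5, np.nan, 410.0],
    "Quantity": [2, 1, 3, 5],
    "Discount": [0.0, 0.2, 0.0, np.nan],
    "Profit": [30.0, -4.1, 12.5, 88.0],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(len(rows_with_missing))
```

Whether affected rows should then be dropped or imputed depends on how many there are and which columns they touch.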
Checking for Duplicates
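As an illustration only (the report's own code is not shown here), exact duplicate rows could be counted and removed with pandas roughly as follows. The toy data is an assumption.

```python
import pandas as pd

# Toy frame; the third row repeats the first one exactly.
df = pd.DataFrame({
    "Sales": [120.0, 35.5, 120.0, 410.0],
    "Quantity": [2, 1, 2, 5],
    "Discount": [0.0, 0.2, 0.0, 0.1],
    "Profit": [30.0, -4.1, 30.0, 88.0],
})

# duplicated() flags every repeat of an earlier row (the first occurrence is kept).
n_duplicates = int(df.duplicated().sum())

# Drop the repeats and rebuild a clean 0..n-1 index.
deduplicated = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```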
Correcting Data Types
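The report does not list which columns were converted, so the sketch below is hypothetical: `Order Date` and `Category` are assumed column names, and the values are invented. It shows one common way to coerce text columns to dates, numbers, and categories with pandas.

```python
import pandas as pd

# Everything arrives as text, as it often does after loading a raw export.
raw = pd.DataFrame({
    "Order Date": ["2017-01-03", "2017-02-14", "not a date"],  # hypothetical column
    "Sales": ["120.0", "35.5", "410"],
    "Quantity": ["2", "1", "5"],
    "Category": ["Furniture", "Technology", "Furniture"],  # hypothetical column
})

typed = raw.copy()
# errors="coerce" turns unparseable entries into NaT/NaN instead of raising.
typed["Order Date"] = pd.to_datetime(typed["Order Date"], errors="coerce")
typed["Sales"] = pd.to_numeric(typed["Sales"], errors="coerce")
# Nullable integer dtype keeps whole numbers even if some entries are missing.
typed["Quantity"] = pd.to_numeric(typed["Quantity"], errors="coerce").astype("Int64")
typed["Category"] = typed["Category"].astype("category")

print(typed.dtypes)
```

Coercing rather than raising makes bad entries visible as missing values, which the missing-value check can then pick up.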
Removing Extreme Outliers
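The exact rule used to flag extreme values is not stated in this export. As one possible approach, the sketch below drops rows that lie more than `k` interquartile ranges outside the quartiles. The helper `remove_extreme_outliers`, the threshold `k=3.0`, and the toy data are all assumptions for illustration, not the report's method.

```python
import pandas as pd


def remove_extreme_outliers(df, columns, k=3.0):
    """Drop rows where any listed column lies more than k * IQR outside the quartiles."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        # A row survives only if it is inside the fences for every column.
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Toy data: the last row is extreme in both Sales and Profit.
df = pd.DataFrame({
    "Sales": [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 5000.0],
    "Profit": [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, -900.0],
})

cleaned = remove_extreme_outliers(df, ["Sales", "Profit"])
n_removed = len(df) - len(cleaned)
print(n_removed)
```

A larger `k` keeps more of the tail; the conventional box-plot fence uses `k=1.5`, which would remove considerably more rows on skewed sales data.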
Summary of the Statistics
Summary
The summary statistics describe the dataset after 1546 outliers were removed based on extreme values of `Quantity`, `Discount`, `Sales`, and `Profit`.
Trimming the data points that lie exceedingly far from the median leaves a dataset with a narrower spread, so that analysis and modeling are not unduly influenced by extreme values.
Bar Charts for Profitability
The bar chart shows how much each product category contributes to the total profit in the dataset.
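The aggregation behind such a chart can be sketched as below. The column name `Category` and the category labels are assumptions, and the figures are invented; only the idea of summing profit per category comes from the report.

```python
import pandas as pd

# Toy data; category labels and profits are illustrative only.
df = pd.DataFrame({
    "Category": ["Furniture", "Technology", "Office Supplies", "Technology", "Furniture"],
    "Profit": [40.0, 120.0, 15.0, 80.0, -10.0],
})

# Total profit per category, largest contributor first.
profit_by_category = (
    df.groupby("Category")["Profit"].sum().sort_values(ascending=False)
)
print(profit_by_category)
```

With matplotlib installed, `profit_by_category.plot(kind="bar")` would draw the bars from this series.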
Scatter Plot:
Sales vs. Profit:
The scatter plot "Sales vs Profit" visually represents the relationship between sales and profit.
This plot serves as a valuable tool for identifying trends, pinpointing outliers, and informing decisions related to sales strategies and product profitability.
Regression Model:
Linear Regression Model and Random Forest Regressor
Adopted Approaches for Modeling
Feature Engineering:
Data Scaling:
Logarithmic Transformation of “Profit”:
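The details of these steps are not preserved here, so the following is a sketch under stated assumptions. Because a plain logarithm is undefined for negative profit, it uses a signed `log1p` transform as one possible workaround (the helpers `signed_log1p` and `inverse_signed_log1p` are hypothetical, and the report may have handled negatives differently, for example by shifting the values). Feature scaling is shown with scikit-learn's `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def signed_log1p(values):
    """Log-compress magnitudes while keeping the sign, so losses stay negative."""
    return np.sign(values) * np.log1p(np.abs(values))


def inverse_signed_log1p(values):
    """Undo signed_log1p to get predictions back on the original profit scale."""
    return np.sign(values) * np.expm1(np.abs(values))


# Toy profit values, including a loss and a zero.
profit = np.array([-50.0, 0.0, 10.0, 300.0])
profit_log = signed_log1p(profit)

# Toy feature matrix with columns Sales, Quantity, Discount.
X = np.array([
    [100.0, 2, 0.0],
    [250.0, 5, 0.2],
    [40.0, 1, 0.1],
    [900.0, 7, 0.0],
])

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(profit_log)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

In a real pipeline the scaler would be fitted on the training split only and then applied to the test split, to avoid leaking test statistics into training.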
Model Selection
Linear Regression and Random Forest Regressor: These models were chosen for their complementary strengths. Linear Regression is a simple model that offers transparency and fast training, making it a natural starting point. It is best suited to datasets where the relationships between variables are linear. Random Forest, an ensemble of decision trees, can capture non-linear relationships and interactions between features without explicit feature engineering, offering a more powerful, if less interpretable, alternative.
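To make the contrast concrete, the sketch below fits both models with scikit-learn on synthetic data. The data-generating formula, in which discount interacts with sales, is an assumption chosen to illustrate non-linearity; it is not the report's dataset or its results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic features (assumed ranges, for illustration only).
sales = rng.uniform(10, 500, n)
quantity = rng.integers(1, 10, n)
discount = rng.uniform(0.0, 0.5, n)

# Synthetic target: margin shrinks with discount, so sales and discount interact.
profit = sales * (0.3 - discount) + rng.normal(0, 5, n)

X = np.column_stack([sales, quantity, discount])
X_train, X_test, y_train, y_test = train_test_split(
    X, profit, test_size=0.2, random_state=0
)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# score() returns R^2 on the held-out split.
linear_r2 = linear.score(X_test, y_test)
forest_r2 = forest.score(X_test, y_test)
print(f"Linear R^2: {linear_r2:.3f}, Random Forest R^2: {forest_r2:.3f}")
```

On data like this the linear model can only approximate the interaction term, whereas the forest can learn it from the splits. On genuinely linear data the ranking may reverse.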
The evaluation metrics and feature importances for both Linear Regression and Random Forest Regressor models offer insight into the dataset's characteristics and the models' performance:
Linear Regression:
Random Forest Regressor:
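The specific metrics and values reported for each model are not preserved in this export. As a hedged sketch, the code below computes three common regression metrics (R², MAE, RMSE, chosen here as an assumption) and reads out linear coefficients and Random Forest feature importances on synthetic data. The helper `evaluate` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate(model, X, y):
    """Return R^2, MAE and RMSE of a fitted model on the given data."""
    pred = model.predict(X)
    return {
        "r2": r2_score(y, pred),
        "mae": mean_absolute_error(y, pred),
        "rmse": float(np.sqrt(mean_squared_error(y, pred))),
    }


feature_names = ["Sales", "Quantity", "Discount"]
rng = np.random.default_rng(1)

# Synthetic data: profit depends on Sales and Discount; Quantity has no effect.
X = np.column_stack([
    rng.uniform(10, 500, 400),
    rng.integers(1, 10, 400),
    rng.uniform(0.0, 0.5, 400),
])
y = 0.2 * X[:, 0] - 100.0 * X[:, 2] + rng.normal(0, 5, 400)
X_train, X_test, y_train, y_test = X[:300], X[300:], y[:300], y[300:]

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

linear_metrics = evaluate(linear, X_test, y_test)
forest_metrics = evaluate(forest, X_test, y_test)

# Coefficients are per-unit effects; importances are shares that sum to 1.
linear_coefs = dict(zip(feature_names, linear.coef_))
forest_importances = dict(zip(feature_names, forest.feature_importances_))
print(linear_metrics, forest_metrics)
print(linear_coefs, forest_importances)
```

Coefficients fitted on unscaled features are not directly comparable with one another, and impurity-based importances can be distorted by correlated features, so the two rankings should be read with care.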
Recommendations
Based on the analysis performed using Linear Regression and Random Forest Regressor models, as well as the insights from the visualization of sales vs. profit, here are some recommendations that could potentially improve profitability and operational strategies within the dataset context:
Conclusion