# CourseWork 2 - H00444598

### Introduction

This study analyzes predictive models developed to understand the factors that influence profit in the dataset. Exploratory data analysis, feature engineering, and Linear Regression and Random Forest Regressor models reveal how variables such as sales, quantity, and discounts affect profit. Preprocessing steps include transforming the profit variable to handle negative values and standardizing features to improve model performance. The report also visualizes model performance and the importance of the individual factors that contribute to profit prediction. The analysis is intended as a foundation for strategic, data-driven decisions aimed at increasing profitability.

### Importing Data

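A minimal sketch of how the data might be loaded with pandas. The file name is not given in this report, so an in-memory CSV stands in for it; the `Category` column name and all values are illustrative assumptions, while `Sales`, `Quantity`, `Discount`, and `Profit` are the columns named later in the report.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file, whose name is not given in the report.
csv_text = """Category,Sales,Quantity,Discount,Profit
Furniture,261.96,2,0.0,41.91
Office Supplies,14.62,2,0.0,6.87
Technology,907.15,6,0.2,90.72
"""

# With a file on disk this would be pd.read_csv("<path to the dataset>").
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.head())
```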

## Data Cleaning

The data-cleaning process for this analysis comprised four steps: checking for missing values, checking for duplicates, correcting data types, and removing extreme outliers.

These cleaning and preprocessing steps are essential for optimizing the dataset for modeling, helping to enhance model performance and ensure more accurate predictions.

### Checking for Missing Values

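A minimal sketch of a missing-value check with pandas, using a small made-up frame. The values are illustrative, and dropping incomplete rows is only one possible handling, not necessarily the one used in this analysis.

```python
import numpy as np
import pandas as pd

# Illustrative data with a few deliberately missing entries.
df = pd.DataFrame({
    "Sales": [261.96, np.nan, 907.15, 14.62],
    "Quantity": [2, 3, np.nan, 2],
    "Discount": [0.0, 0.2, 0.2, 0.0],
    "Profit": [41.91, 219.58, 90.72, np.nan],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# One possible handling (an assumption): keep only fully populated rows.
df_complete = df.dropna()
print(f"Rows before: {len(df)}, after: {len(df_complete)}")
```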

### Checking for Duplicates

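A duplicate check might look like the sketch below. The frame is made up for illustration, and treating a row as a duplicate only when every column matches is an assumption about how duplicates were defined.

```python
import pandas as pd

# Illustrative data: the second row repeats the first exactly.
df = pd.DataFrame({
    "Sales": [100.0, 100.0, 250.0, 250.0, 75.0],
    "Quantity": [2, 2, 5, 4, 1],
    "Profit": [20.0, 20.0, 60.0, 60.0, -5.0],
})

# duplicated() flags every repeat of an earlier, fully identical row.
n_duplicates = int(df.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# Keep the first occurrence of each row and renumber the index.
df_unique = df.drop_duplicates().reset_index(drop=True)
```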

### Correcting Data Types

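As a sketch of what type correction can involve, the example below converts text columns to dates, categories, and numbers. The `Order Date` and `Category` columns are hypothetical; which columns actually needed conversion is not stated in this report.

```python
import pandas as pd

# Illustrative frame in which every column was read as text.
df = pd.DataFrame({
    "Order Date": ["2017-11-08", "2017-06-12", "2016-10-11"],  # hypothetical column
    "Category": ["Furniture", "Office Supplies", "Technology"],  # hypothetical column
    "Sales": ["261.96", "14.62", "907.15"],
    "Quantity": ["2", "2", "6"],
})

df["Order Date"] = pd.to_datetime(df["Order Date"])              # text -> datetime
df["Category"] = df["Category"].astype("category")               # text -> categorical
df["Sales"] = pd.to_numeric(df["Sales"])                         # text -> float
df["Quantity"] = pd.to_numeric(df["Quantity"]).astype("int64")   # text -> integer

print(df.dtypes)
```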

### Removing Extreme Outliers

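The exact rule used to flag extreme values is not shown here. The sketch below assumes an interquartile-range (IQR) fence with a wide multiplier, so only points very far from the middle of the distribution are dropped; the helper `remove_outliers_iqr`, the multiplier `k=3.0`, and the data are all illustrative.

```python
import pandas as pd


def remove_outliers_iqr(frame, columns, k=3.0):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]


# Illustrative data with one extreme sale and one extreme loss.
df = pd.DataFrame({
    "Sales": [20, 35, 50, 45, 30, 40, 25, 9000],
    "Quantity": [2, 3, 4, 3, 2, 5, 3, 4],
    "Discount": [0.0, 0.1, 0.2, 0.0, 0.2, 0.1, 0.0, 0.2],
    "Profit": [5, 8, 12, 10, 6, 9, -4000, 11],
})

cleaned = remove_outliers_iqr(df, ["Sales", "Quantity", "Discount", "Profit"])
n_removed = len(df) - len(cleaned)
print(f"Removed {n_removed} rows; {len(cleaned)} remain")
```

A larger `k` removes fewer rows; the conventional value of 1.5 would be stricter than the "extreme" filtering described in this section.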

### Summary of the Statistics


### Summary

The summary statistics of the dataset after removing 1546 outliers based on extreme conditions for `Quantity`, `Discount`, `Sales`, and `Profit` provide insightful details:

The cleaned summary statistics suggest a better-behaved dataset: trimming the data points that lie exceedingly far from the median allows for more accurate analysis and modeling without the undue influence of extreme values.

### Bar Charts for Profitability

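A bar chart of total profit per product category could be produced roughly as follows. The `Category` column name and the values are assumptions; the non-interactive `Agg` backend is used only so the sketch runs without a display and is not needed inside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; unnecessary in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data; "Category" is an assumed column name.
df = pd.DataFrame({
    "Category": ["Furniture", "Technology", "Office Supplies", "Technology", "Furniture"],
    "Profit": [40.0, 90.0, 7.0, 60.0, -15.0],
})

# Total profit per category, largest first.
profit_by_category = (
    df.groupby("Category")["Profit"].sum().sort_values(ascending=False)
)

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(profit_by_category.index.tolist(), profit_by_category.values)
ax.set_xlabel("Category")
ax.set_ylabel("Total profit")
ax.set_title("Total profit by category")
fig.tight_layout()
```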

The generated graph shows how much each product category contributes to the total profit within the dataset.

### Scatter Plot


Sales vs. Profit:

The scatter plot "Sales vs Profit" visually represents the relationship between sales and profit:

This plot serves as a valuable tool for identifying trends, pinpointing outliers, and informing decisions related to sales strategies and product profitability.

### Regression Model

### Linear Regression Model and Random Forest Regressor


### Adopted Approaches for Modeling

Feature Engineering:

Data Scaling:

A logarithmic transformation of “Profit”:
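The scaling and transformation steps above can be sketched as follows. The report states only that the profit variable was modified to handle negative values and that features were standardized; shifting `Profit` so its minimum becomes zero before applying `log1p` is one common way to do this and is an assumption here, as are the data and the `Profit_log` column name.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative data containing negative profits.
df = pd.DataFrame({
    "Sales": [20.0, 35.0, 50.0, 400.0, 120.0],
    "Quantity": [2, 3, 4, 7, 5],
    "Discount": [0.0, 0.1, 0.2, 0.5, 0.0],
    "Profit": [5.0, 8.0, -12.0, -150.0, 40.0],
})

# Assumed approach: shift Profit so its minimum is zero, then take log(1 + x).
min_profit = df["Profit"].min()
shift = -min_profit if min_profit < 0 else 0.0
df["Profit_log"] = np.log1p(df["Profit"] + shift)

# Standardize the features to zero mean and unit variance.
features = ["Sales", "Quantity", "Discount"]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features])

# Predictions made on the log scale can be mapped back to the original units.
profit_recovered = np.expm1(df["Profit_log"]) - shift
```

In a train/test workflow the scaler and the shift would normally be fitted on the training split only and then applied to the test split.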

### Model Selection

Linear Regression and Random Forest Regressor: These models are chosen for their complementary strengths. Linear Regression, a simple model, offers transparency and fast training times, making it a natural starting point. It's best suited for datasets where relationships between variables are linear. On the other hand, Random Forest, an ensemble of decision trees, can capture non-linear relationships and interactions between features without explicit feature engineering, offering a more powerful, if less interpretable, alternative.
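To illustrate the comparison, the sketch below fits both model types on synthetic data that contains a sales-discount interaction. The data-generating formula, the hyperparameters, and the 80/20 split are assumptions for illustration and are not taken from this analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): profit depends on sales, reduced by discounts.
rng = np.random.default_rng(0)
n = 500
sales = rng.uniform(10, 500, n)
quantity = rng.integers(1, 10, n)
discount = rng.uniform(0, 0.5, n)
profit = 0.3 * sales * (1 - 2.5 * discount) + rng.normal(0, 5, n)

X = np.column_stack([sales, quantity, discount])
X_train, X_test, y_train, y_test = train_test_split(
    X, profit, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

# Fit each model and record its error and R^2 on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
    print(f"{name}: MSE={scores[name]['mse']:.2f}, R2={scores[name]['r2']:.3f}")
```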

The evaluation metrics and feature importances for both Linear Regression and Random Forest Regressor models offer insight into the dataset's characteristics and the models' performance:

Linear Regression:

Random Forest Regressor:
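The two models expose feature influence differently: a linear model through signed coefficients, a random forest through non-negative importances that sum to one. The sketch below reads both from models fitted on synthetic data; the data-generating formula and hyperparameters are assumptions, so the numbers it prints say nothing about this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): profit rises with sales and falls with discount.
rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    "Sales": rng.uniform(10, 500, n),
    "Quantity": rng.integers(1, 10, n).astype(float),
    "Discount": rng.uniform(0, 0.5, n),
})
y = 0.25 * X["Sales"] - 120 * X["Discount"] + rng.normal(0, 5, n)

# Standardizing first makes the linear coefficients comparable across features.
X_scaled = StandardScaler().fit_transform(X)

linear = LinearRegression().fit(X_scaled, y)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_scaled, y)

summary = pd.DataFrame(
    {
        "linear_coefficient": linear.coef_,                 # signed effect per std. dev.
        "forest_importance": forest.feature_importances_,   # impurity-based, sums to 1
    },
    index=X.columns,
)
print(summary)
```

The sign of a coefficient indicates direction, which forest importances do not convey; the two measures are therefore complementary rather than interchangeable.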

### Recommendations

Based on the analysis performed with the Linear Regression and Random Forest Regressor models, and on the insights from the sales vs. profit visualization, the following recommendations could improve profitability and operational strategy:

### Conclusion
