Coursework 2
Name: Mohamad Zgheib (H00456590)
Date: 18-03-2024
Introduction
This report analyzes a dataset of a company's sales records. It walks through several steps to visualize and interpret the data at hand: it draws out insights, studies relationships between variables, and applies modeling to explain the drivers of profit and to recommend strategies for improving the company's performance.
Importing the Data
Cleaning the Data
No missing values are found in this dataset.
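A minimal sketch of how such a check might be run with pandas. The frame `toy` and its values are invented for illustration (with two gaps planted on purpose so the check has something to find); they are not the report's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sales data; values are invented.
toy = pd.DataFrame({
    "Sales": [261.96, 731.94, np.nan, 957.58],
    "Quantity": [2, 3, 2, 5],
    "Profit": [41.91, 219.58, 6.87, np.nan],
})

# Count missing values per column; all zeros would mean nothing to impute or drop.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isna().any().any())
print("Any missing values:", has_missing)
```

On the real dataset, every per-column count would be zero, which matches the statement above.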
A few outliers can be seen:
1. The maximum value of the quantity column is 10000. This is significantly greater than the 75th percentile value, which is 5.
2. The maximum value of the profit column is 8399.976. This is much greater than the 75th percentile value, which is 29.364.
These are just a few examples of the several outliers found.
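The comparison between a column's maximum and its 75th percentile can be read off a summary table. The sketch below assumes pandas and uses a small invented frame; only the two maxima echo figures quoted above, and the remaining values are placeholders.

```python
import pandas as pd

# Hypothetical toy data; only the two maxima mirror values quoted in the text.
toy = pd.DataFrame({
    "Quantity": [2, 3, 5, 4, 1, 10000],
    "Profit": [41.9, -383.0, 6.9, 29.4, 14.2, 8399.976],
})

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
summary = toy.describe()
print(summary.loc[["75%", "max"]])

# A maximum that is many times the upper quartile hints at extreme values.
ratio = summary.loc["max"] / summary.loc["75%"]
print(ratio)
```

A large ratio is only a prompt to look closer; it does not by itself prove that a value is erroneous.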
We will use z-scores to remove the outliers: values that lie more than 3 standard deviations away from the mean will be eliminated.
In the new 'data_no_outliers' data frame, rows containing values more than 3 standard deviations away from the mean have been removed. The dataset without outliers has 9829 rows.
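One way the 3-standard-deviation filter could be written is sketched below with pandas and NumPy. The data are synthetic (a fixed random seed plus one planted extreme value), and the names `toy`, `numeric_cols`, and `toy_no_outliers` are illustrative; the report's actual cell may have differed, for example in which columns were screened.

```python
import numpy as np
import pandas as pd

# Synthetic data with one planted extreme value; not the report's dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Quantity": rng.integers(1, 10, size=200).astype(float),
    "Profit": rng.normal(30.0, 20.0, size=200),
})
toy.loc[0, "Quantity"] = 10000.0  # planted outlier

numeric_cols = ["Quantity", "Profit"]

# z-score: distance from the column mean in units of the column's standard deviation.
z_scores = (toy[numeric_cols] - toy[numeric_cols].mean()) / toy[numeric_cols].std(ddof=0)

# Keep a row only if every screened column is within 3 standard deviations.
keep = (z_scores.abs() <= 3).all(axis=1)
toy_no_outliers = toy[keep]

print(len(toy), "rows before,", len(toy_no_outliers), "rows after")
```

Because the mean and standard deviation are themselves pulled by extreme values, a single pass like this can leave some milder outliers in place.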
Calculating the Summary Statistics
Plotting the Data
The graphical analysis reveals:
Ship Mode: The dominant choice for shipping is Standard Class, with Same Day shipping being the least favored. This preference can be due to various reasons, including cost considerations. Standard shipping often presents a more cost-effective option compared to the expenses associated with the speed and logistics required for delivering orders on the same day.
Segment: Orders are mostly placed by the Consumer segment. Corporate and Home Office segments come in second and third place. This trend might be explained by the fact that consumers, who make up a large segment of the market, typically purchase items in smaller quantities but with greater frequency. In contrast, home offices and businesses are more inclined to place bulk orders, less often.
Category: The category with the most orders is Office Supplies, followed by Furniture and Technology. The high demand for office supplies is likely due to their quick consumption rate, since they include pens, paper, and similar consumables. Furniture and technology items, on the other hand, are purchased less often because they are durable goods.
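The counts behind bar charts of this kind can be tabulated directly. The sketch below assumes pandas and a handful of invented rows; the column names follow the labels used above, but the rows themselves are placeholders, not the report's data.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "Ship Mode": ["Standard Class", "Standard Class", "Second Class",
                  "Standard Class", "Same Day", "First Class"],
    "Segment": ["Consumer", "Consumer", "Corporate",
                "Consumer", "Home Office", "Corporate"],
    "Category": ["Office Supplies", "Furniture", "Office Supplies",
                 "Technology", "Office Supplies", "Furniture"],
})

# value_counts() returns the number of orders per level, sorted from most to least frequent.
counts = {col: toy[col].value_counts() for col in toy.columns}
for col, col_counts in counts.items():
    print(col_counts, end="\n\n")
```

Each resulting series can be passed to a plotting call to draw the corresponding bar chart.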
Following this, we'll proceed to create scatter plots of 'Sales' vs 'Profit' as well as 'Quantity' vs 'Discount'.
Sales vs. Profit: A positive but weak correlation is found. Transactions with large sales do not always yield large profits: in several cases, profits remain low or even negative despite high sales. This could arise from varying costs of goods sold or from substantial discounts.
Quantity vs. Discount: No correlation is seen between the two variables. This implies that the volume of products sold doesn't directly impact the level of discount offered. Moreover, it could be said that purchasing in large amounts does not necessarily guarantee a larger discount for the customer.
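A visual reading of scatter plots like these can be backed by a correlation coefficient. The sketch below computes Pearson's r with pandas on a few invented values, chosen so that one pair is weakly positive and the other is uncorrelated; the numbers are not taken from the report.

```python
import pandas as pd

# Hypothetical values chosen to illustrate a weak and a zero correlation.
toy = pd.DataFrame({
    "Sales": [10, 20, 30, 40, 50],
    "Profit": [2, -5, 8, -1, 6],
    "Quantity": [1, 2, 3, 4, 5],
    "Discount": [0.2, 0.0, 0.4, 0.0, 0.2],
})

# Pearson's r lies in [-1, 1]; values near 0 indicate no linear relationship.
r_sales_profit = toy["Sales"].corr(toy["Profit"])
r_quantity_discount = toy["Quantity"].corr(toy["Discount"])

print("Sales vs Profit:", round(r_sales_profit, 3))
print("Quantity vs Discount:", round(r_quantity_discount, 3))
```

Pearson's r captures only linear association, so a value near zero does not rule out a non-linear relationship.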
Modeling Strategy
A decision tree model is an effective approach for identifying the primary influences on profit. It can capture non-linear associations, which matters here because the relationship between 'Sales', 'Quantity', 'Discount', and 'Profit' may not be linear. Moreover, decision trees typically require little preprocessing (for example, no feature scaling) and make no assumptions about how the data are distributed. Hence, this modeling strategy suits this scenario well.
The Decision Tree Regressor model shows that 'Sales' has the strongest influence on 'Profit', with a feature importance near 0.54. 'Discount' comes next, with an importance close to 0.39, whereas 'Quantity' has the smallest influence, with an importance of 0.07.
This suggests that generating more profit might be best achieved through enhancing sales and applying strategic discounts. This is likely to be more effective than simply increasing the number of items sold, as quantity has the least effect on profit.
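An importance ranking of the kind reported above could be obtained along the following lines with scikit-learn. This is a sketch under stated assumptions: the data are synthetic, the formula generating 'Profit' is invented, and `max_depth=5` is an arbitrary choice, so the printed importances will not match the figures in this report.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: Profit depends on Sales and Discount but not on Quantity (an assumption).
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "Sales": rng.uniform(10, 1000, n),
    "Quantity": rng.integers(1, 10, n),
    "Discount": rng.choice([0.0, 0.2, 0.4, 0.8], n),
})
toy["Profit"] = toy["Sales"] * (0.3 - toy["Discount"]) + rng.normal(0, 5, n)

features = ["Sales", "Quantity", "Discount"]

# Fit a regression tree; no feature scaling is needed for tree-based models.
model = DecisionTreeRegressor(max_depth=5, random_state=0)
model.fit(toy[features], toy["Profit"])

# Impurity-based importances are normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```

These importances measure how much each feature contributes to the tree's splits, not the direction of its effect, so they are best read alongside the scatter plots.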
In conclusion, based on these findings, it is recommended that the company prioritize sales improvement and discount strategies to become more profitable.