INTRODUCTION
The document explores and models a sales and profit margin dataset through an extensive data analysis process that makes use of Python libraries like pandas, numpy, and sklearn. The first steps in the analysis, are importing and cleansing the data, handling outliers, looking for duplicates, and getting the dataset ready for modelling. After that, the data is visualized using bar charts and scatter plots to obtain a deeper understanding of the distribution of profits and the correlations between variables such as profit and sales. After training a Linear Regression model to forecast profit, its performance is assessed using metrics such as Mean Absolute Error and Root Mean Squared Error. Finally, the importance of features affecting profit is determined through the coefficients of the model, highlighting factors like discounts and product categories that significantly impact profit margins.
IMPORTING DATA
Run to view results
DATA CLEANING
Run to view results
The dataset seems to not contain any null values. Now, let's check for duplicates and also we will handle any errors that might be in the dataset. After that, we'll deal with potential outliers in our data. These may be valid data points, but they're worth investigating further to be sure. Outliers in this dataset could be seen as large orders or amounts of returns that might distort the data. One way to handle these outliers could be to remove data that falls above a certain standard deviation threshold. Looking at outliers in the 'Sales', 'Quantity', 'Discount', and 'Profit' columns could be particularly helpful in this dataset since these all directly implicate the amount of profit. We'll consider data as outliers if they fall outside of 3 standard deviations below or above the mean (Hoaglin, 1986). This is a common method of defining outliers statistically which makes it a sensible selection for this
Run to view results
Run to view results
Run to view results
OUTLIERS DETECTION
The presence of negative profits even when the product was not returned might seem unusual, but it could be due to costs surpassing the revenue for particular transactions, possibly from discounts or the cost price being higher than the sale price. Hence, even these entries could be valid and we would not consider them as errors in the dataset. Next, let's go ahead and handle the outliers in the data. We remove outliers in Profit, Sales and Quantity which are beyond 3 standard deviations from the mean because they might grossly affect our internal models, but these cases could be considered as unusual high profit/loss cases which might be of specific interest and should be analyzed separately. We'll replace these outlier instances with corresponding upper/lower bounds (i.e., mean ± 3*
Run to view results
Now, we have cleaned our data. We have dealt with any potential errors in the dataset and also removed the outliers which were beyond 3 standard deviations from the mean. Let's now move on to explore our data further. Some useful statistics to consider include measures of central tendency (like mean and median) and measures of spread (like variance and standard deviation). Since 'Profit' is the outcome, we'll get the summary statistics of it, along with 'Sales', 'Quantity', and 'Discount' as these directly relate to
SUMMARY STATISTICS
Run to view results
The summary statistics provide good insights about data distribution. For instance, 'Profit' has a mean value of around 26 USD, and profits can range from -653.28 USD to around 726.56 USD. Similarly, 'Sales' varies from 0.44 to 2079.40, and the 'Quantity' sold ranges from 1 to 14 units per transaction. The 'Discount' given ranges from 0 to 1.3 (exceeding one is likely incorrect, as the discount should be within 0 to 1). Next, let's visualize our data which can provide us with another layer of understanding. We will have one categorical plot (bar chart) and one continuous plot (scatterplot). For example, the bar plot could represent the total profit per 'Segment' category and the scatterplot could help us understand the relationship between 'Sales' and 'Profit'. Let's proceed to building these plots.
DATA PLOTTING
Run to view results
The first plot, a bar chart, demonstrates that the 'Consumer' segment is the most profitable barring any major outliers that could potentially skew the total profit in its favor. The scatter plot between 'Sales' and 'Profit' doesn't reflect a clear correlation between the two variables. We can observe instances where despite higher sales, the profit is low or even negative, and vice versa. This could be due to factors such as discount, quantity of products sold, segment of product, and more. Previous studies have shown that other factors including revenue margin as well as operational efficiency other than sales volume can influence profit margin (Jones et al, 2020). Next, we will proceed to the modeling task to understand which features contribute most significantly to profit. As 'Profit' is a continuous variable, we will use a regression model to predict it, specifically a simple Linear Regression because of its simplicity and interpretability. Once the model is built, we will consider the feature importance which will help us understand which variables are contributing the most to the profit. Before we proceed with model training, we need to prepare our data. Firstly, we'll create dummy variables for the categorical fields using one-hot encoding and divide our data into features (X) and target (y=i.e, 'Profit') datasets. Then, we'll split this further into training and testing datasets such that the model can be tested (James et al, 2013).
Run to view results
LINEAR REGRESSION
Now we've prepared our data. We've separated the features (X) from the target variable (Profit) and also divided our datasets into training and testing sets. Next, we need to train a Linear Regression model on our training set. After we've trained our model, we can use it to make predictions using our testing set and calculate error metrics, such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE), to evaluate the performance of our model. Let's proceed to the model (James et al., 2013).
Run to view results
The model performance metrics tell us that on average, our predictions were about $35.33 off from the actual profit values (since we're predicting profit, this is in USD). The root mean squared error (RMSE) being larger than the mean absolute error (MAE) indicates that we have a number of bad predictions that are leading to a higher cumulative square error. But more importantly, let's look at the coefficients of our linear model to ascertain which features in our data most significantly affect 'Profit'. The coefficients can help us understand the relationship that each feature has with our target variable. A positive coefficient indicates that as the value of that feature increases, so does the target and vice versa for negative coefficients. The magnitude of the coefficient indicates the size of the effect that feature has on our target. Let's calculate these
Run to view results
INTERPRETATION OF LINEAR REGRESSION
The output shows the coefficients sorted in descending order of their absolute values. The larger the absolute value of the coefficient, the higher the importance and impact of that feature on the profit margin. The 'Discount' feature has the highest negative effect meaning higher discounts could be leading to lower profit margins. This makes sense as high discounts reduce the final sale price hence reducing profit margins. The 'Sub-Category Copiers' has a significant positive impact on profit, indicating sales of Copiers tend to bring in more profit. On the other hand, 'Sub-Category Tables' have a negative influence on profit. Sales is not among the prominent features, which could be because the correlation between 'Sales' and 'Profit' may not be that straightforward and can be influenced by other factors such as 'Discount', 'Sub-Category', 'Category no' etc. Therefore, while making strategies to increase profits, factors like offering less discounts especially in the key Sub-Categories such as Copiers, Tables, etc. could be taken into consideration. Efforts could also be made to increase sales in profitable sub-categories. Please note, linear regression assumes a linear relationship between features and the target and also that the features are independent. If these assumptions are violated, it could lead to incorrect interpretations. In reality, relationships could be much more complex and nonlinear in nature, hence more sophisticated models might be needed for a detailed and realistic prediction.
CONCLUSION
After studying the data, it became obvious that despite the fact that the group 'Customer' was responsible for gross margin the most, there were certain things like the quantity of products sold, discounts offered, and the kind of division that played a role in sales, and influenced the resulting profit. Thus, we formulated Linear Regression model to help us predict profit by considering each factor separately. Interestingly, on average the model resulted in deviating slightly by $35.33 when gathering profit.
Also noteworthy; the ratio of discounts made to profit expectation was the most influential in terms of predicting profit. One this is for sure- the more discounts you give out, the more your bottom line suffers exactly. This clearly demonstrates how monitoring deep discount levels is a major concern, which we should always keep in mind if we want to keep on making money.
Here eventually, is where our data study provided us with some facts that have been strong. It demonstrated not only the value of getting the data right but also of dealing with any 'wild outliers', graphing to see things clearly, chucking in some fancy math like modeling that enhances our interpretation, and eventually which allows us to run an efficient business.
RECOMMENDATION
Based on the analysis conducted in the report exploring a sales and profit margin dataset, several recommendations can be made to enhance business performance and profitability:
Data Cleaning and Outliers Handling: Accentuate the investigation of negative profits without any turnover attempt to know the driving reason for such. Data can be scrutinized by isolating sales, quantity, discount, and profit periodically to reflect data authenticity and trustworthiness.
Feature Importance: Set the attention on things that matter in terms of the net profit margins on discounts and categories of products which are most significant to the model with linear regression. Visits those high profit or loss scenarios individually, you will have somewhat of clues or trends that may cause the cases and which leads to opportunities in the scenario.
Data Visualization: Make the use of charts including the bar charts, scatter plots to determine how profit is distributed especially in labor and machinery, how the variables are correlated and which factors affect the profit. Try more specifically focused visualizations to explore hidden patterns and relationships within data.
Modeling and Prediction: Additionally, you should get your Linear Regression model perfect by adding more influence factors or trying complex models for better results. Assess model performance routinely using metrics such as Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error, in order to determine the model accuracy.
Business Strategy: How can we find the strategy that is based on feature importance identification in order to raise the cap of the profit? Apply dynamic pricing that takes into account product makes and issuing of discounts in order to raise the profitability.
Continuous Improvement: Set up a data analysis and refinement cycle along with an approach to updating the plan through modeling whenever some unfavorable1 market conditions occur. Develop ongoing employee training by involving data analysts in learning new data manipulation techniques in a reporting process.
Implementing the above strategies will allow companies to gain in using data-driven insights to make smart decisions/optimize operations and make the revenue increase/profitability.