Profit Optimization through Data Analytics: Superstore Case Study
C11BD Coursework 2
By Aparna Yellamraju
H00454947
Introduction:
Aim The primary aim of this analysis is to understand the factors that contribute to increasing the profitability of the Superstore company.
Objectives: First, the provided dataset will be imported and data cleaning will be performed to check if the dataset contains errors and outliers. Data cleaning involves identifying and removing any erroneous data entries. Additionally, the presence of outliers, which are extreme values that deviate significantly from the rest of the data, will be addressed.
Once the data is cleaned, summary statistics will be calculated to gain a comprehensive understanding of the dataset's characteristics, including measures of central tendency, dispersion, and the distribution of variables.
To facilitate a visual exploration of the data, plots and charts will be created such as a bar chart and scatter plot are created to visualize and analyze the variables such as profit, subcategories and sales.
The culmination of the analysis will be the implementation of a modeling strategy to identify the factors that contribute most significantly to the Superstore's profitability. Modelling techniques like regression analysis specifically linear and logistic regression are employed.
Finally, and most crucially, the results of the chosen model will be interpreted, and detailed comments will be provided to explain the findings to the client. These comments will highlight the key factors influencing the Superstore's profitability, enabling them to make informed decisions and implement strategies to increase their profits based on the analysis.
Data Import:
Importing data from the given dataset - Superstore.csv and loading it to dataset and then displaying it into rows and columns for further analysis.
Missing value analysis :
The code given below will assess for any missing values in the given dataset "Superstore.csv"
The dataset appears to be clean and well-structured, with no missing values and appropriate data types assigned to each column. These initial preprocessing steps lay a solid foundation for further analysis and modeling. However, additional steps such as outlier detection and data entry error checks may be necessary to ensure data quality and reliability for advanced analytics tasks.
Detecting the outliers:
Outliers refer to data points that significantly differ from other observations in a dataset. Identifying outliers is crucial in data analysis because they can skew statistical analyses and distort interpretations. Therefore, detecting and appropriately handling outliers is essential for ensuring that there are no inaccurate summaries of the data.
Analysis: Sales Outliers: The high number of sales outliers (1167) suggests significant variability in sales data, indicating potential anomalies or extreme values affecting sales performance. Profit Outliers: With 1881 profit outliers, there are substantial deviations in profit margins, highlighting areas where profitability may be significantly impacted by unusual data points. Quantity Outliers: The presence of 173 quantity outliers indicates instances where order quantities deviate significantly from the norm, potentially affecting inventory management or sales patterns. Discount Outliers: The detection of 858 discount outliers suggests irregularities in discount rates applied, which could impact revenue generation or pricing strategies.
Strategy for removing the extreme outliers
The Interquartile Range (IQR) method is used primarily in statistics and data analysis for assessing the spread or variability of a dataset. It's particularly useful when dealing with skewed data or datasets containing outliers.
The IQR method involves dividing the data into quartiles (Q1, Q2, Q3) and calculating the IQR as the difference between Q3 and Q1. It defines a "fence" outside of Q1 and Q3, where any values falling outside this fence are considered outliers. Outliers are identified as data points that lie below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
Implementation:
To apply the IQR method for outlier detection and removal: 1.Find Q1 (first quartile) and Q3 (third quartile). 2.Calculate the IQR as IQR = Q3 - Q1. 3.Define the normal data range with lower limit as Q1 - 1.5 * IQR and upper limit as Q3 + 1.5 * IQR. Any data point outside this range is considered an outlier and should be removed for further analysis.
Removing extreme outliers:
Analysis:
The removal of extreme outliers using the modified IQR method has significantly reduced the dataset size from 9994 to 8451 records. This indicates that there were 1543 data points identified as extreme outliers and subsequently removed from the dataset. Removing extreme outliers can help improve the quality of data analysis by reducing the impact of skewed or erroneous data points on statistical measures and model performance.
Calculating summary statistics
Summary statistics offer a snapshot of essential information about a dataset, and includes measures like the mean, median, mode, standard deviation, range, quartiles, and other statistical parameters that summarize key aspects of the data.
Analysis: The dataset contains 8451 records after outlier removal. Sales, quantity, discount, and profit exhibit variability in their values. Central Tendency: The mean 'Sales' value is approximately $90.99, with a minimum of $0.44 and a maximum of $783.96. The average 'Quantity' sold is around 5.93 units per transaction. The mean 'Profit' per transaction is $12.49, with values ranging from negative to positive. Variability: 'Sales' and 'Profit' show considerable variability as indicated by their standard deviations. The range of 'Quantity' sold is wide, with a maximum value of 10000 units. Proportion of Returns: Approximately 8.14% of items were returned ('True'), while 91.86% were not returned ('False'). High sales values and negative profit instances must be analyzed to identify areas for improvement to understand contributing factors.
Plotting the Graphs:
Analysis:
Boxplots are useful for visualizing the distribution and identifying potential outliers within each variable.
Quantity:
The boxplot for 'Discount' displays the distribution of discount rates applied to sales transactions. Similar to 'Quantity', the boxplot shows the median, interquartile range, and potential outliers. Outliers in the 'Discount' variable might suggest unusual discounting practices or occasional extreme discount rates applied to certain transactions. Sales Distribution:
The boxplot for 'Sales' visualizes the distribution of sales amounts. Analysis of this plot can reveal the spread of sales values, central tendency, and presence of outliers. Outliers in 'Sales' could indicate exceptionally high-value transactions or unusually low-value sales. Profit Distribution:
The boxplot for 'Profit' represents the distribution of profit amounts. Similar to 'Sales Distribution', it shows the central tendency, spread, and potential outliers in profit values. Outliers in 'Profit' might highlight transactions with exceptionally high or low profit margins.
Further analysis, such as examining the relationship between these variables and other features in the dataset, could provide a more comprehensive understanding of the factors influencing profitability and sales performance.
Logistic Regression
Analysis:
Binary Target Variable Creation:
The code creates a binary target variable named 'Profitable', where the value is 1 if the profit is greater than 0 and 0 otherwise. This binary classification task aims to predict whether a sales transaction results in profitability. Feature Selection:
Features selected for the model are 'Sales', 'Quantity', and 'Discount'. These features are used to predict the target variable 'Profitable'. Data Splitting:
The dataset is split into training and testing sets with a 80-20 ratio. This ensures that the model is trained on a subset of the data and evaluated on unseen data to assess its generalization performance. Logistic Regression Model:
Logistic Regression is chosen as the classification algorithm for this task. It's a commonly used algorithm for binary classification tasks, and it's suitable for predicting probabilities associated with class membership. Model Fitting:
The logistic regression model is fitted to the training data using the fit() function. This process involves learning the coefficients of the logistic regression equation that best fit the training data. Predictions:
The trained model is used to make predictions on the test set using the predict() function. Additionally, probabilities of belonging to the positive class ('Profitable') are predicted using predict_proba(). Evaluation Metrics:
The accuracy of the model on the test set is calculated using the score() method, which compares the predicted labels with the actual labels. Confusion matrix, classification report, and ROC AUC score are computed to provide a more detailed evaluation of the model's performance. These metrics assess the model's ability to correctly classify instances into true positives, true negatives, false positives, and false negatives. Output Analysis:
The achieved accuracy on the test set is approximately 92.96%, indicating that the model performs well in predicting the profitability of sales transactions. The confusion matrix shows the distribution of true positive, false positive, true negative, and false negative predictions. The classification report provides precision, recall, and F1-score for both classes (profitable and non-profitable). The ROC AUC score of approximately 0.948 suggests that the model has good discriminative ability in distinguishing between profitable and non-profitable sales transactions. Overall, the analysis indicates that the logistic regression model trained on 'Sales', 'Quantity', and 'Discount' features performs well in predicting the profitability of sales transactions, achieving high accuracy and good discriminative ability.
Scatter Plot
Based on the provided scatter plot and the code, the relationship between the product sub-categories and profit in the Superstore dataset can be analyzed.
The scatter plot displays the profit (y-axis) for each product sub-category (x-axis). Each data point represents a specific sub-category and its corresponding profit value.
Observations from the scatter plot:
Profit Distribution: The scatter plot shows a wide range of profit values, with both positive and negative profits for different sub-categories. This indicates that some sub-categories like chairs, copiers, bookcases and phones are profitable, while others such as envelopes, appliances and art are potentially loss-making for the company. Outliers: There are several sub-categories such as chairs and copiers with extremely high profit values, which appear as outliers in the plot. These could be particularly successful or high-margin product sub-categories that contribute significantly to the company's overall profitability. Negative Profits: Several sub-categories like envelopes, appliances and art have negative profit values, suggesting that these product lines may be underperforming or incurring losses for the company. These sub-categories could be candidates for further analysis and potential optimization or discontinuation. Clustering: While the data points are scattered, there appears to be some clustering or grouping of sub-categories around certain profit ranges. For example: Sub-categories like "Machines", "Binders", and "Storage" seem to cluster around a profit range of approximately $0 to $1,000. Sub-categories like "Fasteners", "Supplies", and "Labels" cluster around a profit range of approximately $0 to $500. Sub-categories like "Accessories" and "Tables" cluster around a profit range of approximately $0 to $1,500. This could indicate similarities in profitability among related product sub-categories.
Analysis:
Overall Trend: The scatter plot reveals a positive correlation between sales and profit. As sales increase, profits generally tend to increase as well, although there is significant variation in the data. Clustering: There is a dense cluster of data points in the lower sales range, forming a triangular shape. This suggests that a large number of transactions or product lines have relatively low sales volumes but varying levels of profitability, including some with negative profits. Spread at Higher Sales: As sales increase, the data points become more spread out vertically, indicating a wider range of profitability for higher sales levels. This could be due to factors such as differences in cost structures, pricing strategies, or product mixes. High Sales Outliers: There are a few outliers with extremely high sales values. Some of these outliers have very high profits, while others have negative profits. These outliers may represent unique or exceptional transactions that merit further investigation to understand the factors contributing to their profitability or losses. Negative Profits: The scatter plot shows several data points with negative profit values, indicating that some sales transactions resulted in losses rather than profits. These could be due to factors such as high costs, pricing issues, or other operational inefficiencies. Overall Variability: While there is a positive correlation between sales and profit, the scatter plot also reveals a considerable amount of variability, especially at higher sales levels. This suggests that other factors beyond just sales volume may be influencing profitability, such as product mix, marketing strategies, or operational efficiencies.
Analysis: The average profit is calculated for each sub-category by grouping the data on the "Sub-Category" column and taking the mean of the "Profit" column.
From the bar graph, following analysis can be made:
Sub-category ranking: The sub-categories are sorted in ascending order of average profit, with "Tables" having the lowest average profit and "Accessories" having the highest. Significant profit difference: There is a substantial difference in average profit between the top sub-category (Accessories) and the rest. Accessories stands out with an average profit significantly higher than other sub-categories. Negative or low profits: Several sub-categories, such as Tables, Bookcases, Supplies, and Fasteners, have very low or potentially negative average profits, as indicated by the shorter bars close to the x-axis. Moderate profits: Sub-categories like Art, Furniture, Lamps, Binders, Paper, and Copying have moderate average profits, as shown by the bars of medium height. Clustering: There appears to be a clustering or grouping of sub-categories with similar average profit levels, such as the cluster around moderate profits and the cluster around low profits.
This analysis can provide valuable insights into the profitability and performance of different product or service categories within the business. The sub-categories with high average profits (like Accessories) may be focus areas for growth and investment, while those with low or negative average profits may require further examination to identify potential issues or opportunities for improvement.
Modelling Strategy - Linear Regression
Analysis:
Based on the bar plot showing Quantity vs. Profit, it appears that there is a positive relationship between quantity and profit. As the quantity increases, the profit also tends to increase, though there is considerable variation within each quantity level.
A few key observations from the bar plot:
Extremely high profits: There is an extremely high profit value associated with a quantity of 10000, which stands out as an apparent outlier. Highest average profits: The highest average profits seem to be associated with quantities around 10-14, as indicated by the tall bars in that range. Low quantities, low profits: For the lower quantity values (1-9), the profits are generally lower, with some negative profit values as well. Variation within quantities: For most quantity levels, there is a wide range of profit values, suggesting that other factors beyond just quantity influence profitability. Potential outliers: In addition to the extreme value at quantity 10000, there may be a few other potential outliers or unusual data points at various quantity levels, as evidenced by the exceptionally high or low bars compared to the rest within that quantity. Overall, the bar plot suggests that higher quantities tend to be associated with higher profits, but there is significant variability, and other factors likely play a role in determining profitability.
Outputs: Based on the data analysis methods adopted, the following points can be noted as the output of this analysis.
Data Cleaning and Outlier Removal: The dataset was cleaned by removing extreme outliers using a modified Interquartile Range (IQR) method. 1,543 records were identified as extreme outliers and removed from the dataset, reducing the total number of records from 9,994 to 8,451. Summary Statistics: The cleaned dataset has 8,451 records with varying sales, quantity, discount, and profit values. Central tendency measures like mean and median were calculated for each variable. Approximately 8.14% of items were returned, while 91.86% were not returned. Visualizations: Boxplots were used to visualize the distribution and identify potential outliers for variables like quantity, discount, sales, and profit. A scatter plot of sales vs. profit showed a positive correlation, but with significant variation, especially at higher sales levels. A bar chart of average profit by sub-category revealed that "Accessories" had the highest average profit, while sub-categories like "Tables" and "Bookcases" had low or negative average profits. A bar plot of quantity vs. profit suggested a positive relationship, but with considerable variation within each quantity level. Logistic Regression Model: A logistic regression model was built to predict the profitability of sales transactions based on features like sales, quantity, and discount. The model achieved an accuracy of approximately 92.96% on the test set and a ROC AUC score of 0.948, indicating good discriminative ability. Linear Regression Model: A linear regression model was created to identify the factors contributing to profitability. The model found that sales had the most significant positive impact on profit, followed by quantity, while discounts had a negative impact on profit. Recommendations: The following measures can be adopted by Superstore company to maximize its profitability:
1.Focus on High-Performing Sub-Categories: Prioritize and invest in sub-categories with high average profits, such as "Accessories," to drive overall profitability. Conduct further analysis to understand the factors contributing to the success of these sub-categories and replicate best practices across other product lines. 2.Optimize Low-Performing Sub-Categories: Closely examine sub-categories with low or negative average profits, such as "Tables" and "Bookcases," to identify areas for improvement or potential discontinuation. Consider cost optimization strategies, pricing adjustments, or product line restructuring for these underperforming sub-categories. 3.Sales and Quantity Strategies: Implement strategies to increase sales volumes and quantities, as these factors have a positive impact on profitability, according to the linear regression model. Explore marketing campaigns, promotions, or product bundling to drive higher sales and quantities for profitable product lines. 4.Discount Optimization: Review and optimize discount policies, as the analysis suggests that higher discounts lead to lower profitability. Consider targeted discounting strategies or adjust discount rates to strike a balance between attracting customers and maintaining profitability. 5.Outlier Investigation: Investigate the high sales and profit outliers identified in the scatter plot to understand the factors contributing to their exceptional performance. Similarly, analyze the negative profit outliers to identify potential issues or areas for improvement in operations, pricing, or cost management. Continuous Monitoring and Improvement: Implement a regular data analysis and monitoring process to track sales, profitability, and other key performance indicators across sub-categories and product lines. Continuously refine and update the models based on new data and insights to adapt to changing market conditions and customer preferences. By implementing these recommendations, the Superstore company can leverage data-driven insights to optimize their product portfolio, pricing strategies, and operational efficiencies, ultimately driving increased profitability and sustainable growth.
References:
1.Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (3rd ed.). Hoboken, NJ: John Wiley & Sons. 2.James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning with applications in R. New York, NY: Springer. 3.Leys, C., Ley, C., Klein, O., Bernard, P., & Licata, L. (2013). Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4), 764-766. 4.Muller, A. C., & Guido, S. (2016). Introduction to machine learning with Python: A guide for data scientists. Sebastopol, CA: O'Reilly Media. Provost, F., & Fawcett, T. (2013). Data science for business: What you need to know about data mining and data-analytic thinking. Sebastopol, CA: O'Reilly Media. 5.Raschka, S. (2015). Python machine learning. Birmingham, UK: Packt Publishing Ltd. 6.Varian, H. R. (2014). Big data: New tricks for econometrics. Journal of Economic Perspectives, 28(2), 3-28. 7.Wickham, H., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data. Sebastopol, CA: O'Reilly Media. 8.Zheng, A., & Casari, A. (2018). Feature engineering for machine learning: Principles and techniques for data scientists. Sebastopol, CA: O'Reilly Media. 9.Zumel, N., & Mount, J. (2014). Practical data science with R. Shelter Island, NY: Manning Publications.