C11BD - Coursework 2
H00435896
Introduction
In today's data-driven business world, companies increasingly rely on data analytics to gain valuable insights and make informed decisions. This report focuses on the application of big data analytics to enhance profitability for the company. The dataset provided, Superstore.csv, contains detailed information about customer orders, products, and sales. The objective of this analysis is to utilize data cleaning, descriptive statistics, data visualization, and modelling techniques to identify factors that significantly impact the company's profitability. By uncovering patterns, trends, and relationships within the data, the report aims to provide valuable insights that can improve the company's financial performance.
Methodology
1. Data Import and Understanding
The first step in our methodology is to import the provided dataset, dataset_Superstore.csv, into our Python environment. We utilize the pandas library to read the CSV file and store it as a DataFrame for further analysis. This step also includes gaining a thorough understanding of the dataset's structure, variables, and their descriptions using appropriate techniques.
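A minimal sketch of this step (assuming the CSV sits in the working directory; the later sketches in this report reuse the same df):

import pandas as pd

# Load the Superstore data into a DataFrame
df = pd.read_csv('dataset_Superstore.csv')

# Inspect the structure: column names, dtypes and non-null counts
df.info()
print(df.head())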
2. Data Cleaning
Data cleaning is essential to ensure the accuracy and reliability of our analysis. In this step, we identify and address any data entry errors, missing values, or outliers present in the dataset. Techniques such as removing duplicate records and handling missing values are applied according to the specific requirements of the analysis (Bennihi, Zirari and Medjahed, 2022).
Checking for Missing Values and Handling Them
As the results show, there are no missing values in any column of the dataset.
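A sketch of the check; the commented lines show how missing values could be handled had any been found (the fill strategy is illustrative):

# Count missing values per column; every count was zero for this dataset
print(df.isnull().sum())

# Had any been present, rows could be dropped or values imputed, e.g.:
# df = df.dropna()
# df['Sales'] = df['Sales'].fillna(df['Sales'].median())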
Checking for Duplicates
We then checked for duplicated records and found none.
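A sketch of the duplicate check:

# Count fully duplicated rows; the result was zero for this dataset
print(df.duplicated().sum())

# If duplicates existed, they could be removed with:
# df = df.drop_duplicates()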
Converting Columns to the Appropriate Formats
Converting the Returned column from Boolean to integer for further analysis. A sketch covering both conversions follows.
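This assumes the date columns are named 'Order Date' and 'Ship Date', as in the standard Superstore schema:

# Parse the date columns so that date arithmetic (e.g. delivery time) works
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Ship Date'] = pd.to_datetime(df['Ship Date'])

# Cast the Boolean 'Returned' flag to 0/1 for numerical analysis
df['Returned'] = df['Returned'].astype(int)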
Summarizing the statistics of the cleaned data
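For example:

# Summary statistics (count, mean, std, quartiles) of the cleaned data
print(df.describe())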
3. Exploratory Data Analysis
Finding the percentage of transactions that are profit-making and loss-making.
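A sketch of the computation, assuming the profit column is named 'Profit':

# Share of transactions that make a profit vs a loss
profit_pct = (df['Profit'] > 0).mean() * 100
loss_pct = (df['Profit'] < 0).mean() * 100
print(f'Profitable transactions: {profit_pct:.1f}%')
print(f'Loss-making transactions: {loss_pct:.1f}%')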
Profit/Loss on each product
Finding the percentage of products that are loss-making. A sketch covering both product-level computations follows.
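This assumes a 'Product Name' column identifies each product:

# Total profit or loss for each product
product_profit = df.groupby('Product Name')['Profit'].sum()
print(product_profit.sort_values())

# Share of products whose overall profit is negative
loss_making_pct = (product_profit < 0).mean() * 100
print(f'Loss-making products: {loss_making_pct:.1f}%')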
Visualizing the correlations between all the relevant columns using a correlation matrix.
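A sketch using a seaborn heatmap over the numerical columns:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numerical columns, rendered as a heatmap
corr = df.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation matrix')
plt.show()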
From this visualization we identified the most preferred ship mode and the distribution of orders across the different regions, segments, and categories.
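A sketch of how such distribution charts can be produced (the column names assume the standard Superstore schema):

# Frequency of each category in four categorical columns
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.flat, ['Ship Mode', 'Region', 'Segment', 'Category']):
    df[col].value_counts().plot(kind='bar', ax=ax, title=col)
plt.tight_layout()
plt.show()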
From this step we identified the best-performing cities in terms of order frequency and average sales; it also highlights the cities with the highest number of returns.
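A sketch of the underlying city-level aggregation (the 'Order ID' column name is an assumption):

# Order frequency, average sales and returns per city
city_stats = df.groupby('City').agg(
    orders=('Order ID', 'count'),
    avg_sales=('Sales', 'mean'),
    returns=('Returned', 'sum'),
)
print(city_stats.sort_values('orders', ascending=False).head(10))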
From the scatter plots we inferred that profit and sales generally have a positive correlation, profit and discount a negative correlation, and profit and quantity a positive correlation.
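A sketch of the three scatter plots:

# Profit against sales, discount and quantity
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ['Sales', 'Discount', 'Quantity']):
    ax.scatter(df[col], df['Profit'], alpha=0.3)
    ax.set_xlabel(col)
    ax.set_ylabel('Profit')
plt.tight_layout()
plt.show()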
In this step we first engineered a new feature called Delivery Time, the number of days taken to deliver a product, and then computed summary statistics such as the mean delivery time, which was almost four days.
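A sketch of the feature engineering (this relies on the date columns parsed earlier):

# Days between order and shipment as a new feature
df['Delivery Time'] = (df['Ship Date'] - df['Order Date']).dt.days
print(df['Delivery Time'].describe())  # the mean was almost four days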
This plot focuses on average sales, number of returns, and the ship modes used per region. It can help us focus on the most profitable regions, while ship-mode usage can show whether some shipping options could be removed from a particular region if they are rarely used, simplifying the operational and logistical processes for that region.
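A sketch of the region-level view:

# Average sales and total returns per region
region_stats = df.groupby('Region').agg(
    avg_sales=('Sales', 'mean'),
    returns=('Returned', 'sum'),
)
print(region_stats)

# How often each ship mode is used in each region
print(pd.crosstab(df['Region'], df['Ship Mode']))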
Top-selling products by different criteria, along with the products most frequently returned. Focusing on the top-selling products can help increase sales, while focusing on the most returned products helps minimize returns and avoid further losses, since every returned product incurs logistics costs.
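A sketch of these rankings:

# Top sellers by revenue and by quantity, and the most-returned products
print(df.groupby('Product Name')['Sales'].sum().nlargest(10))
print(df.groupby('Product Name')['Quantity'].sum().nlargest(10))
print(df.groupby('Product Name')['Returned'].sum().nlargest(10))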
Most profitable category, sub-category, segment and region.
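A sketch of the profitability breakdown across these dimensions:

# Total profit by category, sub-category, segment and region
for col in ['Category', 'Sub-Category', 'Segment', 'Region']:
    print(df.groupby(col)['Profit'].sum().sort_values(ascending=False))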
4. Outlier Detection and Handling
Dropping Columns
Here we drop columns that are non-numerical, redundant, or not required. Many models don't work with non-numerical data, and we would otherwise have to perform label encoding; in this case, however, the categorical columns already have numerical substitutes that can be used instead. For example, we already have a numerical column called Segment_no, so we remove the Segment column, as both represent the same thing.
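One way to sketch this step: selecting the numerical columns drops the redundant categorical ones in a single call ('Postal Code' is given as an example of a numeric but unneeded column):

# Keep only the numerical columns for modelling; the redundant categorical
# columns (Segment, State, Category, Sub-Category) have numerical substitutes
df_model = df.select_dtypes('number').drop(columns=['Postal Code'], errors='ignore')
print(df_model.columns.tolist())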
Plotting Box-and-Whisker Plots to Detect Outliers
We use box-and-whisker plots to detect and analyze the outliers. A box-and-whisker plot provides a clear and concise summary of the data distribution, including the presence of outliers, the range, quartiles, and median, thus allowing for an easier analysis (Iglewicz, 2011).
Outliers in this case are data points that fall outside the upper or lower whiskers of the box plot. The whiskers extend to the most extreme data points within 1.5 times the interquartile range (IQR) of the lower and upper quartiles; data points that fall beyond this range are considered outliers (Dawson, 2011).
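A sketch of the plots:

# One box plot per numerical feature to surface outliers
df_model[['Profit', 'Sales', 'Discount', 'Quantity']].plot(
    kind='box', subplots=True, layout=(2, 2), figsize=(10, 8))
plt.tight_layout()
plt.show()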
Examining the box plots reveals that there are outliers for Profit, Sales, Discount, and Quantity. This means there are data points for these features that fall outside the typical range represented by the box and whiskers: instances of unusually high or low profits compared to the majority, exceptionally high or low sales figures, discounts that are significantly larger or smaller than usual, and quantities that deviate significantly from the rest of the data.
Note: Categorical features like Segment_no, State_no, Category_no, and Sub-Category_no usually have a limited number of categories and wouldn't necessarily show outliers in the same way.
Outlier Handling Using the Interquartile Range (IQR)
The IQR measures how spread out the middle half of the data is, without being affected by extreme values that might distort the overall range. Imagine the data points ordered from least to greatest: the IQR is the difference between the value at the 75th percentile (Q3) and the value at the 25th percentile (Q1), so it shows how much variation there is within the central 50% of the data set. The IQR therefore gives an idea of how tightly clustered the middle values are, allowing us to detect and handle the outliers.
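A sketch of the filtering step (this assumes df_model contains only numerical columns, as produced above):

# IQR-based outlier removal across all numerical features
Q1 = df_model.quantile(0.25)
Q3 = df_model.quantile(0.75)
IQR = Q3 - Q1

# Keep rows where no feature falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
mask = ~((df_model < Q1 - 1.5 * IQR) | (df_model > Q3 + 1.5 * IQR)).any(axis=1)
df_clean = df_model[mask].copy()
print(df_model.shape, '->', df_clean.shape)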
The above code first calculates the IQR as the difference between Q3 and Q1, then checks for any values that fall outside the IQR-based range, i.e. the outliers, and finally removes them. We also checked the shape and size of the data before and after outlier handling: the new data frame has 7140 rows, which is 2854 fewer than the original, indicating that any potential outliers have been handled (Vinutha, Poornima and Sagar, 2018).
5. Modelling
To holistically analyze the Superstore data, we utilize three machine learning models. Linear regression tackles continuous relationships, Random Forest captures complex patterns and handles classification, and K-Means clustering finds hidden structure by grouping similar data points. This multi-model approach offers a rich understanding of the data, supporting sales prediction, customer segmentation, and retention, thus increasing the profit (Wasserbacher and Spindler, 2021).
Linear Regression
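A sketch of the regression, assuming Sales is the target and the remaining numerical columns are the features (the split ratio and random seed are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error)

X = df_clean.drop(columns=['Sales'])
y = df_clean['Sales']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

lr = LinearRegression().fit(X_train, y_train)
print('Train R^2:', lr.score(X_train, y_train))
print('Test R^2:', lr.score(X_test, y_test))

# Error metrics on the held-out test set
y_pred = lr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print('MSE:', mse, 'RMSE:', np.sqrt(mse))
print('MAE:', mean_absolute_error(y_test, y_pred))
print('MAPE:', mean_absolute_percentage_error(y_test, y_pred))

# Direction and strength of each feature's relationship with Sales
print(pd.Series(lr.coef_, index=X.columns))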
Result:
Model Performance:
The training set score (0.3798) and test set score (0.3695) are R-squared values indicating a moderate fit, explaining around 37-38% of the variance in sales for both the training and test data. The regression score (0.3695) is the same as the test set score. The Mean Squared Error (5208.96) and Root Mean Squared Error (72.17) show the average squared difference, and its square root, between predicted and actual sales values; lower values usually indicate a better fit. The Mean Absolute Percentage Error (1.84%) and Mean Absolute Error (48.00) represent the average absolute difference between predicted and actual sales, with and without percentage scaling; lower values suggest better prediction accuracy.
Coefficients: the coefficient of a feature tells you the direction and strength of its relationship with the target variable; a positive coefficient represents a positive relationship, and vice versa.
Conclusion: while the model seems to capture some trends in the data based on the R-squared and Mean Absolute Percentage Error, the overall fit might not be entirely sufficient.
Random Forest Classification
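A sketch of the profit-category model; the cut points used to form the three classes are illustrative assumptions, as the report does not state the exact thresholds:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Bucket each order's profit into Loss / None-Small Profit / Profit
profit_cat = pd.cut(df_clean['Profit'], bins=[-np.inf, 0, 50, np.inf],
                    labels=['Loss', 'None/Small Profit', 'Profit'])

X = df_clean.drop(columns=['Profit'])
X_train, X_test, y_train, y_test = train_test_split(
    X, profit_cat, test_size=0.2, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))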
Adding the Customer ID column back to the DataFrame
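A sketch of the retention model. How "return frequency" was labelled is not shown in the report, so the target below (whether a customer's total returns exceed the median) is a hypothetical stand-in:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Restore Customer ID via the row index preserved through the cleaning steps
df_clean['Customer ID'] = df.loc[df_clean.index, 'Customer ID']

# Hypothetical target: does the customer return more often than the median?
returns_per_customer = df.groupby('Customer ID')['Returned'].sum()
y = (df_clean['Customer ID'].map(returns_per_customer)
     > returns_per_customer.median()).astype(int)

X = df_clean.drop(columns=['Customer ID'])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))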
Result
We used Random Forest models for two distinct tasks: profit category prediction and customer retention prediction.
Model 1: Profit Category Prediction
This model aimed to classify each order's profit into Profit, Loss, or None/Small Profit categories, in order to identify highly profitable orders.
Performance: The model achieved an accuracy of 93%, indicating a high success rate in correctly classifying profit categories. Precision and recall were also impressive at around 93%, signifying a low rate of both false positives (e.g. incorrectly flagging an order as a loss) and false negatives (missing actual losses).
Model 2: Customer Retention Prediction
This model aimed to predict customer return frequency, potentially aiding in customer retention efforts.
Performance: The model's accuracy was moderate at 64%. While recall (0.64) suggests it identifies a good proportion of returning customers, the low precision (0.42) indicates a high rate of false positives: the model often predicts that customers will return frequently when they will not.
K-Means Clustering
Adding the Customer Name column back to the DataFrame
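A sketch of the segmentation; the customer-level features chosen here are illustrative:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Aggregate order data to one row per customer (feature choice is illustrative)
cust = df.groupby('Customer Name').agg(
    total_sales=('Sales', 'sum'),
    total_profit=('Profit', 'sum'),
    orders=('Order ID', 'nunique'),
    returns=('Returned', 'sum'),
)
X = StandardScaler().fit_transform(cust)

# Elbow curve: inertia for k = 1..10 to pick the optimal k
inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
           for k in range(1, 11)]
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()

# Segment customers into the 5 clusters used in the report
cust['cluster'] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
print(cust['cluster'].value_counts())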
Result
Finally, we used clustering for customer segmentation. We first used the elbow curve to determine the optimal value of k and then segmented the customers into 5 categories, displaying how many customers fall into each one. This allows for targeted promotions and marketing to increase profitability.
Conclusion
This exploration of the Superstore data used a holistic approach, combining linear regression, random forest classification, and k-means clustering to gain a comprehensive understanding of the factors influencing profitability. Linear regression provided a starting point for modelling continuous relationships, while random forest helped uncover complex patterns and classifications. For instance, the profit category prediction model using Random Forest achieved impressive accuracy (93%), allowing businesses to identify potential losses early and take corrective action. K-means clustering proved valuable in revealing hidden customer segments; by grouping customers with similar buying behaviours, this approach can support targeted marketing strategies and promotions, potentially leading to increased customer lifetime value. The customer retention model using Random Forest highlighted the potential of this approach for predicting customer return frequency, a key metric for retention. Overall, this multi-model approach provides valuable insights which, when implemented effectively, can significantly enhance business profitability.