Done by: ZAHID BIN JAMSHEED (H00440696)
Coursework 2 (C11BD)
INTRODUCTION
In today's fast-paced business world, staying ahead of the competition requires tapping into the power of data. For retail businesses, this means using insights from data to boost growth and increase profits. By analyzing information about products, customers, and transactions, retailers can tune their operations, streamline supply chains, and enhance marketing strategies.
This study presents a thorough analysis of a dataset containing 9994 rows across 29 columns, spanning product, customer, and transaction details. The main goal is to conduct an exploratory data analysis, which involves cleaning the data, computing summary statistics to find trends, and creating visual representations to make sense of it all. Additionally, we'll employ advanced modeling techniques to uncover the key factors influencing the company's profitability.
Using Python in Deepnote, we'll meticulously document our data analysis process, from start to finish. Our report will be structured to first address any data issues, then explain our chosen modeling method and its execution, and finally, delve into the results and their implications.
We will also apply the KMeans clustering technique to the refined dataset to pinpoint separate groups of data points indicating various profitability levels.
In short, this study will help the retail company figure out how to do things better. By understanding its data, the company can streamline its operations, identify ways to increase profit, and make informed choices that support continued growth and success.
1. Import the Data
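A minimal sketch of the loading step, assuming the data is supplied as a CSV file (the file name superstore.csv is a placeholder for whichever file is attached to the Deepnote project):

```python
import pandas as pd

# Load the retail dataset; adjust the file name to the project's actual file
df = pd.read_csv("superstore.csv")

print(df.shape)  # the report states (9994, 29)
df.head()
```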
Basic Data Info
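A quick structural check, continuing from the sketch above:

```python
# Column names, dtypes, and non-null counts in a single view
df.info()
```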
The data has 9994 entries and 29 columns. Each column has 9994 values, showing that there are no missing values in the dataset.
2. Cleaning the Data
In order to enhance the accuracy and quality of the data, it's important that we undertake data cleansing measures.
(i.) Renaming Columns
To streamline data analysis and optimize data visualization, we propose renaming the columns within the dataset.
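One possible approach, sketched below, normalises every header to lower-case snake_case; the exact mapping used in the notebook may differ, but this gives later code consistent names such as df["order_date"]:

```python
# Standardise headers: strip whitespace, lower-case, spaces to underscores
df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
print(df.columns.tolist())
```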
(ii.) Remove unnecessary Columns
In order to enhance data quality, minimize memory usage, and streamline data analysis, it is essential to eliminate unnecessary columns from the dataset.
It appears that columns such as 'customer no,' 'segment no,' 'state no,' 'region no,' 'category no,' 'subcategory no,' and 'product name no' merely serve as numeric representations of other columns. This suggests that they may not be essential for our analysis and could potentially be excluded.
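A sketch of the drop, assuming the snake_case headers produced by the renaming step (the exact names in the notebook may differ slightly):

```python
# Drop the redundant numeric copies of existing columns
cols_to_drop = [
    "customer_no", "segment_no", "state_no", "region_no",
    "category_no", "subcategory_no", "product_name_no",
]
df = df.drop(columns=cols_to_drop, errors="ignore")
print(df.shape)  # the report states (9994, 23) after this step
```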
Before removing unnecessary data, the dataset had 9994 rows and 29 columns. After dropping the unessential columns, its shape became 9994 rows and 23 columns: the redundant columns were removed while the number of entries (rows) stayed the same.
(iii.) Convert the data types
Modifying the data type of "Order Date" and "Ship Date" is essential for refining the dataset efficiently. It ensures accurate representation of temporal information, facilitates sorting and filtering tasks, and enhances overall data integrity. This adjustment is crucial for enabling precise analysis and informed decision-making based on temporal aspects within the dataset.
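A sketch of the conversion, assuming the snake_case column names order_date and ship_date:

```python
import pandas as pd

# Parse the two date columns as proper datetimes
df["order_date"] = pd.to_datetime(df["order_date"])
df["ship_date"] = pd.to_datetime(df["ship_date"])
print(df.dtypes[["order_date", "ship_date"]])
```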
(iv.) Dealing with null entries
Addressing missing data is vital in ensuring the accuracy and reliability of findings during data analysis and modeling. Failure to handle missing values properly can introduce biases or errors that may skew the results. Therefore, it's imperative to employ appropriate techniques to manage missing data effectively, enhancing the robustness and credibility of the analytical process.
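A sketch of the check itself:

```python
# Count missing values per column; every count should be zero here
print(df.isnull().sum())
print("total missing:", df.isnull().sum().sum())
```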
Based on the displayed results, we observe a dataset comprising 9994 instances distributed among 23 variables. Each variable accommodates 9994 records, indicating a comprehensive dataset without any missing values. This thoroughness ensures that the dataset is complete and ready for analysis without the need for data correction.
(v.) Recognise and discard duplicated records
Finding and deleting duplicate data is crucial in data analysis because it can skew results and lead to inaccurate conclusions. Additionally, duplicate entries can disrupt model creation since some algorithms expect each observation to be unique. Spotting and removing duplicate data is a vital part of data cleaning and preprocessing, ensuring that the data is accurate and dependable.
Rather than dropping duplicates based on a subset of columns, I opt to remove only rows that are identical across every column. It's common to observe multiple entries for the same customer, representing separate transactions or interactions with the business, and eliminating these as duplicates would mean losing valuable insights into customer preferences and behaviours.
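A sketch of this approach:

```python
# Flag rows that are identical across *all* columns, then drop them;
# repeat purchases by the same customer are deliberately kept
print("exact duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()
```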
(vi.) Detection of outliers
Before proceeding, it's essential to detect any outliers to prevent potential issues such as measurement errors, errors in data entry, or fluctuations inherent in the dataset.
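Box plots are one quick way to screen for such values; a sketch, assuming the four numeric columns named below:

```python
import matplotlib.pyplot as plt

# One box plot per numeric measure for a quick visual screen
df[["quantity", "sales", "discount", "profit"]].plot(
    kind="box", subplots=True, layout=(1, 4), figsize=(12, 3), sharey=False
)
plt.tight_layout()
plt.show()
```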
When we look at 'Quantity', 'Sales', 'Discount', and 'Profit', we might notice some unusual data points that could be mistakes. But just because a value seems different doesn't always mean it's an outlier. We need to consider the situation and decide if these values are really outliers. Let's take a closer look at that:
(vii.) IQR (Interquartile Range) method
Using the IQR method could be a wiser choice. In this dataset, the IQR approach is advantageous because it isn't easily swayed by outliers, making it more fitting for skewed data distributions, a common scenario in customer datasets. Typically, only a few customers make substantial purchases, while the majority make smaller ones. By focusing on the interquartile range, we can better grasp the typical spending patterns of most customers, minimizing the impact of extreme purchases by a select few.
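A sketch of the fence computation, assuming the numeric columns sales, quantity, discount, and profit:

```python
# 1.5*IQR fences for the numeric measures of interest
num_cols = ["sales", "quantity", "discount", "profit"]
q1 = df[num_cols].quantile(0.25)
q3 = df[num_cols].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(lower, upper, sep="\n")
```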
(viii.) Removing outliers
Previous shape: (9994, 23)
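A sketch of the filtering step, recomputing the fences from the previous sketch and keeping only rows inside them:

```python
# Keep only the rows that fall inside the 1.5*IQR fences on every measure
num_cols = ["sales", "quantity", "discount", "profit"]
q1, q3 = df[num_cols].quantile(0.25), df[num_cols].quantile(0.75)
iqr = q3 - q1
outside = (df[num_cols] < q1 - 1.5 * iqr) | (df[num_cols] > q3 + 1.5 * iqr)
df_clean = df[~outside.any(axis=1)]
print("previous shape:", df.shape)
print("new shape:", df_clean.shape)
```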
New cleaned dataset
Key Insights from the Cleaned Dataset
After carefully cleaning and refining the dataset, the summary statistics give us a good look at the main numbers and trends. They show the important features of the cleaned data, such as the average values, how spread out the data is, and other key details. By condensing the complex data into short summaries, analysts can easily spot patterns and make decisions they can trust.
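The figures discussed below can be reproduced with simple group-by aggregations; a sketch, assuming the cleaned frame df_clean and snake_case column names:

```python
# Aggregations behind the region and category observations below
profit_by_region = df_clean.groupby("region")["profit"].sum().sort_values(ascending=False)
quantity_by_category = df_clean.groupby("category")["quantity"].sum().sort_values(ascending=False)
profit_by_category = df_clean.groupby("category")["profit"].sum().sort_values(ascending=False)
print(profit_by_region, quantity_by_category, profit_by_category, sep="\n\n")
```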
Looking at the numbers, it's clear that the Western region is making more money than any other area. Its results stand out, indicating that the West is the single largest contributor to the company's overall profit in this dataset.
After checking the data, it's clear that Office Supplies leads in quantity sold. This shows how important office supplies are, and how heavily they feature in the overall distribution plan.
Technology is the most profitable category, performing well ahead of everything else. Its profit totals are impressive, showing just how much cash it brings in and suggesting that technology has the strongest earning potential in this dataset.
3. Summary statistics of cleaned data
We can gain insights into the spread, typical values, and variations within our data through a collection of summary statistics. My approach involves computing these statistics across three distinct categories: numerical data, categorical attributes, and date-based variables.
(A.) Numerical Variables
(i.) Numerical variables
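A sketch of the summary, assuming the cleaned frame df_clean:

```python
# Summary statistics for the numeric columns of the cleaned data
print(df_clean.describe().round(2))
```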
The sales data exhibits an average amount of 75.50 USD with a standard deviation of 92.21 USD, suggesting a varied spread around this mean. The average number of units sold is 3.40, with a standard deviation of 1.87. The discount offered averages 9.67% with a standard deviation of 10.63%, showcasing variability in pricing strategies. Regarding profits, the average per transaction stands at 13 USD, with a standard deviation of 17.18 USD, indicating fluctuations in profitability across transactions.
Furthermore, there are identifiable ranges within which most purchases, quantity sold, and profits fall. Alongside these averages, the minimum and maximum values for each metric provide additional context to the data's distribution and variability.
Finally, the rate of returns stands at 8.1%, indicating that the vast majority of sales remain intact without being returned.
-> Advice:-
In light of these discoveries, it's advisable for the company to delve deeper into their sales data to uncover underlying patterns or emerging trends. It's equally crucial for them to explore the interconnections between different variables, seeking out any correlations that could be influencing their sales and profitability.
Moreover, a strategic move would involve tailoring discounts more precisely to customers' historical purchasing behaviours. This personalised approach has the potential to bolster both sales figures and overall profits.
Lastly, it's imperative for the company to maintain vigilance over their return rate, ensuring it stays within acceptable bounds. This ongoing monitoring guarantees continued customer satisfaction and sustains the company's reputation for reliability.
Drawing from these insights, several suggestions emerge for further analysis and potential actions:
• One proposed course of action stemming from these findings involves delving deeper into the distribution of sales and profits. This entails scrutinising whether specific subcategories or products consistently yield higher or lower profits. Additionally, examining the distribution of returns, characterized by a relatively low mean yet a maximum value of 1, could unveil insights. Such analysis might reveal products or categories more prone to returns, potentially impacting overall profitability.
• Another avenue for exploration could entail a more nuanced examination of the correlation between discounts and profits. Are there specific products or categories where offering discounts notably influences sales or profitability? Are particular discount levels consistently associated with higher or lower profits? By dissecting this relationship, potential strategies for fine-tuning discounts and enhancing profitability could be identified.
(ii.) Numerical comparison across categories
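A sketch of the comparison, assuming category and subcategory columns under those names:

```python
# Per-category and per-subcategory summaries of the numeric measures
print(df_clean.groupby("category")[["sales", "quantity", "profit"]].describe().round(2))
print(df_clean.groupby("subcategory")[["sales", "profit"]]
      .agg(["mean", "std", "median", "max"]).round(2))
```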
The dataset comprises three distinct categories: office supplies, technology, and furniture. Among these, office supplies exhibit the lowest average sales at 48.68 USD, while furniture boasts the highest average sales of 131.70 USD, coupled with the widest sales variation represented by a standard deviation of 119.26 USD. Furniture also stands out with the highest median quantity sold, at 3 units, and the most substantial profit margin at 70.72%.
In terms of subcategories, fasteners register the lowest mean sales at 12.94 USD, contrasting sharply with copiers, which lead with the highest mean sales at 479.98 USD. Machines emerge as noteworthy for their significant sales variability, indicated by the largest standard deviation of 133.81 USD. Additionally, machines command attention for their highest median quantity sold (4 units) and the topmost maximum profit margin of 70.00%, a position shared with storage and copiers.
It's worth noting that across all categories and subcategories, the incidence of returned goods remains relatively low.
-> Advice:-
Let's prioritise examining Furniture and Technology categories due to their consistently higher average sales when contrasted with Office Supplies. Within these categories, we should delve into specific sub-categories like Bookcases, Chairs, and Phones, which exhibit notably higher average sales compared to others.
Fortunately, our return rates remain relatively low across all categories and sub-categories, signalling positive customer satisfaction. Nonetheless, it's crucial to maintain vigilance over return rates and delve into the reasons behind any returned products.
It's worth noting that certain sub-categories, such as Copiers, Machines, and Tables, demonstrate higher standard deviations in sales, indicating greater sales volatility. This warrants deeper investigation to understand the underlying factors contributing to these fluctuations.
Additionally, the profit margins of select sub-categories appear to be suboptimal, suggesting a need to reassess our pricing strategies, explore avenues for cost reduction, or potentially discontinue these less profitable sub-categories.
(B.) Categorical Variables
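A sketch of the summary for the non-numeric columns:

```python
# Counts, unique values, top categories, and their frequencies
print(df_clean.describe(include="object"))
```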
The dataset comprises 7140 entries devoid of any missing data. Among these, 4183 unique order IDs signify that certain orders encompass multiple products. The orders originate from 787 distinct consumers residing in 508 cities and 48 states across the United States. It encompasses a total of 1675 distinct products, categorized into technology, office supplies, and furniture. Additionally, it highlights the most frequently occurring order ID, ship mode, customer ID, customer name, segment, state, product ID, category, subcategory, and product name.
-> Advice:-
(C.) Date Variables
Let's analyze the data over time by computing summary statistics for the date variables, such as order date and shipping date, found in the dataset.
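A sketch of how the calendar parts can be derived and summarised, assuming the parsed datetime columns order_date and ship_date:

```python
# Derive calendar parts from the parsed dates, then summarise them
dates = df_clean[["order_date", "ship_date"]].copy()
for prefix, col in [("order", "order_date"), ("ship", "ship_date")]:
    dates[f"{prefix}_year"] = dates[col].dt.year
    dates[f"{prefix}_month"] = dates[col].dt.month
    dates[f"{prefix}_day"] = dates[col].dt.day
    dates[f"{prefix}_dayofweek"] = dates[col].dt.dayofweek
print(dates.drop(columns=["order_date", "ship_date"]).describe().round(2))
```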
The dataset captures information spanning four years, covering orders placed between 2014 and 2017, and shipments made between 2014 and 2018. Most orders and shipments occurred during the summer months, with average order_month and ship_month values close to 7, suggesting a seasonal trend. Additionally, the average order_day and ship_day values around 16 indicate a preference for mid-month transactions. Despite considerable variability in order_month and ship_month, indicating fluctuations over time, the standard deviation for order_dayofweek and ship_dayofweek is smaller, reflecting a more uniform distribution across the week. The fact that minimum and maximum values for order_day and ship_day align with calendar days underscores the integrity of the data, indicating no missing or erroneous entries.
-> Advice:-
4. Visualising the Dataset
Categorical Visualisation: Bar Charts
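The chart described below could be produced along these lines (a sketch, assuming the region and category columns):

```python
import matplotlib.pyplot as plt

# Order counts per product category, split by region (grouped bars)
counts = df_clean.groupby(["region", "category"]).size().unstack("category")
counts.plot(kind="bar", figsize=(8, 4))
plt.ylabel("Number of orders")
plt.title("Orders by category and region")
plt.tight_layout()
plt.show()
```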
The above bar chart shows the distribution of orders across product categories, but broken down by region. It appears that technology is the most popular category across all regions, followed by furniture and office supplies. Central region seems to have the most orders overall, followed by East, South and West.
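The next chart, described below, could be sketched as follows (assuming the same column names as above):

```python
import matplotlib.pyplot as plt

# Total sales per category/subcategory pair, as horizontal bars
sales_by_sub = (df_clean.groupby(["category", "subcategory"])["sales"]
                .sum().sort_values())
sales_by_sub.plot(kind="barh", figsize=(8, 6))
plt.xlabel("Total sales (USD)")
plt.title("Total sales by category and subcategory")
plt.tight_layout()
plt.show()
```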
This graph displays total sales by product category and subcategory. We can see that Furniture is the leading category, followed by Technology and Office Supplies. Within the Furniture category, Chairs and Bookcases appear to be top sellers.
The graph shows total profit and sales by product category and subcategory. The blue line represents the profit, and the orange line represents the sales. Furniture is the leading category in both sales and profit, with Chairs and Bookcases being the top subcategories. Overall, sales appear to be higher than profits, which means the profit margin on these products may be low.
Continuous Data: Scatter Plot Analysis
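A sketch of the time-series scatter described below, assuming daily totals of the sales column:

```python
import matplotlib.pyplot as plt

# Total sales per order date, shown as a scatter over time
daily_sales = df_clean.groupby("order_date")["sales"].sum()
plt.figure(figsize=(10, 4))
plt.scatter(daily_sales.index, daily_sales.values, s=8, alpha=0.5)
plt.xlabel("Order date")
plt.ylabel("Total sales (USD)")
plt.title("Total sales over time")
plt.tight_layout()
plt.show()
```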
The above scatter graph depicts total sales over time; there has been an upward trend in sales across the period. There are fluctuations throughout, but the general trend is positive. Sales appear to start around 500 and end around 3000.
The graph shows a scatter plot, where each data point represents the sales and profit for a specific product category. There appears to be a positive correlation between sales and profit, meaning that categories with higher sales tend to also have higher profits. One data point in the upper right corner stands out from the rest, indicating a category with both high sales and high profits.
5. Modeling Strategy for Profitability Analysis
K-means clustering is an excellent choice for analyzing the factors contributing to the profitability of the company because it allows us to identify distinct groups or clusters within the data based on similarities in their features, and the resulting segments are straightforward to interpret and act on. We begin by using the elbow method to choose the number of clusters.
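A sketch of the elbow search, assuming the clustering features are sales, profit, and quantity, standardised before fitting:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardise the features so no single scale dominates the distances
X = StandardScaler().fit_transform(df_clean[["sales", "profit", "quantity"]])

# Fit KMeans for a range of K and record the within-cluster sum of
# squared errors (WCSS), exposed by scikit-learn as inertia_
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```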
This graph displays the Within Cluster Sum of Squared Errors (WCSS) against different values of K. The elbow method is used to find the optimal number of clusters for KMeans clustering. The plot shows a downward trend in WCSS as the number of clusters increases. We identify the "elbow" point where the rate of decrease slows down significantly. This point represents the optimal number of clusters, where adding more clusters doesn't provide much improvement in WCSS. In this graph, the elbow point is where K=6, suggesting that 6 clusters might be the optimal choice for this dataset.
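A sketch of the final fit and the 3D visualisation described below, using the six clusters suggested by the elbow plot and the same assumed feature columns:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Fit the final model with K=6 on the standardised features
X = StandardScaler().fit_transform(df_clean[["sales", "profit", "quantity"]])
labels = KMeans(n_clusters=6, n_init=10, random_state=42).fit_predict(X)

# 3D scatter of sales, profit, and quantity, coloured by cluster label
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(df_clean["sales"], df_clean["profit"], df_clean["quantity"],
           c=labels, cmap="viridis", s=10)
ax.set_xlabel("Sales")
ax.set_ylabel("Profit")
ax.set_zlabel("Quantity")
plt.show()
```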
This 3D scatter plot represents the relationship between sales, profit, and quantity, where each data point is assigned a color corresponding to its cluster label. The clusters are generated using KMeans clustering with six clusters. By visually inspecting the plot, we can observe how data points within each cluster are distributed across these three dimensions. The spatial arrangement of points provides insights into the segmentation of the dataset and potential patterns or trends within each cluster. This visualization helps in understanding the distinct groupings present in the data and can guide further analysis or decision-making processes.