Done by: ZAHID BIN JAMSHEED (H00440696)
Coursework 2 (C11BD)
INTRODUCTION
In today's fast-paced business world, staying ahead of the competition requires tapping into the power of data. For retail businesses, this means using insights from data to boost growth and increase profits. By analyzing information about products, customers, and transactions, retailers can tune their operations, streamline supply chains, and enhance marketing strategies.
This study presents a thorough analysis of a dataset containing 9994 rows across 29 columns, spanning product, customer, and transaction details. The main goal is to conduct an exploratory data analysis, which involves cleaning the data, computing summary statistics to find trends, and creating visual representations to make sense of it all. Additionally, we'll employ advanced modeling techniques to uncover the key factors influencing the company's profitability.
Using Python in Deepnote, we'll meticulously document our data analysis process, from start to finish. Our report will be structured to first address any data issues, then explain our chosen modeling method and its execution, and finally, delve into the results and their implications.
We will also apply the KMeans clustering technique to the refined dataset to pinpoint separate groups of data points indicating various profitability levels.
In short, this study will help the retail company figure out how to do things better. By understanding its data, the company can streamline its operations, identify ways to increase profit, and make informed choices that support continued growth and success.
1. Import the Data
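A minimal sketch of the loading step, assuming the data is supplied as a CSV file (the file name superstore.csv is a placeholder for whichever file is attached to the Deepnote project):

```python
import pandas as pd

# Load the retail dataset; adjust the file name to the project's actual file
df = pd.read_csv("superstore.csv")

print(df.shape)  # the report states (9994, 29)
df.head()
```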
Basic Data Info
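A quick structural check, continuing from the sketch above:

```python
# Column names, dtypes, and non-null counts in a single view
df.info()
```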
The data has 9994 entries and 29 columns. Each column has 9994 values, showing that there are no missing values in the dataset.
2. Cleaning the Data
In order to enhance the accuracy and quality of the data, it's important that we undertake data cleansing measures.
(i.) Renaming Columns
To streamline data analysis and optimize data visualization, we propose renaming the columns within the dataset.
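One possible approach, sketched below, normalises every header to lower-case snake_case; the exact mapping used in the notebook may differ, but this gives later code consistent names such as df["order_date"]:

```python
# Standardise headers: strip whitespace, lower-case, spaces to underscores
df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
print(df.columns.tolist())
```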
(ii.) Remove unnecessary Columns
In order to enhance data quality, minimize memory usage, and streamline data analysis, it is essential to eliminate unnecessary columns from the dataset.
It appears that columns such as 'customer no,' 'segment no,' 'state no,' 'region no,' 'category no,' 'subcategory no,' and 'product name no' merely serve as numeric representations of other columns. This suggests that they may not be essential for our analysis and could potentially be excluded.
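A sketch of the drop, assuming the snake_case headers produced by the renaming step (the exact names in the notebook may differ slightly):

```python
# Drop the redundant numeric copies of existing columns
cols_to_drop = [
    "customer_no", "segment_no", "state_no", "region_no",
    "category_no", "subcategory_no", "product_name_no",
]
df = df.drop(columns=cols_to_drop, errors="ignore")
print(df.shape)  # the report states (9994, 23) after this step
```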
Before removing unnecessary data, the dataset had 9994 rows and 29 columns. After dropping the unessential columns, its shape became 9994 rows and 23 columns: the redundant columns were removed while the number of entries (rows) stayed the same.
(iii.) Convert the data types
Modifying the data type of "Order Date" and "Ship Date" is essential for refining the dataset efficiently. It ensures accurate representation of temporal information, facilitates sorting and filtering tasks, and enhances overall data integrity. This adjustment is crucial for enabling precise analysis and informed decision-making based on temporal aspects within the dataset.
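A sketch of the conversion, assuming the snake_case column names order_date and ship_date:

```python
import pandas as pd

# Parse the two date columns as proper datetimes
df["order_date"] = pd.to_datetime(df["order_date"])
df["ship_date"] = pd.to_datetime(df["ship_date"])
print(df.dtypes[["order_date", "ship_date"]])
```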
(iv.) Dealing with null entries
Addressing missing data is vital in ensuring the accuracy and reliability of findings during data analysis and modeling. Failure to handle missing values properly can introduce biases or errors that may skew the results. Therefore, it's imperative to employ appropriate techniques to manage missing data effectively, enhancing the robustness and credibility of the analytical process.
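A sketch of the check itself:

```python
# Count missing values per column; every count should be zero here
print(df.isnull().sum())
print("total missing:", df.isnull().sum().sum())
```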
Based on the displayed results, we observe a dataset comprising 9994 instances distributed among 23 variables. Each variable accommodates 9994 records, indicating a comprehensive dataset without any missing values. This thoroughness ensures that the dataset is complete and ready for analysis without the need for data correction.
(v.) Recognise and discard duplicated records
Finding and deleting duplicate data is crucial in data analysis because it can skew results and lead to inaccurate conclusions. Additionally, duplicate entries can disrupt model creation since some algorithms expect each observation to be unique. Spotting and removing duplicate data is a vital part of data cleaning and preprocessing, ensuring that the data is accurate and dependable.
Rather than dropping duplicates based on a subset of columns, I opt to remove only rows that are identical across every column. It's common to observe multiple entries for the same customer, representing separate transactions or interactions with the business, and eliminating these as duplicates would mean losing valuable insights into customer preferences and behaviours.
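A sketch of this approach:

```python
# Flag rows that are identical across *all* columns, then drop them;
# repeat purchases by the same customer are deliberately kept
print("exact duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()
```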
(vi.) Detection of outliers
Before proceeding, it's essential to detect any outliers to prevent potential issues such as measurement errors, errors in data entry, or fluctuations inherent in the dataset.
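Box plots are one quick way to screen for such values; a sketch, assuming the four numeric columns named below:

```python
import matplotlib.pyplot as plt

# One box plot per numeric measure for a quick visual screen
df[["quantity", "sales", "discount", "profit"]].plot(
    kind="box", subplots=True, layout=(1, 4), figsize=(12, 3), sharey=False
)
plt.tight_layout()
plt.show()
```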
When we look at 'Quantity', 'Sales', 'Discount', and 'Profit', we might notice some unusual data points that could be mistakes. But just because a value seems different doesn't always mean it's an outlier. We need to consider the situation and decide if these values are really outliers. Let's take a closer look at that:
(vii.) IQR (Interquartile Range) method
Using the IQR method could be a wiser choice. In this dataset, the IQR approach is advantageous because it isn't easily swayed by outliers, making it more fitting for skewed data distributions, a common scenario in customer datasets. Typically, only a few customers make substantial purchases, while the majority make smaller ones. By focusing on the interquartile range, we can better grasp the typical spending patterns of most customers, minimizing the impact of extreme purchases by a select few.
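A sketch of the fence computation, assuming the numeric columns sales, quantity, discount, and profit:

```python
# 1.5*IQR fences for the numeric measures of interest
num_cols = ["sales", "quantity", "discount", "profit"]
q1 = df[num_cols].quantile(0.25)
q3 = df[num_cols].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(lower, upper, sep="\n")
```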
(viii.) Removing outliers
Previous shape: (9994, 23)
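A sketch of the filtering step, recomputing the fences from the previous sketch and keeping only rows inside them:

```python
# Keep only the rows that fall inside the 1.5*IQR fences on every measure
num_cols = ["sales", "quantity", "discount", "profit"]
q1, q3 = df[num_cols].quantile(0.25), df[num_cols].quantile(0.75)
iqr = q3 - q1
outside = (df[num_cols] < q1 - 1.5 * iqr) | (df[num_cols] > q3 + 1.5 * iqr)
df_clean = df[~outside.any(axis=1)]
print("previous shape:", df.shape)
print("new shape:", df_clean.shape)
```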
New cleaned dataset
Key Insights from the Cleaned Dataset
After carefully cleaning and refining the dataset, the summary statistics give us a good look at the main numbers and trends. They show the important features of the cleaned data, such as the average values, how spread out the data is, and other key details. By condensing the complex data into short summaries, analysts can easily spot patterns and make decisions they can trust.
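The figures discussed below can be reproduced with simple group-by aggregations; a sketch, assuming the cleaned frame df_clean and snake_case column names:

```python
# Aggregations behind the region and category observations below
profit_by_region = df_clean.groupby("region")["profit"].sum().sort_values(ascending=False)
quantity_by_category = df_clean.groupby("category")["quantity"].sum().sort_values(ascending=False)
profit_by_category = df_clean.groupby("category")["profit"].sum().sort_values(ascending=False)
print(profit_by_region, quantity_by_category, profit_by_category, sep="\n\n")
```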
Looking at the numbers, it's clear that the Western region is making more money than any other area. Its results stand out, indicating that the West is the single largest contributor to the company's overall profit in this dataset.
After checking the data, it's clear that Office Supplies leads in quantity sold. This shows how important office supplies are, and how heavily they feature in the overall distribution plan.
Technology is the most profitable category, performing well ahead of everything else. Its profit totals are impressive, showing just how much cash it brings in and suggesting that technology has the strongest earning potential in this dataset.
3. Summary statistics of cleaned data
We can gain insights into the spread, typical values, and variations within our data through a collection of summary statistics. My approach involves computing these statistics across three distinct categories: numerical data, categorical attributes, and date-based variables.
(A.) Numerical Variables
(i.) Numerical variables
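A sketch of the summary, assuming the cleaned frame df_clean:

```python
# Summary statistics for the numeric columns of the cleaned data
print(df_clean.describe().round(2))
```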
The sales data exhibits an average amount of 75.50 USD with a standard deviation of 92.21 USD, suggesting a varied spread around this mean. The average number of units sold is 3.40, with a standard deviation of 1.87. The discount offered averages 9.67% with a standard deviation of 10.63%, showcasing variability in pricing strategies. Regarding profits, the average per transaction stands at 13 USD, with a standard deviation of 17.18 USD, indicating fluctuations in profitability across transactions.
Furthermore, there are identifiable ranges within which most purchases, quantity sold, and profits fall. Alongside these averages, the minimum and maximum values for each metric provide additional context to the data's distribution and variability.
Finally, the rate of returns stands at 8.1%, indicating that the vast majority of sales remain intact without being returned.
-> Advice:-
In light of these discoveries, it's advisable for the company to delve deeper into their sales data to uncover underlying patterns or emerging trends. It's equally crucial for them to explore the interconnections between different variables, seeking out any correlations that could be influencing their sales and profitability.
Moreover, a strategic move would involve tailoring discounts more precisely to customers' historical purchasing behaviours. This personalised approach has the potential to bolster both sales figures and overall profits.
Lastly, it's imperative for the company to maintain vigilance over their return rate, ensuring it stays within acceptable bounds. This ongoing monitoring guarantees continued customer satisfaction and sustains the company's reputation for reliability.
Drawing from these insights, several suggestions emerge for further analysis and potential actions:
• One proposed course of action stemming from these findings involves delving deeper into the distribution of sales and profits. This entails scrutinising whether specific subcategories or products consistently yield higher or lower profits. Additionally, examining the distribution of returns, characterized by a relatively low mean yet a maximum value of 1, could unveil insights. Such analysis might reveal products or categories more prone to returns, potentially impacting overall profitability.
• Another avenue for exploration could entail a more nuanced examination of the correlation between discounts and profits. Are there specific products or categories where offering discounts notably influences sales or profitability? Are particular discount levels consistently associated with higher or lower profits? By dissecting this relationship, potential strategies for fine-tuning discounts and enhancing profitability could be identified.
(ii.) Numerical comparison across categories
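A sketch of the comparison, assuming category and subcategory columns under those names:

```python
# Per-category and per-subcategory summaries of the numeric measures
print(df_clean.groupby("category")[["sales", "quantity", "profit"]].describe().round(2))
print(df_clean.groupby("subcategory")[["sales", "profit"]]
      .agg(["mean", "std", "median", "max"]).round(2))
```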
The dataset comprises three distinct categories: office supplies, technology, and furniture. Among these, office supplies exhibit the lowest average sales at 48.68 USD, while furniture boasts the highest average sales of 131.70 USD, coupled with the widest sales variation represented by a standard deviation of 119.26 USD. Furniture also stands out with the highest median quantity sold, at 3 units, and the most substantial profit margin at 70.72%.
In terms of subcategories, fasteners register the lowest mean sales at 12.94 USD, contrasting sharply with copiers, which lead with the highest mean sales at 479.98 USD. Machines emerge as noteworthy for their significant sales variability, indicated by the largest standard deviation of 133.81 USD. Additionally, machines command attention for their highest median quantity sold (4 units) and the topmost maximum profit margin of 70.00%, a position shared with storage and copiers.
It's worth noting that across all categories and subcategories, the incidence of returned goods remains relatively low.
-> Advice:-
Let's prioritise examining Furniture and Technology categories due to their consistently higher average sales when contrasted with Office Supplies. Within these categories, we should delve into specific sub-categories like Bookcases, Chairs, and Phones, which exhibit notably higher average sales compared to others.
Fortunately, our return rates remain relatively low across all categories and sub-categories, signalling positive customer satisfaction. Nonetheless, it's crucial to maintain vigilance over return rates and delve into the reasons behind any returned products.
It's worth noting that certain sub-categories, such as Copiers, Machines, and Tables, demonstrate higher standard deviations in sales, indicating greater sales volatility. This warrants deeper investigation to understand the underlying factors contributing to these fluctuations.
Additionally, the profit margins of select sub-categories appear to be suboptimal, suggesting a need to reassess our pricing strategies, explore avenues for cost reduction, or potentially discontinue these less profitable sub-categories.
(B.) Categorical Variables
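A sketch of the summary for the non-numeric columns:

```python
# Counts, unique values, top categories, and their frequencies
print(df_clean.describe(include="object"))
```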
The dataset comprises 7140 entries devoid of any missing data. Among these, 4183 unique order IDs signify that certain orders encompass multiple products. The orders originate from 787 distinct consumers residing in 508 cities and 48 states across the United States. It encompasses a total of 1675 distinct products, categorized into technology, office supplies, and furniture. Additionally, it highlights the most frequently occurring order ID, ship mode, customer ID, customer name, segment, state, product ID, category, subcategory, and product name.
-> Advice:-
(C.) Date Variables
Let's analyze the data over time by computing summary statistics for the date variables, such as order date and shipping date, found in the dataset.
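A sketch of how the calendar parts can be derived and summarised, assuming the parsed datetime columns order_date and ship_date:

```python
# Derive calendar parts from the parsed dates, then summarise them
dates = df_clean[["order_date", "ship_date"]].copy()
for prefix, col in [("order", "order_date"), ("ship", "ship_date")]:
    dates[f"{prefix}_year"] = dates[col].dt.year
    dates[f"{prefix}_month"] = dates[col].dt.month
    dates[f"{prefix}_day"] = dates[col].dt.day
    dates[f"{prefix}_dayofweek"] = dates[col].dt.dayofweek
print(dates.drop(columns=["order_date", "ship_date"]).describe().round(2))
```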
The dataset captures information spanning four years, covering orders placed between 2014 and 2017, and shipments made between 2014 and 2018. Most orders and shipments occurred during the summer months, with average order_month and ship_month values close to 7, suggesting a seasonal trend. Additionally, the average order_day and ship_day values around 16 indicate a preference for mid-month transactions. Despite considerable variability in order_month and ship_month, indicating fluctuations over time, the standard deviation for order_dayofweek and ship_dayofweek is smaller, reflecting a more uniform distribution across the week. The fact that minimum and maximum values for order_day and ship_day align with calendar days underscores the integrity of the data, indicating no missing or erroneous entries.
-> Advice:-
4. Visualising the Dataset
Categorical Visualisation: Bar Charts
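The chart described below could be produced along these lines (a sketch, assuming the region and category columns):

```python
import matplotlib.pyplot as plt

# Order counts per product category, split by region (grouped bars)
counts = df_clean.groupby(["region", "category"]).size().unstack("category")
counts.plot(kind="bar", figsize=(8, 4))
plt.ylabel("Number of orders")
plt.title("Orders by category and region")
plt.tight_layout()
plt.show()
```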
The above bar chart shows the distribution of orders across product categories, but broken down by region. It appears that technology is the most popular category across all regions, followed by furniture and office supplies. Central region seems to have the most orders overall, followed by East, South and West.
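The next chart, described below, could be sketched as follows (assuming the same column names as above):

```python
import matplotlib.pyplot as plt

# Total sales per category/subcategory pair, as horizontal bars
sales_by_sub = (df_clean.groupby(["category", "subcategory"])["sales"]
                .sum().sort_values())
sales_by_sub.plot(kind="barh", figsize=(8, 6))
plt.xlabel("Total sales (USD)")
plt.title("Total sales by category and subcategory")
plt.tight_layout()
plt.show()
```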
This graph displays total sales by product category and subcategory. We can see that Furniture is the leading category, followed by Technology and Office Supplies. Within the Furniture category, Chairs and Bookcases appear to be top sellers.
The graph shows total profit and sales by product category and subcategory. The blue line represents the profit, and the orange line represents the sales. Furniture is the leading category in both sales and profit, with Chairs and Bookcases being the top subcategories. Overall, sales appear to be higher than profits, which means the profit margin on these products may be low.
Continuous Data: Scatter Plot Analysis
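A sketch of the time-series scatter described below, assuming daily totals of the sales column:

```python
import matplotlib.pyplot as plt

# Total sales per order date, shown as a scatter over time
daily_sales = df_clean.groupby("order_date")["sales"].sum()
plt.figure(figsize=(10, 4))
plt.scatter(daily_sales.index, daily_sales.values, s=8, alpha=0.5)
plt.xlabel("Order date")
plt.ylabel("Total sales (USD)")
plt.title("Total sales over time")
plt.tight_layout()
plt.show()
```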
The above scatter graph depicts total sales over time; there has been an upward trend in sales across the period. There are fluctuations throughout, but the general trend is positive. Sales appear to start around 500 and end around 3000.
The graph shows a scatter plot, where each data point represents the sales and profit for a specific product category. There appears to be a positive correlation between sales and profit, meaning that categories with higher sales tend to also have higher profits. One data point in the upper right corner stands out from the rest, indicating a category with both high sales and high profits.
5. Modeling Strategy for Profitability Analysis
K-means clustering is an excellent choice for analyzing the factors contributing to the profitability of the company because it allows us to identify distinct groups or clusters within the data based on similarities in their features, and the resulting segments are straightforward to interpret and act on. We begin by using the elbow method to choose the number of clusters.
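A sketch of the elbow search, assuming the clustering features are sales, profit, and quantity, standardised before fitting:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardise the features so no single scale dominates the distances
X = StandardScaler().fit_transform(df_clean[["sales", "profit", "quantity"]])

# Fit KMeans for a range of K and record the within-cluster sum of
# squared errors (WCSS), exposed by scikit-learn as inertia_
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```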
This graph displays the Within Cluster Sum of Squared Errors (WCSS) against different values of K. The elbow method is used to find the optimal number of clusters for KMeans clustering. The plot shows a downward trend in WCSS as the number of clusters increases. We identify the "elbow" point where the rate of decrease slows down significantly. This point represents the optimal number of clusters, where adding more clusters doesn't provide much improvement in WCSS. In this graph, the elbow point is where K=6, suggesting that 6 clusters might be the optimal choice for this dataset.
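A sketch of the final fit and the 3D visualisation described below, using the six clusters suggested by the elbow plot and the same assumed feature columns:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Fit the final model with K=6 on the standardised features
X = StandardScaler().fit_transform(df_clean[["sales", "profit", "quantity"]])
labels = KMeans(n_clusters=6, n_init=10, random_state=42).fit_predict(X)

# 3D scatter of sales, profit, and quantity, coloured by cluster label
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(df_clean["sales"], df_clean["profit"], df_clean["quantity"],
           c=labels, cmap="viridis", s=10)
ax.set_xlabel("Sales")
ax.set_ylabel("Profit")
ax.set_zlabel("Quantity")
plt.show()
```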
This 3D scatter plot represents the relationship between sales, profit, and quantity, where each data point is assigned a color corresponding to its cluster label. The clusters are generated using KMeans clustering with six clusters. By visually inspecting the plot, we can observe how data points within each cluster are distributed across these three dimensions. The spatial arrangement of points provides insights into the segmentation of the dataset and potential patterns or trends within each cluster. This visualization helps in understanding the distinct groupings present in the data and can guide further analysis or decision-making processes.