Coursework 2

Aleena Shaju

H00446410

18/03/2024

INTRODUCTION

Python is extensively utilized for company data analysis because of its adaptability and robust libraries tailored for data manipulation, examination, and visualization. Here, Python is employed to scrutinize company data with the aim of enhancing profitability through data analytics.

The process begins with data collection, where the company's data is imported into Python, either via libraries like Pandas or through direct database connections. Subsequently, the data undergoes cleaning and preprocessing to ensure its quality and reliability, involving tasks such as handling missing values, eliminating duplicates, and standardizing formats. Python's Pandas library offers efficient tools for this task, facilitating the preparation of data for analysis.

Once the data is cleaned and preprocessed, exploratory data analysis (EDA) is conducted. EDA entails visually and statistically exploring the data to unveil patterns, trends, and insights that can guide strategic decision-making. Visualization libraries like Matplotlib, Seaborn, or Plotly are utilized to create visual representations that aid in understanding key metrics and relationships within the data.

Customer segmentation emerges as a pivotal application of data analytics in boosting profitability. By employing the KMeans clustering algorithm on the cleaned dataset, distinct clusters of data points representing varying levels of profitability are identified. Utilizing KMeans clustering on factors such as quantity, sales, and discount enables businesses to extract valuable insights, potentially leading to profit improvement.

1. DATA IMPORTING

This process involves loading external datasets into the Python environment.

# Importing datasets in Python typically involves using libraries like Pandas or NumPy
# Import the pandas library for data analysis
import pandas as pd
# Import the matplotlib.pyplot module for data visualization
import matplotlib.pyplot as plt

# Import the data from "dataset_Superstore.csv"
# and save it to an object named "ds"
ds = pd.read_csv("dataset_Superstore.csv")
ds

# Get the row and column counts
print('shape:', ds.shape)
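The import step can be tried without the coursework file. The sketch below is only an illustration: it reads a small in-memory CSV with pd.read_csv and checks its shape. The three rows and their values are made up for the example and are not taken from dataset_Superstore.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file
csv_text = """Order ID,Sales,Quantity
A-1,10.5,2
A-2,20.0,1
A-3,5.25,4
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple
print('shape:', sample.shape)
```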

GENERAL DATA INFO

i) SUMMARY INFORMATION

# Get the non-null counts and data type of each column
# (info() prints its report directly, so it is not wrapped in print())
ds.info()

ii) DESCRIPTIVE STATISTICS: to understand the variability, central tendency, and distribution of the numerical columns

ds.describe()


2. CLEANING THE DATA

In order to make sure the data is correct, comprehensive, consistent, and trustworthy for analysis, a number of processes and techniques are used in data cleaning.

Before cleaning the dataset, the columns in the data frame are renamed as specified in the coursework instructions.

ds1 = ds.rename(columns={
    'Row ID': 'row_id',
    'Order ID': 'order_id',
    'Order Date': 'order_date',
    'Ship Date': 'ship_date',
    'Ship Mode': 'ship_mode',
    'Customer ID': 'customer_id',
    'Customer Name': 'customer_name',
    'Customer_no': 'customer_no',
    'Segment': 'segment',
    'Segment_no': 'segment_no',
    'Country': 'country',
    'City': 'city',
    'State': 'state',
    'State_no': 'state_no',
    'Postal Code': 'postal_code',
    'Region': 'region',
    'Region_no': 'region_no',
    'Product ID': 'product_id',
    'Category': 'category',
    'Category_no': 'category_no',
    'Sub-Category': 'sub_category',
    'Sub-Category_no': 'sub_category_no',
    'Product Name': 'product_name',
    'Product Name_no': 'product_name_no',
    'Sales': 'sales',
    'Quantity': 'quantity',
    'Discount': 'discount',
    'Profit': 'profit',
    'Returned': 'returned',
})

i) Removing unnecessary columns

# Irrelevant columns are removed from the dataset
ds1 = ds1.drop(['customer_no', 'state_no', 'region_no', 'segment_no', 'category_no', 'sub_category_no'], axis=1)
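As an alternative to typing every label by hand, the renaming and dropping steps could be sketched with a small helper. The function to_snake_case and the demo frame below are hypothetical and not part of the coursework code. Note that dropping by the "_no" suffix would also remove product_name_no, which the explicit list above keeps.

```python
import re
import pandas as pd

def to_snake_case(name):
    """Lower-case a column label and replace spaces/hyphens with underscores."""
    return re.sub(r'[\s\-]+', '_', name.strip()).lower()

# Hypothetical empty frame with a few labels in the original style
demo = pd.DataFrame(columns=['Row ID', 'Sub-Category', 'Product Name_no', 'Sales'])

# rename() accepts a function that is applied to every column label
demo = demo.rename(columns=to_snake_case)
print(list(demo.columns))

# Drop the numeric helper columns by suffix instead of listing each one
helper_cols = [c for c in demo.columns if c.endswith('_no')]
demo = demo.drop(columns=helper_cols)
print(list(demo.columns))
```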

ii) Removing Duplicates

# Find duplicates
# The .duplicated() method flags each row that repeats an earlier row
duplicates = ds1.duplicated()
# The .sum() method counts the True values, which equals the total number of duplicate rows
num_duplicates = duplicates.sum()
print(f'Number of duplicates: {num_duplicates}')

# Use the .drop_duplicates() method to remove duplicate rows
ds1 = ds1.drop_duplicates()

# Verify that the duplicates have been removed
duplicates = ds1.duplicated()
num_duplicates = duplicates.sum()
print(f'Number of duplicates after removal: {num_duplicates}')
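The behaviour of .duplicated() and .drop_duplicates() is easier to see on a tiny frame. The rows below are invented for illustration. The subset argument shows how duplicates could also be checked on selected columns only, which the coursework code does not do.

```python
import pandas as pd

# Hypothetical order lines; the second row repeats the first exactly
toy = pd.DataFrame({
    'order_id': ['A-1', 'A-1', 'A-2', 'A-2'],
    'product_id': ['P-1', 'P-1', 'P-2', 'P-3'],
    'sales': [10.0, 10.0, 5.0, 7.5],
})

# Full-row duplicates: rows identical to an earlier row in every column
full_row_dupes = toy.duplicated().sum()
print('full-row duplicates:', full_row_dupes)

# Duplicates on a subset of columns: repeated order_id values
order_id_dupes = toy.duplicated(subset=['order_id']).sum()
print('repeated order ids:', order_id_dupes)

# Keep the first occurrence of each full-row duplicate
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```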

iii) Handling Missing Values: Dealing with Null entries

# Use the .isnull() method to check for any missing values in the dataset
# Use the .sum() method to calculate the total number of missing values in each column
print('missing values distribution:')
print(ds1.isnull().sum())
print('')

In this dataset, there are no missing values in any of the columns, as indicated by zeros across all columns. The absence of missing values suggests that the dataset is complete in terms of data entry and does not contain any null values.
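Had the check found null entries, they would need to be filled or dropped. The following is a minimal sketch on made-up data, assuming a median fill for numeric columns and a mode fill for categorical ones. These choices are illustrative and were not needed for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
toy = pd.DataFrame({
    'sales': [10.0, np.nan, 30.0, 50.0],
    'segment': ['Consumer', 'Corporate', None, 'Consumer'],
})

print(toy.isnull().sum())

# Numeric column: fill with the median; categorical column: fill with the mode
filled = toy.copy()
filled['sales'] = filled['sales'].fillna(filled['sales'].median())
filled['segment'] = filled['segment'].fillna(filled['segment'].mode()[0])

# Alternative: drop every row that contains at least one null
dropped = toy.dropna()

print(filled)
print(dropped.shape)
```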

iv) Standardizing formats

The date columns in the dataframe ds1 are converted to datetime objects. This allows easier manipulation, comparison, and analysis of dates and times, and ensures that date formats are consistent across the dataset.

ds1['order_date'] = pd.to_datetime(ds1['order_date'])
ds1['ship_date'] = pd.to_datetime(ds1['ship_date'])
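When the date layout is known, passing it explicitly avoids day/month ambiguity. The sketch below assumes a month/day/year layout purely for illustration, since the actual layout of the coursework file is not stated here. It also shows errors='coerce', which turns unparseable entries into NaT so they can be counted.

```python
import pandas as pd

# Hypothetical raw strings; the last one is deliberately invalid
raw = pd.Series(['11/08/2016', '06/12/2016', 'not a date'])

# An explicit format removes ambiguity; errors='coerce' yields NaT instead of raising
parsed = pd.to_datetime(raw, format='%m/%d/%Y', errors='coerce')

print(parsed)
print('unparseable entries:', parsed.isna().sum())
```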

v) Handling Outliers

Since outliers can have a substantial impact on statistical measures and model performance, identifying them in a dataset is an essential step in data analysis. To find data points that significantly depart from the central tendency, we are looking at the summary statistics for each numerical column, such as the mean, median, standard deviation, and percentiles.

ds1.describe()


Interquartile Range - IQR method

Calculating the IQR (the range between the 75th and 25th percentiles) helps identify outliers based on how far they lie from the middle 50% of the data. Data points that fall more than 1.5 times the IQR below the 25th percentile or above the 75th percentile are treated as outliers.

# Identify outliers using the IQR method
def find_outliers_iqr(data, columns):
    outliers = pd.DataFrame()
    for column in columns:
        Q1 = data[column].quantile(0.25)
        Q3 = data[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Boolean mask: True where the value lies outside the bounds
        outliers_column = (data[column] < lower_bound) | (data[column] > upper_bound)
        outliers = pd.concat([outliers, outliers_column], axis=1)
    outliers.columns = columns
    return outliers

# Outlier identification
columns_to_check = ['quantity', 'sales', 'discount', 'profit']
outliers = find_outliers_iqr(ds1, columns_to_check)

# Display the count of outliers detected in each specified column
for column in columns_to_check:
    print(f"Number of outliers in '{column}' column: {outliers[column].sum()}")
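The IQR rule can be traced by hand on a few numbers. In the illustrative series below (values chosen for the example, not taken from the dataset), the quartiles are 2 and 4, so the IQR is 2, the bounds are -1 and 7, and only the value 100 falls outside them.

```python
import pandas as pd

# Hypothetical values with one obvious extreme
values = pd.Series([1, 2, 3, 4, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# True where a value lies outside [lower_bound, upper_bound]
is_outlier = (values < lower_bound) | (values > upper_bound)

print(q1, q3, iqr, lower_bound, upper_bound)
print(values[is_outlier].tolist())
```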

Removing Outliers

# Remove every row flagged as an outlier in at least one checked column
# (.copy() gives an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning)
ds1_final = ds1[~outliers.any(axis=1)].copy()

# Display the number of rows that have been eliminated
print(f"number of rows removed as outliers: {len(ds1) - len(ds1_final)}")

# Print the shape of the dataframe after removing outliers
print(f"shape of the dataframe without outliers: {ds1_final.shape}")

Cleaned dataset

ds1_final.head()
ds1_final

Basic insights from cleaned dataset

# Number of order lines per state, sorted in ascending order
# (the state with the most order lines is listed last)
count_state = ds1_final.groupby('state').size()
print(count_state.sort_values())

# Which state accounted for the highest profit
state_profit = ds1_final[['state', 'profit']].groupby('state').sum()
print(state_profit.sort_values(by='profit'))

# Which region accounted for the highest profit
region_profit = ds1_final[['region', 'profit']].groupby('region').sum()
print(region_profit.sort_values(by='profit'))

# Which product category is sold in the largest quantity
sold_product = ds1_final[['category', 'quantity']].groupby('category').sum()
# Sort by quantity so that the best-selling category is listed last
print(sold_product.sort_values(by='quantity'))

# Which category brings the highest average profit per order line
product_avgprofit = ds1_final[['category', 'profit']].groupby('category').mean()
print(product_avgprofit.sort_values(by='profit'))

# Which product category is returned the most
print(ds1_final['returned'].describe())
# Number of order lines per category
print(ds1_final.groupby('category').size().sort_values())
# Number of returned order lines per category
return_prod = ds1_final[['category', 'returned']].groupby('category').sum()
print(return_prod.sort_values(by='returned'))
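Raw return counts favour whichever category has the most order lines, so a return rate can be a fairer comparison. The sketch below uses invented rows and hypothetical output column names (returned_count, order_lines, return_rate) to show one way this could be computed. It is an addition for illustration, not part of the coursework results.

```python
import pandas as pd

# Hypothetical order lines with a boolean 'returned' flag
toy = pd.DataFrame({
    'category': ['Furniture', 'Furniture', 'Technology',
                 'Technology', 'Technology', 'Office Supplies'],
    'returned': [True, False, True, False, False, False],
})

# Named aggregation: count of returns, number of lines, and share returned
summary = toy.groupby('category')['returned'].agg(
    returned_count='sum',
    order_lines='size',
    return_rate='mean',
)

print(summary.sort_values(by='return_rate', ascending=False))
```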

3. EXPLORATORY DATA ANALYSIS

A. Exploring data statistically

i) NUMERICAL VARIABLES

Before determining the summary statistics of the cleaned data, the boolean values in the "returned" column are converted into integers.

# Convert boolean values to integers (1 for True, 0 for False)
ds1_final['returned'] = ds1_final['returned'].astype(int)

# Print the first rows of the updated DataFrame to verify the changes
print(ds1_final.head())

Summary statistics of 'sales', 'quantity', 'discount', 'profit', 'returned'

# Determine summary statistics of sales, quantity, discount, profit and returned
numerical_col = ['sales', 'quantity', 'discount', 'profit', 'returned']
numerical_summary = ds1_final[numerical_col].describe()
print(numerical_summary)

Skewness measures the asymmetry of a distribution around its mean. Here it is judged by comparing the mean with the median in the summary table rather than by computing a skewness coefficient. The distribution of sales is right-skewed, as indicated by the mean being higher than the median (50th percentile). The distribution of quantity appears to be relatively symmetric, with a median equal to the mean. Approximately 8.10% of transactions involve returned items, as indicated by the mean of the "returned" column.
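A skewness coefficient can also be computed directly with the pandas .skew() method. The values below are invented to mimic a right-skewed variable: one large value pulls the mean above the median, and the coefficient comes out positive.

```python
import pandas as pd

# Hypothetical right-skewed values: one very large observation
right_skewed = pd.Series([10, 12, 15, 18, 20, 25, 400])

mean_value = right_skewed.mean()
median_value = right_skewed.median()
skewness = right_skewed.skew()

print('mean:', mean_value)
print('median:', median_value)
print('skewness:', skewness)
```

On the coursework data, the same call could be applied to the numerical columns, for example ds1_final[numerical_col].skew().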

Summary statistics of individual category

# Determine the summary statistics individually for each category
# The groupby() method groups the rows by category before describe() is applied
category_summary = ds1_final.groupby('category')[numerical_col].describe()

# Determine the summary statistics individually for each sub-category
# (describe() is required here; without it only the groupby object is printed)
sub_category_summary = ds1_final.groupby('sub_category')[['sales', 'quantity', 'discount', 'profit', 'returned']].describe()

# Print summary statistics
print('summary statistics for each category:')
print(category_summary)
print(' ')
print('\n summary statistics for each sub-category:')
print(sub_category_summary)

Furniture has the highest mean sales amount per transaction. However, Furniture also exhibits the highest variability in sales, as indicated by the highest standard deviation. The mean quantity per transaction is relatively similar across all categories, ranging from approximately 3.13 to 3.52 items. There is minimal variability in quantity within each category, as indicated by the comparable standard deviations. Furniture has the highest mean profit per transaction. The variability in profit is evident, with Furniture and Technology exhibiting higher standard deviations compared to Office Supplies.

ii) CATEGORICAL VARIABLES

Categorical variables are non-numeric data types that represent categories or labels.

# Select the categorical columns
categorical_columns = ds1_final.select_dtypes(include=['object'])

# Compute the summary statistics
summary_stats_categorical = categorical_columns.describe()

# Print the summary statistics
print(summary_stats_categorical)

There are a total of 7,140 rows in the dataset, each representing one order line. The dataset contains 4,184 unique order IDs, indicating that some orders contain multiple items. The most frequent customer segment is "Consumer". All orders in the dataset are from the United States. "New York City" is the most frequent city, appearing 709 times, and "California" is the most frequent state, appearing 1,606 times.

iii) DATE VARIABLES

# Extract date components from order_date
ds1_final['order_year'] = ds1_final['order_date'].dt.year
ds1_final['order_month'] = ds1_final['order_date'].dt.month
ds1_final['order_day'] = ds1_final['order_date'].dt.day
ds1_final['order_dayofweek'] = ds1_final['order_date'].dt.dayofweek

# Extract date components from ship_date
ds1_final['ship_year'] = ds1_final['ship_date'].dt.year
ds1_final['ship_month'] = ds1_final['ship_date'].dt.month
ds1_final['ship_day'] = ds1_final['ship_date'].dt.day
ds1_final['ship_dayofweek'] = ds1_final['ship_date'].dt.dayofweek

# Create summary dataframes for order and ship dates
# (describe() is applied so that statistics, not the raw rows, are printed)
order_date_summary = ds1_final[['order_year', 'order_month', 'order_day', 'order_dayofweek']].describe()
ship_date_summary = ds1_final[['ship_year', 'ship_month', 'ship_day', 'ship_dayofweek']].describe()

# Print summary statistics for order and ship dates
print('order date summary statistics:')
print(order_date_summary)
print('\nship date summary statistics:')
print(ship_date_summary)

B. Exploring Data Visually

i) PLOTTING BAR CHART - CATEGORICAL

import matplotlib.pyplot as plt

# Pivot the DataFrame so that each region becomes a separate column
orders_by_region_category_pivot = ds1_final.pivot_table(index='category', columns='region', values='order_id', aggfunc='nunique')

# Draw a stacked bar chart
# (figsize is passed to plot() because DataFrame.plot creates its own figure;
# a separate plt.figure() call would only leave an empty figure behind)
orders_by_region_category_pivot.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.xlabel('Product Category')
plt.ylabel('Number of Orders')
plt.title('Regional Distribution of Orders Across Product Categories')
plt.legend(title='Region')  # Add a legend for regions
plt.show()

The above figure is a stacked bar chart showing the distribution of orders across different product categories for each region. Each bar represents a product category, and each segment within the bar represents the number of orders in a specific region. Office Supplies is the most popular product category across all regions and Technology is the least popular category. There seems to be some regional variation in demand. For example, orders for Furniture appear to be higher in the East and West regions compared to the Central and South regions. This could be due to factors such as the demographics of the population in each region.

import matplotlib.pyplot as plt

# Calculate total sales by category and subcategory
# Note: this cell uses ds1 (the data before outlier removal), not ds1_final
category_sales = ds1.groupby(['category', 'sub_category'])['sales'].sum()

# Plot the data; a wider figure accommodates the legend
# (figsize is passed to plot() because DataFrame.plot creates its own figure)
category_sales.unstack().plot(kind='bar', stacked=True, figsize=(12, 6))
plt.xlabel('Product Category')
plt.ylabel('Total Sales')
plt.title('Total Sales by Product Category and Subcategory')

# Move the legend to the right of the plot
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))

# Show the plotted data
plt.show()

Office Supplies is the highest-selling product category, with total sales exceeding 700,000. This suggests that office supplies are a major driver of sales for the company.

import matplotlib.pyplot as plt

# Group the data and total profit and sales for each category/subcategory pair
category_subcategory_totals = ds1_final.groupby(['category', 'sub_category']).agg({'profit': 'sum', 'sales': 'sum'})

# Plot a stacked bar chart
category_subcategory_totals.plot(kind='bar', stacked=True)
plt.xlabel('Product Category and Subcategory')
plt.ylabel('Total Profit and Sales')
plt.title('Total Profit and Sales by Product Category and Subcategory')

# Adjust the figure size: width 9 inches, height 4 inches
plt.gcf().set_size_inches(9, 4)

# Show the plot
plt.show()

Envelopes, labels, and storage containers appear to be top subcategories within office supplies based on their profit slice. Appliances and machines appear to have among the lowest profit margins based on the graph. Some subcategories, like 'Office Supplies - Storage,' show very high sales but comparatively lower profit, indicating that this subcategory might have lower profit margins. Across most subcategories, sales figures are significantly higher than profit figures. This suggests that while revenue from sales is strong, the profit margins vary greatly, possibly due to different cost structures or pricing strategies for each subcategory. 'Technology - Phones' stands out as the subcategory with the highest profit as well as high sales, indicating it is a strong performer.
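Reading margins off a stacked bar chart is approximate. A profit margin (total profit divided by total sales) per subcategory would make the comparison explicit. The sketch below uses invented figures and a hypothetical profit_margin column to show the calculation. It does not reproduce the coursework results.

```python
import pandas as pd

# Hypothetical order lines; the numbers are for illustration only
toy = pd.DataFrame({
    'category': ['Office Supplies', 'Office Supplies', 'Technology', 'Technology'],
    'sub_category': ['Storage', 'Labels', 'Phones', 'Phones'],
    'sales': [200.0, 50.0, 300.0, 100.0],
    'profit': [10.0, 20.0, 60.0, 20.0],
})

# Total sales and profit per category/subcategory pair
totals = toy.groupby(['category', 'sub_category'])[['sales', 'profit']].sum()

# Profit margin = total profit / total sales
totals['profit_margin'] = totals['profit'] / totals['sales']

print(totals.sort_values(by='profit_margin'))
```

In this toy example Storage has the highest sales within Office Supplies but the lowest margin, which is the pattern the paragraph above describes.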

ii) PLOTTING SCATTER CHART - CONTINUOUS

import matplotlib.pyplot as plt

# Define start and end dates
start_date = '2016-09-05'
end_date = '2017-01-21'

# Create a mask to filter data within the specified date range
mask = (ds1_final['order_date'] >= start_date) & (ds1_final['order_date'] <= end_date)

# Subset the DataFrame based on the mask
subset = ds1_final.loc[mask]

# Create a scatter plot; point size and color both encode quantity
plt.scatter(subset['sales'], subset['profit'], s=subset['quantity']*10, c=subset['quantity'], cmap='inferno')
plt.xlabel('Sales')  # Add x-axis label
plt.ylabel('Profit')  # Add y-axis label
plt.title('Scatter Plot of Sales vs Profit with Quantity')
plt.colorbar(label='Quantity')  # Add color bar with label
plt.show()

The scatter plot shows the relationship between 'sales' and 'profit' for orders placed within the specified date range (2016-09-05 to 2017-01-21), with the size and color of each point representing the 'quantity' of items sold. There appears to be a positive correlation between sales and profit: as sales increase, profit also tends to increase. However, the relationship is not perfectly linear, indicating that other factors might affect profitability.

Despite the overall positive trend, there are notable variations in profit at different levels of sales. For example, some data points show high profit at moderate sales levels, while others show low or even negative profit at the same sales level. This suggests that sales volume alone is not the sole determinant of profit. Higher quantities (warmer colors) are mostly concentrated in the middle range of sales and profit, indicating that selling larger quantities does not necessarily correlate with the highest sales or profit. There are occurrences of negative profit (losses) across various sales levels. These could be due to high production or operational costs, discounts, or returns that are not compensated by the sales revenue. The top right section of the graph, where both sales and profit are high, is sparsely populated, indicating that while high profitability is achievable, it occurs less frequently.
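The visual impression of a positive but imperfect relationship could be backed by a correlation coefficient. The sketch below computes a Pearson correlation with Series.corr on invented values, including one loss-making row. It illustrates the calculation only and says nothing about the actual strength of the relationship in the coursework data.

```python
import pandas as pd

# Hypothetical sales/profit pairs, including one loss-making transaction
toy = pd.DataFrame({
    'sales': [20.0, 50.0, 80.0, 120.0, 200.0],
    'profit': [5.0, 8.0, -4.0, 30.0, 45.0],
})

# Series.corr defaults to the Pearson correlation coefficient
r = toy['sales'].corr(toy['profit'])

print(f'Pearson correlation: {r:.2f}')
```

On the filtered data the equivalent call would be subset['sales'].corr(subset['profit']).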

4. MODELLING STRATEGY

K-means Clustering

K-means clustering is an effective unsupervised machine learning technique for grouping related data points into a fixed number of groups, or clusters. Deciding on the right number of clusters (K) for the given dataset is one of the most important tasks in the K-means clustering process. The Elbow Method is one approach that can be used for this.

import pandas as pd
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Keep only the numeric columns of ds1_final for clustering
ds1_final_numeric = ds1_final.select_dtypes(include=[np.number])

# Calculate the inertia, or within-cluster sum of squares (WCSS), for different values of k
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init="k-means++", random_state=42)
    kmeans.fit(ds1_final_numeric)
    wcss.append(kmeans.inertia_)

# Plot the elbow method graph
plt.figure(figsize=(12, 6))
plt.grid()
plt.plot(range(1, 11), wcss, linewidth=2, color="blue", marker="8")
plt.xlabel("K Value")
plt.xticks(np.arange(1, 11, 1))
plt.ylabel("WCSS")
plt.title('Elbow Method for Optimal K')
plt.show()

Visually inspect the elbow curve and identify the point where the decrease in WCSS slows down (i.e., where the curve starts to bend like an elbow). This point indicates a suitable number of clusters. For this dataset, any value between 3 and 10 is treated as acceptable, and 8 clusters are used in the next step.
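K-means relies on distances, so columns with large numeric ranges dominate the result unless the features are put on a common scale. The sketch below is a hedged illustration on synthetic data, not the coursework pipeline. It restricts the features to quantity, sales, and discount (the factors named in the introduction), standardises them, and computes the WCSS for several values of k. The three synthetic groups and the explicit n_init=10 are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data: three well-separated groups (assumed for illustration)
rng = np.random.default_rng(42)
centers = [(2, 50, 0.0), (5, 300, 0.2), (8, 900, 0.5)]
frames = []
for quantity, sales, discount in centers:
    frames.append(pd.DataFrame({
        'quantity': rng.normal(quantity, 0.5, 50),
        'sales': rng.normal(sales, 20, 50),
        'discount': rng.normal(discount, 0.02, 50),
    }))
features = pd.concat(frames, ignore_index=True)

# Standardise each column to mean 0 and standard deviation 1
scaled = (features - features.mean()) / features.std()

# WCSS (inertia) for k = 1..6; the drop flattens after the true group count
wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(scaled)
    wcss.append(km.inertia_)

print([round(w, 1) for w in wcss])
```

With three generated groups, the WCSS falls steeply up to k = 3 and changes little afterwards, which is the elbow shape described above.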

# Create a KMeans object called km with 8 clusters
km = KMeans(n_clusters=8, random_state=42)
# Fit the model and assign a cluster label to every row by using fit_predict
clusters = km.fit_predict(ds1_final_numeric)

# Assign clusters to the dataset
ds1_final_numeric["label"] = clusters

# Create a 3D scatter plot
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(111, projection='3d')

# KMeans labels run from 0 to 7, so every cluster is plotted in its own color
colors = ['blue', 'red', 'green', 'orange', 'purple', 'black', 'brown', 'cyan']
for label, color in enumerate(colors):
    cluster_points = ds1_final_numeric[ds1_final_numeric.label == label]
    ax.scatter(cluster_points['profit'], cluster_points['quantity'], cluster_points['sales'], c=color, s=60, label=f'Cluster {label}')
ax.legend(title='Cluster')

# Set the viewing angle
ax.view_init(40, 100)

# Set labels and title
ax.set_xlabel("Profit")
ax.set_ylabel("Quantity")
ax.set_zlabel('Sales')
ax.set_title('Clustering of Profit, Quantity, and Sales')

# Show the plot
plt.show()

The Profit axis spans from approximately -40 to 60. There are data points that show negative profit, which could be of particular interest because they represent loss-making sales. The concentration of points above the zero line on the Profit axis implies that the majority of transactions were profitable. The Quantity axis ranges from 0 to about 9, and there is a dense cluster of points between the lower ranges of Quantity, Sales, and Profit. This could suggest that most transactions involve small quantities.
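A numeric profile of each cluster makes it easier to spot low-profit or loss-making groups than a 3D plot alone. The sketch below shows one way this could be done, using invented rows and hypothetical cluster labels. On the coursework data the same groupby could be applied to ds1_final_numeric after the "label" column has been assigned.

```python
import pandas as pd

# Hypothetical rows with cluster assignments in 'label'
toy = pd.DataFrame({
    'profit':   [-12.0, -8.0, 5.0, 7.0, 40.0, 55.0],
    'quantity': [6, 7, 2, 3, 4, 5],
    'sales':    [90.0, 110.0, 30.0, 45.0, 200.0, 260.0],
    'label':    [0, 0, 1, 1, 2, 2],
})

# Mean profit, quantity and sales per cluster, plus the cluster size
profile = toy.groupby('label')[['profit', 'quantity', 'sales']].mean()
profile['n_rows'] = toy.groupby('label').size()
print(profile)

# Clusters whose average profit is negative
loss_making = profile.index[profile['profit'] < 0].tolist()
print('loss-making clusters:', loss_making)
```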

Recommendations

Examine Low-Profit and Loss-Making Sales: Look closely at the clusters where profit is low or negative. Determine if these losses are due to pricing, high costs, or other factors. Consider discontinuing products or services that consistently result in losses or re-evaluating the pricing strategy.

Optimize Product Bundling: If certain clusters represent sales with high quantity but low profitability, consider whether bundling these products with higher-margin items could improve profits. This could encourage customers to purchase more profitable items alongside bulk products.

Customer Segmentation: Use the cluster information to segment customers according to their purchasing behavior. Tailor marketing campaigns and sales strategies to each segment to improve customer satisfaction and increase sales.

Inventory Management: Analyze clusters to manage inventory more effectively. Products that result in high sales and profit might need more inventory space, whereas low-selling, low-profit products may need to have reduced stock levels.

Cost Reduction Strategies: For clusters that have a high quantity of sales but lower profit margins, look into cost reduction strategies such as negotiating better terms with suppliers, reducing production costs, or finding more efficient distribution methods.

Price Adjustment: Reassess the pricing strategy for clusters that show high sales volume but low profit, as there may be room for a price increase without significantly affecting sales volume. Conversely, for clusters with low sales volume, consider whether a price reduction could stimulate demand.

Optimizing Pricing Strategies: Clustering customers based on the discounts they receive can help businesses understand the price sensitivity of different customer segments. By analyzing clusters with high discount rates, businesses can determine whether discounts are effectively driving sales or if they are eroding profit margins. Adjustments to pricing strategies can then be made to maximize profitability while maintaining customer satisfaction.

Product Development: Use insights from the clusters to inform product development. If a cluster with high profit margins is identified, consider developing similar or complementary products to capitalize on this success.

Analyze External Factors: Consider external factors that might affect certain clusters, such as seasonality, market trends, or economic conditions. Adjust business strategies to anticipate and respond to these factors.
