C11BD - Coursework 2
H00435896
Introduction
In today's data-driven business world, companies increasingly rely on data analytics to gain valuable insights and make informed decisions. This report focuses on the application of big data analytics to enhance profitability for the company. The dataset provided, Superstore.csv, contains detailed information about customer orders, products, and sales. The objective of this analysis is to utilize data cleaning, descriptive statistics, data visualization, and modelling techniques to identify factors that significantly impact the company's profitability. By uncovering patterns, trends, and relationships within the data, the report aims to provide valuable insights that can improve the company's financial performance.
Methodology
1. Data Import and Understanding
The first step in our methodology is to import the provided dataset, dataset_Superstore.csv, into our Python environment. We utilize the pandas library to read the CSV file and store it as a DataFrame for further analysis. This step also includes gaining a thorough understanding of the dataset's structure, variables, and their descriptions using appropriate techniques.
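A minimal sketch of this step (assuming the CSV sits in the working directory; the later sketches in this report reuse the same df):

import pandas as pd

# Load the Superstore data into a DataFrame
df = pd.read_csv('dataset_Superstore.csv')

# Inspect the structure: column names, dtypes and non-null counts
df.info()
print(df.head())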
2. Data Cleaning
Data cleaning is essential to ensure the accuracy and reliability of our analysis. In this step, we identify and address any data entry errors, missing values, or outliers present in the dataset. Techniques such as removing duplicate records and handling missing values are applied according to the specific requirements of the analysis (Bennihi, Zirari and Medjahed, 2022).
Checking for Missing Values and Handling Them
As the results show, there are no missing values in any column of the dataset.
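A sketch of the check; the commented lines show how missing values could be handled had any been found (the fill strategy is illustrative):

# Count missing values per column; every count was zero for this dataset
print(df.isnull().sum())

# Had any been present, rows could be dropped or values imputed, e.g.:
# df = df.dropna()
# df['Sales'] = df['Sales'].fillna(df['Sales'].median())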
Checking for Duplicates
We then checked for duplicated records and found none.
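A sketch of the duplicate check:

# Count fully duplicated rows; the result was zero for this dataset
print(df.duplicated().sum())

# If duplicates existed, they could be removed with:
# df = df.drop_duplicates()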
Converting Columns to the Appropriate Formats
Converting the Returned column from Boolean to integer for further analysis. A sketch covering both conversions follows.
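This assumes the date columns are named 'Order Date' and 'Ship Date', as in the standard Superstore schema:

# Parse the date columns so that date arithmetic (e.g. delivery time) works
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Ship Date'] = pd.to_datetime(df['Ship Date'])

# Cast the Boolean 'Returned' flag to 0/1 for numerical analysis
df['Returned'] = df['Returned'].astype(int)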
Summarizing the statistics of the cleaned data
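For example:

# Summary statistics (count, mean, std, quartiles) of the cleaned data
print(df.describe())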
3. Exploratory Data Analysis
Finding the percentage of transactions that are profit-making and loss-making.
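A sketch of the computation, assuming the profit column is named 'Profit':

# Share of transactions that make a profit vs a loss
profit_pct = (df['Profit'] > 0).mean() * 100
loss_pct = (df['Profit'] < 0).mean() * 100
print(f'Profitable transactions: {profit_pct:.1f}%')
print(f'Loss-making transactions: {loss_pct:.1f}%')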
Profit/Loss on each product
Finding the percentage of products that are loss-making. A sketch covering both product-level computations follows.
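This assumes a 'Product Name' column identifies each product:

# Total profit or loss for each product
product_profit = df.groupby('Product Name')['Profit'].sum()
print(product_profit.sort_values())

# Share of products whose overall profit is negative
loss_making_pct = (product_profit < 0).mean() * 100
print(f'Loss-making products: {loss_making_pct:.1f}%')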
Visualizing the correlations between all the relevant columns using a correlation matrix.
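A sketch using a seaborn heatmap over the numerical columns:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numerical columns, rendered as a heatmap
corr = df.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation matrix')
plt.show()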
From this visualization we identified the most preferred ship mode and the distribution of orders across the different regions, segments, and categories.
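A sketch of how such distribution charts can be produced (the column names assume the standard Superstore schema):

# Frequency of each category in four categorical columns
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.flat, ['Ship Mode', 'Region', 'Segment', 'Category']):
    df[col].value_counts().plot(kind='bar', ax=ax, title=col)
plt.tight_layout()
plt.show()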
From this step we identified the best-performing cities in terms of order frequency and average sales; it also highlights the cities with the highest number of returns.
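A sketch of the underlying city-level aggregation (the 'Order ID' column name is an assumption):

# Order frequency, average sales and returns per city
city_stats = df.groupby('City').agg(
    orders=('Order ID', 'count'),
    avg_sales=('Sales', 'mean'),
    returns=('Returned', 'sum'),
)
print(city_stats.sort_values('orders', ascending=False).head(10))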
From the scatter plots we inferred that profit and sales generally have a positive correlation, profit and discount a negative correlation, and profit and quantity a positive correlation.
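A sketch of the three scatter plots:

# Profit against sales, discount and quantity
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ['Sales', 'Discount', 'Quantity']):
    ax.scatter(df[col], df['Profit'], alpha=0.3)
    ax.set_xlabel(col)
    ax.set_ylabel('Profit')
plt.tight_layout()
plt.show()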
In this step we first engineered a new feature called Delivery Time, the number of days taken to deliver a product, and then computed summary statistics such as the mean delivery time, which was almost four days.
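A sketch of the feature engineering (this relies on the date columns parsed earlier):

# Days between order and shipment as a new feature
df['Delivery Time'] = (df['Ship Date'] - df['Order Date']).dt.days
print(df['Delivery Time'].describe())  # the mean was almost four days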
This plot focuses on average sales, number of returns, and the ship modes used per region. It can help us focus on the most profitable regions, while ship-mode usage can show whether some shipping options could be removed from a particular region if they are rarely used, simplifying the operational and logistical processes for that region.
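A sketch of the region-level view:

# Average sales and total returns per region
region_stats = df.groupby('Region').agg(
    avg_sales=('Sales', 'mean'),
    returns=('Returned', 'sum'),
)
print(region_stats)

# How often each ship mode is used in each region
print(pd.crosstab(df['Region'], df['Ship Mode']))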
Top-selling products by different criteria, along with the products most frequently returned. Focusing on the top-selling products can help increase sales, while focusing on the most returned products helps minimize returns and avoid further losses, since every returned product incurs logistics costs.
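A sketch of these rankings:

# Top sellers by revenue and by quantity, and the most-returned products
print(df.groupby('Product Name')['Sales'].sum().nlargest(10))
print(df.groupby('Product Name')['Quantity'].sum().nlargest(10))
print(df.groupby('Product Name')['Returned'].sum().nlargest(10))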
Most profitable category, sub-category, segment and region.
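A sketch of the profitability breakdown across these dimensions:

# Total profit by category, sub-category, segment and region
for col in ['Category', 'Sub-Category', 'Segment', 'Region']:
    print(df.groupby(col)['Profit'].sum().sort_values(ascending=False))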
4. Outlier Detection and Handling
Dropping Columns
Here we drop columns that are non-numerical, redundant, or not required. Many models don't work with non-numerical data, and we would otherwise have to perform label encoding; in this case, however, the categorical columns already have numerical substitutes that can be used instead. For example, we already have a numerical column called Segment_no, so we remove the Segment column, as both represent the same thing.
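One way to sketch this step: selecting the numerical columns drops the redundant categorical ones in a single call ('Postal Code' is given as an example of a numeric but unneeded column):

# Keep only the numerical columns for modelling; the redundant categorical
# columns (Segment, State, Category, Sub-Category) have numerical substitutes
df_model = df.select_dtypes('number').drop(columns=['Postal Code'], errors='ignore')
print(df_model.columns.tolist())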
Plotting Box-and-Whisker Plots to Detect Outliers
We use box-and-whisker plots to detect and analyze the outliers. A box-and-whisker plot provides a clear and concise summary of the data distribution, including the presence of outliers, the range, quartiles, and median, thus allowing for an easier analysis (Iglewicz, 2011).
Outliers in this case are data points that fall outside the upper or lower whiskers of the box plot. The whiskers extend to the most extreme data points within 1.5 times the interquartile range (IQR) of the lower and upper quartiles; data points that fall beyond this range are considered outliers (Dawson, 2011).
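A sketch of the plots:

# One box plot per numerical feature to surface outliers
df_model[['Profit', 'Sales', 'Discount', 'Quantity']].plot(
    kind='box', subplots=True, layout=(2, 2), figsize=(10, 8))
plt.tight_layout()
plt.show()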
Examining the box plots reveals that there are outliers for Profit, Sales, Discount, and Quantity. This means there are data points for these features that fall outside the typical range represented by the box and whiskers: instances of unusually high or low profits compared to the majority, exceptionally high or low sales figures, discounts that are significantly larger or smaller than usual, and quantities that deviate significantly from the rest of the data.
Note: Categorical features like Segment_no, State_no, Category_no, and Sub-Category_no usually have a limited number of categories and wouldn't necessarily show outliers in the same way.
Outlier Handling Using the Interquartile Range (IQR)
The IQR measures how spread out the middle half of the data is, without being affected by extreme values that might distort the overall range. Imagine the data points ordered from least to greatest: the IQR is the difference between the value at the 75th percentile (Q3) and the value at the 25th percentile (Q1), so it shows how much variation there is within the central 50% of the data set. The IQR therefore gives an idea of how tightly clustered the middle values are, allowing us to detect and handle the outliers.
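A sketch of the filtering step (this assumes df_model contains only numerical columns, as produced above):

# IQR-based outlier removal across all numerical features
Q1 = df_model.quantile(0.25)
Q3 = df_model.quantile(0.75)
IQR = Q3 - Q1

# Keep rows where no feature falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
mask = ~((df_model < Q1 - 1.5 * IQR) | (df_model > Q3 + 1.5 * IQR)).any(axis=1)
df_clean = df_model[mask].copy()
print(df_model.shape, '->', df_clean.shape)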
The above code first calculates the IQR as the difference between Q3 and Q1, then checks for any values that fall outside the IQR-based range, i.e. the outliers, and finally removes them. We also checked the shape and size of the data before and after outlier handling: the new data frame has 7140 rows, which is 2854 fewer than the original, indicating that any potential outliers have been handled (Vinutha, Poornima and Sagar, 2018).
5. Modelling
To holistically analyze the Superstore data, we utilize three machine learning models. Linear regression tackles continuous relationships, Random Forest captures complex patterns and handles classification, and K-Means clustering finds hidden structure by grouping similar data points. This multi-model approach offers a rich understanding of the data, supporting sales prediction, customer segmentation, and retention, thus increasing the profit (Wasserbacher and Spindler, 2021).
Linear Regression
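A sketch of the regression, assuming Sales is the target and the remaining numerical columns are the features (the split ratio and random seed are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error)

X = df_clean.drop(columns=['Sales'])
y = df_clean['Sales']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

lr = LinearRegression().fit(X_train, y_train)
print('Train R^2:', lr.score(X_train, y_train))
print('Test R^2:', lr.score(X_test, y_test))

# Error metrics on the held-out test set
y_pred = lr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print('MSE:', mse, 'RMSE:', np.sqrt(mse))
print('MAE:', mean_absolute_error(y_test, y_pred))
print('MAPE:', mean_absolute_percentage_error(y_test, y_pred))

# Direction and strength of each feature's relationship with Sales
print(pd.Series(lr.coef_, index=X.columns))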
Result:
Model Performance:
The training set score (0.3798) and test set score (0.3695) are R-squared values indicating a moderate fit, explaining around 37-38% of the variance in sales for both the training and test data. The regression score (0.3695) is the same as the test set score. The Mean Squared Error (5208.96) and Root Mean Squared Error (72.17) show the average squared difference, and its square root, between predicted and actual sales values; lower values usually indicate a better fit. The Mean Absolute Percentage Error (1.84%) and Mean Absolute Error (48.00) represent the average absolute difference between predicted and actual sales, with and without percentage scaling; lower values suggest better prediction accuracy.
Coefficients: the coefficient of a feature tells you the direction and strength of its relationship with the target variable; a positive coefficient represents a positive relationship, and vice versa.
Conclusion: while the model seems to capture some trends in the data based on the R-squared and Mean Absolute Percentage Error, the overall fit might not be entirely sufficient.
Random Forest Classification
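A sketch of the profit-category model; the cut points used to form the three classes are illustrative assumptions, as the report does not state the exact thresholds:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Bucket each order's profit into Loss / None-Small Profit / Profit
profit_cat = pd.cut(df_clean['Profit'], bins=[-np.inf, 0, 50, np.inf],
                    labels=['Loss', 'None/Small Profit', 'Profit'])

X = df_clean.drop(columns=['Profit'])
X_train, X_test, y_train, y_test = train_test_split(
    X, profit_cat, test_size=0.2, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))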
Adding the Customer ID column back to the DataFrame
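A sketch of the retention model. How "return frequency" was labelled is not shown in the report, so the target below (whether a customer's total returns exceed the median) is a hypothetical stand-in:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Restore Customer ID via the row index preserved through the cleaning steps
df_clean['Customer ID'] = df.loc[df_clean.index, 'Customer ID']

# Hypothetical target: does the customer return more often than the median?
returns_per_customer = df.groupby('Customer ID')['Returned'].sum()
y = (df_clean['Customer ID'].map(returns_per_customer)
     > returns_per_customer.median()).astype(int)

X = df_clean.drop(columns=['Customer ID'])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))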
Result
We used Random Forest models for two distinct tasks: profit category prediction and customer retention prediction.
Model 1: Profit Category Prediction
This model aimed to classify each order's profit into Profit, Loss, or None/Small Profit categories, in order to identify highly profitable orders.
Performance: The model achieved an accuracy of 93%, indicating a high success rate in correctly classifying profit categories. Precision and recall were also impressive at around 93%, signifying a low rate of both false positives (e.g. incorrectly flagging an order as a loss) and false negatives (missing actual losses).
Model 2: Customer Retention Prediction
This model aimed to predict customer return frequency, potentially aiding in customer retention efforts.
Performance: The model's accuracy was moderate at 64%. While recall (0.64) suggests it identifies a good proportion of returning customers, the low precision (0.42) indicates a high rate of false positives: the model often predicts that customers will return frequently when they will not.
K-Means Clustering
Adding the Customer Name column back to the DataFrame
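A sketch of the segmentation; the customer-level features chosen here are illustrative:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Aggregate order data to one row per customer (feature choice is illustrative)
cust = df.groupby('Customer Name').agg(
    total_sales=('Sales', 'sum'),
    total_profit=('Profit', 'sum'),
    orders=('Order ID', 'nunique'),
    returns=('Returned', 'sum'),
)
X = StandardScaler().fit_transform(cust)

# Elbow curve: inertia for k = 1..10 to pick the optimal k
inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
           for k in range(1, 11)]
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.show()

# Segment customers into the 5 clusters used in the report
cust['cluster'] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
print(cust['cluster'].value_counts())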
Result
Finally, we used clustering for customer segmentation. We first used the elbow curve to determine the optimal value of k and then segmented the customers into 5 categories, displaying how many customers fall into each one. This allows for targeted promotions and marketing to increase profitability.
Conclusion
This exploration of the Superstore data used a holistic approach, combining linear regression, random forest classification, and k-means clustering to gain a comprehensive understanding of the factors influencing profitability. Linear regression provided a starting point for modelling continuous relationships, while random forest helped uncover complex patterns and classifications. For instance, the profit category prediction model using Random Forest achieved impressive accuracy (93%), allowing businesses to identify potential losses early and take corrective action. K-means clustering proved valuable in revealing hidden customer segments; by grouping customers with similar buying behaviours, this approach can support targeted marketing strategies and promotions, potentially leading to increased customer lifetime value. The customer retention model using Random Forest highlighted the potential of this approach for predicting customer return frequency, a key metric for retention. Overall, this multi-model approach provides valuable insights which, when implemented effectively, can significantly enhance business profitability.