Importing the data
Here the Pandas and Matplotlib libraries are imported for data analysis and visualisation. The data is then loaded using the read_excel() function, which can read Excel workbooks in several formats (such as .xls and .xlsx).
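A minimal sketch of this step is shown below. The workbook filename is hypothetical, so a small synthetic frame stands in for the real data here:

```python
import pandas as pd
import matplotlib.pyplot as plt  # used for the plots later in the analysis

# In the notebook the data is loaded from the workbook, e.g.:
# df = pd.read_excel("superstore_sales.xlsx", sheet_name=0)  # hypothetical filename
# (reading .xlsx files also requires the openpyxl package)

# Small synthetic frame standing in for the real data in this sketch
df = pd.DataFrame({
    "Category": ["Office Supplies", "Furniture", "Technology"],
    "Sales": [120.5, 310.0, 99.9],
})
print(df.head())
```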
Cleaning the data
Checking with the isnull() function shows that the dataset contains no missing values, so no imputation or row removal is required.
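The check can be sketched as follows, using a small hypothetical frame in place of the real data:

```python
import pandas as pd

# Synthetic stand-in for the dataset (hypothetical values)
df = pd.DataFrame({
    "Sales": [120.5, 310.0, 99.9],
    "Quantity": [2, 5, 1],
})

# isnull() marks missing cells; summing gives the missing count per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

All-zero counts confirm that no column contains missing values.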
Here are the summary statistics of the data for the categorical variables. Since describe() omits categorical columns by default, 'object' is passed to its include argument to show them.
Here are the summary statistics for the numerical columns. Note that the mean and standard deviation are only meaningful for genuinely quantitative variables; columns such as dates do not carry such values.
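Both summaries can be produced with describe(), sketched here on a hypothetical frame:

```python
import pandas as pd

# Hypothetical stand-in data with one categorical and one numerical column
df = pd.DataFrame({
    "Category": ["Office Supplies", "Furniture", "Office Supplies"],
    "Sales": [120.5, 310.0, 99.9],
})

# Categorical summary: count, unique, top, freq
cat_summary = df.describe(include="object")
# Numerical summary (the default): count, mean, std, quartiles
num_summary = df.describe()
print(cat_summary)
print(num_summary)
```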
Removing the outliers with the help of the standard deviation. Outliers can distort the analysis, so they are removed here by constructing boundaries from the mean and standard deviation of each column; rows whose Quantity, Profit or Sales values fall outside these boundaries are dropped, as these columns contain outliers (Berger and Kiefer, 2021).
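A minimal sketch of this filter, shown for a single column with hypothetical values (the same idea is applied to Quantity and Profit):

```python
import pandas as pd

# Hypothetical sales figures; the last value is an extreme outlier
df = pd.DataFrame({"Sales": [100.0] * 10 + [10000.0]})

# Keep only rows within mean +/- 3 standard deviations
mean, std = df["Sales"].mean(), df["Sales"].std()
lower, upper = mean - 3 * std, mean + 3 * std
df = df[df["Sales"].between(lower, upper)]
print(len(df))
```

Note that the mean and standard deviation are themselves inflated by the outlier, so the boundaries should be computed before filtering, as above.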
Plotting the data
A bar plot of product sales by category is provided, as it is useful to understand what sells the most for the company. The most sold product category turns out to be Office Supplies, followed by Furniture and then Technology.
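The plot can be sketched as follows, using hypothetical stand-in figures:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

# Hypothetical stand-in data
df = pd.DataFrame({
    "Category": ["Office Supplies", "Furniture", "Technology", "Office Supplies"],
    "Sales": [200.0, 150.0, 120.0, 180.0],
})

# Total sales per category, highest first, drawn as a bar chart
totals = df.groupby("Category")["Sales"].sum().sort_values(ascending=False)
totals.plot(kind="bar", ylabel="Total sales")
plt.tight_layout()
```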
Here the trend of sales over the period covered by the data is plotted as a line chart. The plot shows a clear seasonal pattern: sales grow towards the end of each year, spike at year end and then drop at the beginning of the next year. Another observable pattern is that the company's sales have improved over the years, which can be identified from the increasingly large spikes towards the end of the plot.
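A sketch of this chart, aggregating hypothetical order dates to monthly totals before plotting:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

# Hypothetical order dates and sales standing in for the real data
df = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2016-01-15", "2016-06-10", "2016-12-20", "2017-12-22"]
    ),
    "Sales": [100.0, 150.0, 400.0, 600.0],
})

# Aggregate sales to monthly totals and plot the trend over time
monthly = df.groupby(df["Order Date"].dt.to_period("M"))["Sales"].sum()
monthly.plot(kind="line", ylabel="Sales")
plt.tight_layout()
```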
Modelling an ML algorithm
The K-Means clustering algorithm will be used to model the data, as it can find clusters of data points that are close in proximity and hence infer hidden patterns in the data that are otherwise not apparent. K-Means is a simple yet effective algorithm for discovering groupings in data and can help identify important features that will be valuable for the company's sales optimisation (Ahmed et al., 2020).
Here the 'Ship Mode', 'Segment', 'Region', 'Category', 'Sub-Category', 'City', 'State' and 'Returned' features are encoded, while the 'Sales', 'Quantity', 'Discount' and 'Profit' features are scaled using the StandardScaler. This is a necessary step for the K-Means algorithm: because it measures Euclidean distances, columns with large ranges would otherwise dominate the clustering, and proper encoding and scaling improve the quality of the model and the insights it provides for the company.
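This preprocessing can be sketched on a small hypothetical frame; label encoding and StandardScaler are one reasonable reading of the step described:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical stand-in frame with two categorical and two numerical columns
df = pd.DataFrame({
    "Ship Mode": ["First Class", "Standard Class", "Second Class"],
    "Region": ["West", "East", "West"],
    "Sales": [120.5, 310.0, 99.9],
    "Profit": [10.0, 45.0, 5.0],
})

# Encode each categorical column as integer labels
for col in ["Ship Mode", "Region"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Standardise the numerical columns to zero mean and unit variance
num_cols = ["Sales", "Profit"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```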
Here the K-Means clustering is performed by first importing the necessary libraries and then optimising the number of clusters for the algorithm. The optimum is found using the silhouette score, which is one of the best ways to determine the optimal number of clusters (Oktarina et al., 2020).
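The cluster-size search can be sketched as follows, on synthetic two-blob data rather than the real dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs (hypothetical data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(5, 0.2, (50, 2))])

# Fit K-Means for a range of k and keep the k with the best silhouette score
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k)
```

On this toy data the silhouette score correctly picks two clusters; on the real dataset the same loop yields the optimum reported below.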
The optimal number of clusters found is 2. The features contributing the most variance in the data frame are 'City', 'State' and 'Sub-Category', and these are likely the most influential features contributing to the 'Profit' of the company. For optimisation, the company should therefore focus on its profitable product sub-categories and on tuning discounts across the different locations in which it operates, as improving these is the clearest route to improving profit.
References
Ahmed, M., Seraj, R. and Islam, S.M.S. (2020). The k-means algorithm: A comprehensive survey and performance evaluation. Electronics, 9(8), p.1295.
Berger, A. and Kiefer, M. (2021). Comparison of different response time outlier exclusion methods: A simulation study. Frontiers in Psychology, 12, p.675558.
Oktarina, C., Notodiputro, K.A. and Indahwati, I. (2020). Comparison of k-means clustering method and k-medoids on twitter data. Indonesian Journal of Statistics and Its Applications, 4(1), pp.189-202.