openpyxl is installed first because Pandas relies on it to read .xlsx files. Without it, read_excel() raises an error and the file cannot be loaded.

!pip install openpyxl

Importing the data

First, Pandas and Matplotlib are imported. Pandas is a data wrangling library that handles data efficiently through data frames, and Matplotlib is a library for visualising data. Together they cover the data wrangling and data visualisation needed here. The data is then loaded with the read_excel() function from Pandas, as it is an .xlsx file (Zhekova, 2023).

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel('dataset_Superstore.xlsx')

Viewing the data

To confirm that the data has loaded properly, the head() function is used, which shows the first five rows of the data frame.

df.head()

Check for missing values

To find missing values, the null values in each column are summed and the counts are printed. No missing values are found in the data frame.

print(df.isna().sum())
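As an illustration of how to read this output, the sketch below runs the same check on a small hypothetical frame (the `toy` data is invented for the example and is not part of the Superstore file). A non-zero count marks a column that would need attention.

```python
import pandas as pd

# Hypothetical toy frame with one missing Profit value (not the Superstore data)
toy = pd.DataFrame({
    "Sales": [10.0, 20.0, 30.0],
    "Profit": [1.5, None, 4.0],
})

# isna() flags each missing cell as True; sum() then counts the flags per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```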

Summary statistics

Summary statistics are computed for the data, showing the count, mean, spread and range of each numeric column. This information guides the data cleaning procedure.

print(df.describe())

Since the summary shows only the numeric columns by default, the include='object' keyword argument is passed to describe() to summarise the categorical (nominal) variables in the data frame.

df.describe(include='object')
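The difference between the two calls can be seen on a small hypothetical frame (the `toy` data is invented for illustration). The text column is given dtype=object explicitly here, an assumption made so that the sketch behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column
toy = pd.DataFrame({
    "Segment": pd.Series(["Consumer", "Corporate", "Consumer"], dtype="object"),
    "Sales": [10.0, 25.5, 7.2],
})

# Default describe(): numeric columns only (count, mean, std, min, quartiles, max)
numeric_summary = toy.describe()

# include='object': text columns only (count, unique, top, freq)
categorical_summary = toy.describe(include="object")

print(numeric_summary)
print(categorical_summary)
```

For a categorical column, `top` is the most frequent value and `freq` is how often it occurs.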

Removing outliers

By investigating the summary statistics, the columns with outliers can be found. The three columns with outliers are Quantity, Profit and Sales. The rows with outlier values can therefore be removed with the IQR (Inter-Quartile Range) method (Domański, 2020). It keeps only the rows whose values lie within the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each of these columns.

# Calculate IQR for each specified column
Q1_quantity, Q3_quantity = df['Quantity'].quantile([0.25, 0.75])
IQR_quantity = Q3_quantity - Q1_quantity

Q1_profit, Q3_profit = df['Profit'].quantile([0.25, 0.75])
IQR_profit = Q3_profit - Q1_profit

Q1_sales, Q3_sales = df['Sales'].quantile([0.25, 0.75])
IQR_sales = Q3_sales - Q1_sales

# Define boundaries to filter out outliers
lower_bound_quantity = Q1_quantity - (1.5 * IQR_quantity)
upper_bound_quantity = Q3_quantity + (1.5 * IQR_quantity)

lower_bound_profit = Q1_profit - (1.5 * IQR_profit)
upper_bound_profit = Q3_profit + (1.5 * IQR_profit)

lower_bound_sales = Q1_sales - (1.5 * IQR_sales)
upper_bound_sales = Q3_sales + (1.5 * IQR_sales)

# Apply filters to the DataFrame
df_cleaned = df[(df['Quantity'] >= lower_bound_quantity) & (df['Quantity'] <= upper_bound_quantity) &
                (df['Profit'] >= lower_bound_profit) & (df['Profit'] <= upper_bound_profit) &
                (df['Sales'] >= lower_bound_sales) & (df['Sales'] <= upper_bound_sales)]

df_cleaned.shape
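The same IQR filter can also be written as a reusable helper. The sketch below is one possible formulation, not the code used in this analysis; the function names `iqr_bounds` and `drop_iqr_outliers` and the `toy` frame are hypothetical. As in the cell above, all bounds are computed on the unfiltered data before any rows are removed.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of a numeric Series under the IQR rule."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_iqr_outliers(frame, columns, k=1.5):
    """Keep only rows that fall inside the fences of every listed column."""
    mask = pd.Series(True, index=frame.index)
    for column in columns:
        lower, upper = iqr_bounds(frame[column], k)
        # between() is inclusive on both ends, matching >= and <=
        mask &= frame[column].between(lower, upper)
    return frame[mask]

# Hypothetical toy data: 100 is an obvious outlier in Quantity
toy = pd.DataFrame({
    "Quantity": [1, 2, 3, 4, 100],
    "Sales": [10, 12, 11, 13, 12],
})
filtered = drop_iqr_outliers(toy, ["Quantity", "Sales"])
print(filtered)
```

In the toy data, Quantity has Q1 = 2 and Q3 = 4, so the fences are -1 and 7 and only the row with 100 is dropped.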

After the outliers are removed, the data frame is reduced to a subset of 7782 rows.

Plotting the data

Bar chart of Ship Mode

Ship Mode is a crucial variable to understand, as it can influence the profit the company makes. A bar chart is plotted here to explore this categorical variable further. The most frequent shipping mode is clearly Standard Class and the least frequent is Same Day. This may be because same-day delivery can incur extra charges, which can in turn deter customers from making the purchase.

plt.figure(figsize=(8, 6))
df_cleaned['Ship Mode'].value_counts().plot(kind='bar', color='violet')
plt.title('Types of shipping modes')
plt.xlabel('Ship Mode')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

Scatter plot of Discount vs. Profit

A scatter plot of Discount against Profit is presented here, which gives insight into the behaviour of the shop's customers. The highest profits are seen for purchases at a 20% discount, and a trend can be observed as the discount increases. The profit obtained is similar whether the discount is 60% or 80%.

plt.figure(figsize=(8, 6))
plt.scatter(df_cleaned['Discount'], df_cleaned['Profit'], alpha=0.5, color='royalblue')
plt.title('Discount vs Profit')
plt.xlabel('Discount')
plt.ylabel('Profit')
plt.show()

Modelling on the data

For modelling, the Best-Worst method is used to find the most and least profitable products in the data frame.

Justification

The Best-Worst method (BWM) provides a robust framework for decision-making by systematically evaluating and ranking features or criteria based on their relative importance (Pamučar et al., 2020). By employing BWM, we can effectively identify the most influential features from a set of options. This approach offers several advantages, including simplicity in implementation, transparency in the decision-making process, and the ability to handle complex decision scenarios with multiple criteria. Moreover, BWM allows accounting for both the positive (best) and negative (worst) aspects of each feature, providing a more comprehensive understanding of their impact.

Data preparation

The data must be prepared before it is passed to a model. The first task is to encode the categorical features so that they can be processed by a machine. Most machine learning models do not work with raw text and instead require some form of numeric encoding. In this scenario, the columns 'Ship Mode', 'Segment', 'Region', 'Category', 'Sub-Category' and 'Returned' are one-hot encoded.

Then a label encoder is used to encode the 'City' and 'State' columns so that no textual information remains. Finally, some features are removed: 'Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Customer ID', 'Customer Name', 'Customer_no', 'Product ID', 'Product Name' and 'Country'.

These features are omitted because they do not contribute directly to the profit. Most of them are unique, arbitrary identifiers, which have little significance for a predictive model. The Country column contains only one country, which is why it is omitted. Features such as the product name and customer name are also not significant. The category and sub-category columns carry more useful information than the names.

from sklearn.preprocessing import LabelEncoder

# Convert categorical features using one-hot encoding
categorical_features = ['Ship Mode', 'Segment', 'Region', 'Category', 'Sub-Category', 'Returned']
df_encoded = pd.get_dummies(df_cleaned, columns=categorical_features, drop_first=True, dtype=int)

# Create a LabelEncoder instance
label_encoder = LabelEncoder()
df_encoded['City'] = label_encoder.fit_transform(df_encoded['City'])
df_encoded['State'] = label_encoder.fit_transform(df_encoded['State'])

# The columns that will be dropped from the final data frame
columns_to_drop = ['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Customer ID', 'Customer Name',
                   'Customer_no', 'Product ID', 'Product Name', 'Country']
df_final = df_encoded.drop(columns=columns_to_drop)

# View the final data frame
df_final.head()
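To show what the two encodings produce, here is a small sketch on a hypothetical four-row frame (the `toy` data is invented and is not the Superstore data). To keep the example free of extra dependencies, the integer codes are generated with pandas category codes rather than scikit-learn's LabelEncoder. This is a substitution, although both assign codes following the sorted order of the labels.

```python
import pandas as pd

# Hypothetical toy frame (not the Superstore data)
toy = pd.DataFrame({
    "Ship Mode": ["Standard Class", "Same Day", "First Class", "Standard Class"],
    "State": ["Texas", "Ohio", "Texas", "Utah"],
})

# One-hot encoding: one 0/1 column per category. drop_first=True drops the
# first category in sorted order ('First Class'), which becomes the baseline.
one_hot = pd.get_dummies(toy, columns=["Ship Mode"], drop_first=True, dtype=int)

# Label-style encoding: each distinct label becomes one integer,
# assigned in sorted order (Ohio=0, Texas=1, Utah=2)
one_hot["State"] = one_hot["State"].astype("category").cat.codes

print(one_hot)
```

One design point to keep in mind: integer codes impose an ordering on the labels that has no real meaning, so a model may treat 'Utah' as larger than 'Ohio'. One-hot encoding avoids this at the cost of more columns.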

Modelling the function for Best-Worst method

This Python function implements the Best-Worst method (BWM) for analysing the influence of features in a dataset. It begins by computing the mean and standard deviation of each feature, which are used to normalise the scores. The normalised score of each value is its difference from the feature mean, divided by the feature's standard deviation. After obtaining the normalised scores, the function determines the best (maximum) and worst (minimum) score for each feature. By subtracting the worst score from the best score, it computes the net score for each feature, indicating its influence (Liang et al., 2020). The features are then returned sorted by net score, from highest to lowest.

def best_worst_method(dataframe, features):
    # Calculate the mean value for each feature
    mean_values = dataframe[features].mean()

    # Calculate the normalized score for each feature
    normalized_scores = {}
    for feature in features:
        normalized_scores[feature] = (dataframe[feature] - mean_values[feature]) / dataframe[feature].std()

    # Calculate the best and worst score for each feature
    best_scores = {feature: normalized_scores[feature].max() for feature in features}
    worst_scores = {feature: normalized_scores[feature].min() for feature in features}

    # Calculate the net score for each feature
    net_scores = {feature: best_scores[feature] - worst_scores[feature] for feature in features}

    # Sort the features based on their net score
    sorted_features = sorted(net_scores.items(), key=lambda x: x[1], reverse=True)

    return sorted_features
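To make the scoring concrete, the sketch below applies the same net-score calculation to a hypothetical frame with two 0/1 indicator columns. The helper name `net_scores` and the `toy` data are assumptions for illustration, and the helper is a compact vectorised equivalent of the calculation, not the function above verbatim.

```python
import pandas as pd

def net_scores(frame):
    """Net score per column: max minus min of the standardised values."""
    # Standardise each column with its mean and sample standard deviation
    z = (frame - frame.mean()) / frame.std()
    # Best score minus worst score, sorted from highest to lowest
    return (z.max() - z.min()).sort_values(ascending=False)

# Hypothetical indicator columns: one category is rare, one is common
toy = pd.DataFrame({
    "common_flag": [0, 1, 0, 1],
    "rare_flag": [0, 0, 0, 1],
})
scores = net_scores(toy)
print(scores)
```

Here rare_flag scores 2.0 and common_flag scores about 1.73. For a 0/1 column the net score reduces to 1 divided by the standard deviation, so rarer categories receive higher scores. This is worth bearing in mind when reading the ranking of one-hot encoded features.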

Most influential features in the model

best_worst_features = best_worst_method(df_final, df_final.columns) print(best_worst_features)

After finding the influential factors from the model, it is clear that some sub-categories are critical to the company's profit, including Copiers, Machines, Tables and Bookcases. On the other hand, Ship Mode appears to be a less important feature in the model. It can be concluded from the model that the client should focus on the strengths of their business, that is, the categories that performed well. Focusing on those categories should lead to more profit in the long term.

References

Domański, P.D. (2020). Study on statistical outlier detection and labeling. International Journal of Automation and Computing, 17(6), pp.788-811.

Liang, F., Brunelli, M. and Rezaei, J. (2020). Consistency issues in the best worst method: Measurements and thresholds. Omega, 96, p.102175.

Pamučar, D., Ecer, F., Cirovic, G. and Arlasheedi, M.A. (2020). Application of improved best worst method (BWM) in real-world problems. Mathematics, 8(8), p.1342.

Zhekova, M. (2023). An Algorithm for Exploratory Analysis and Normalization of Big Data with Pandas. In Proceedings of the Bulgarian Academy of Sciences (Vol. 76, No. 11, pp. 1716-1723).