Here openpyxl is installed to read the file via Pandas, as without it the code will not work and the file cannot be loaded.
Importing the data
First, importing of Pandas and Matplotlib is done. Pandas is a data wrangling library that allows efficient data handling through the use of data frames. Matplotlib is a library that allows for effective visualisation of the data. This is to make sure that data wrangling and data visualisation can be conducted. Then the data is loaded with the read_excel() function from Pandas as it is a .xslx file (Zhekova, 2023).
Viewing the data
To make sure the data is loaded properly, the function head() is used here, which shows top 5 entry in the data frame.
Check for missing values
To find the missing values, total null values are summed up and then printed as a count in each column. No missing value is found after checking through the data frame.
Summary statistics
The summary statistics is done for the data, which shows important information about the data. This information would allow for effective data cleaning procedure.
Since, the summary by default only shows the numeric columns, the include='object' keyword argument is used in the describe function to show the summary of the categorical or nominal variables in the data frame.
Removing outliers
By investigating the summary statistics, the columns with the outliers can be found. The three columns with outliers are Quantity, Profit and Sales. Therefore, the rows with outlier values can be removed with IQR (Inter-Quartile Range) method (Domański, 2020). It clamps the values between [Q1-1.5*IQR, Q3 + 1.5*IQR] range.
The removal shortened the data frame into a subset containing 7782 data points.
Plotting the data
Bar chart of Ship Mode
Ship Mode is a crucial metric to understand as it can influence the profit company makes. Here a bar chart is visualised to understand the categorical variable Ship Mode further. It is clear that most frequent shipping mode type is Standard Class and the least is Same Day. This can be attributed to the fact that for same day delivery, extra charges can apply. This can in turn deter the customers from making the purchase.
Scatter plot of Discount vs. Profit
A scatter plot of two variables Discount and Profit is presented here which shows insights into the customer behavior of the shop. Most profit is seen when people are buying at 20% discount and as the discount increases, a trend can be observed. Profit that is obtained is of similar amount irrespective of the discount being 60% or 80%.
Modelling on the data
Here for modelling on the data best worst method is used to find the most and least profitable products in the data frame.
Justification
The Best-Worst method (BWM) provides a robust framework for decision-making by systematically evaluating and ranking features or criteria based on their relative importance (Pamučar et al., 2020). By employing BWM, we can effectively identify the most influential features from a set of options. This approach offers several advantages, including simplicity in implementation, transparency in the decision-making process, and the ability to handle complex decision scenarios with multiple criteria. Moreover, BWM allows accounting for both the positive (best) and negative (worst) aspects of each feature, providing a more comprehensive understanding of their impact.
Data preparation
The data needs to be prepared before setting up inside a model. The first task is to change the categorical features into encoded features such that it can be processed by a machine. Most machine learning models do not work with raw textual information and instead uses some form of encoding to make sure the data is efficiently stored for computation. In this scenario, the columns 'Ship Mode', 'Segment', 'Region', 'Category', 'Sub-Category' and 'Returned' are encoded from the data frame.
Then the label encoder is used to encode the 'City' and 'State' columns to make sure there are no textual information. Then the removal of some variable or features is done, which includes 'Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Customer ID', 'Customer Name', 'Customer_no', 'Product ID', 'Product Name' and 'Country'.
These features are omitted due to their lack of contributing factor to the profit directly. Most of these features are unique and arbitrary, which has little significance to the predictive model. The Country column contains only one country, which is why it is omitted. Feature such as the product name and Customer name also is not significant. The category and subcategory columns are more useful information here than the names.
Modelling the function for Best-Worst method
This Python function implements the Best-Worst method (BWM) for analysing the influence of features in a dataset. It begins by computing the mean and standard deviation of each feature to normalise the scores. Then, it calculates the normalised score for each feature. After obtaining the normalised scores, it determines the best and worst scores for each feature. By subtracting the worst score from the best score, the function computes the net score for each feature, indicating its influence (Liang et al., 2020). This allows for more efficient modelling of the features.
Most influential features in the model
After finding the influential factors from the model, it is clear that some subcategories are critical in the profit of the company, which include the Copiers, Machines, Tables and Bookcases. On the other hand, Ship Mode seems to be a less major feature in the model's predictive capability. It can be concluded from the model that the client should focus on the strength of their business that is the categories that performed well. This would imply that focusing on those categories would lead to more profit in the long term.
References
Domański, P.D. (2020). Study on statistical outlier detection and labeling. International Journal of Automation and Computing, 17(6), pp.788-811.
Liang, F., Brunelli, M. and Rezaei, J. (2020). Consistency issues in the best worst method: Measurements and thresholds. Omega, 96, p.102175.
Pamučar, D., Ecer, F., Cirovic, G. and Arlasheedi, M.A. (2020). Application of improved best worst method (BWM) in real-world problems. Mathematics, 8(8), p.1342.
Zhekova, M. (2023). An Algorithm for Exploratory Analysis and Normalization of Big Data with Pandas. In Proceedings of the Bulgarian Academy of Sciences (Vol. 76, No. 11, pp. 1716-1723).