Installing openpyxl, the dependency Pandas requires to load the data with the read_excel() function.
Importing the data
Here the necessary imports are done that will be used throughout the program. They include essential libraries such as Pandas, Matplotlib, and NumPy, along with the LinearRegression model and several metrics for evaluating the regression model.
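The imports described above can be sketched as follows. The workbook filename is a hypothetical placeholder, since the actual path is not shown in this report.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical filename; pd.read_excel() uses openpyxl to read .xlsx files.
# df = pd.read_excel("Superstore.xlsx")
```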
Cleaning the data
Checking for missing values
Now, the data needs to be cleaned and prepared into a workable dataset. First, a check for missing values is conducted. This reveals that there are no missing values in the dataset.
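The missing-value check can be sketched as below. The toy frame here is illustrative only; the actual Superstore columns and values are not reproduced.

```python
import pandas as pd

# Toy stand-in for the Superstore data (hypothetical values).
df = pd.DataFrame({
    "Sales": [120.0, 45.5, 300.0],
    "Profit": [20.0, -5.0, 60.0],
})

# Count missing values per column; the report finds zero in every column.
missing = df.isnull().sum()
print(missing)
```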
The next step is to remove the outliers, an important part of the cleaning process that leads to a more reliable dataset capable of producing useful results with a model.
Removing outliers
To remove outliers, the columns to check for such values must first be identified. This can be done using the describe() method from Pandas. The following output describes the categorical columns.
This describes the numerical columns.
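Both summaries come from describe(); a minimal sketch with an illustrative frame (the column names mirror the dataset, the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Ship Mode": ["Second Class", "Standard Class", "Second Class"],
    "Sales": [261.96, 731.94, 14.62],
    "Quantity": [2, 3, 2],
})

# Categorical columns: count, unique, top, freq.
cat_summary = df.describe(include="object")

# Numerical columns: count, mean, std, min, quartiles, max.
num_summary = df.describe()
```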
The columns Sales, Profit, and Quantity all contain outliers, highlighted by a comparatively small mean and a much larger max value. Hence, the outliers can be removed from these columns. Here, the implementation uses the z-score method to remove outliers.
This method identified 168 outliers and removed them from the dataset, leaving a much more usable dataset.
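A minimal sketch of z-score filtering on synthetic data (the threshold of 3 is a common convention; the report does not state which threshold was used, so treat it as an assumption):

```python
import numpy as np
import pandas as pd

# Synthetic data: 99 typical rows plus one extreme row per column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Sales": np.append(rng.normal(100, 10, 99), 10_000.0),
    "Profit": np.append(rng.normal(20, 5, 99), -5_000.0),
    "Quantity": np.append(rng.normal(3, 1, 99), 500.0),
})

cols = ["Sales", "Profit", "Quantity"]
# z-score: distance from the column mean in units of standard deviation.
z = (df[cols] - df[cols].mean()) / df[cols].std()
# Keep rows where every checked column is within 3 standard deviations.
clean = df[(z.abs() < 3).all(axis=1)]
```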
Summary statistics
Now, summary statistics of the cleaned data can be produced to check the quality of the data the data frame now contains.
The data has become much more coherent in nature now that the outliers have been removed.
Plotting the data
Bar plot of total sales across different states
Here the bar plot of sales in different states is shown. The totals are calculated using the groupby() function from Pandas, which provides a flexible way to aggregate values.
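The groupby-and-plot step can be sketched as below, again with illustrative rows rather than the real data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative rows; the real dataset has many more states and orders.
df = pd.DataFrame({
    "State": ["California", "California", "North Dakota", "Texas"],
    "Sales": [200.0, 150.0, 10.0, 80.0],
})

# Total sales per state, largest first.
state_sales = df.groupby("State")["Sales"].sum().sort_values(ascending=False)

state_sales.plot(kind="bar", figsize=(10, 5), title="Total sales by state")
plt.ylabel("Sales")
plt.tight_layout()
```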
The visualization shows that California has the most sales and North Dakota has the least sales. Afterward, the trends of profit can be analysed through a continuous plot. This plot shows the profit obtained by the Superstore over the recorded days.
Continuous plot of daily profit trends
As time went on, the profit seemed to increase, suggesting that the Superstore had increased its sales.
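A sketch of the daily-profit line plot, assuming profits are summed per order date (the dates and values here are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2015-01-01", "2015-01-01", "2015-01-02", "2015-01-03"]
    ),
    "Profit": [5.0, 7.0, 3.0, 9.0],
})

# Sum profit per day, then draw a continuous line over time.
daily_profit = df.groupby("Order Date")["Profit"].sum()
daily_profit.plot(figsize=(12, 4), title="Daily profit over time")
plt.ylabel("Profit")
```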
Regression Model
Justification
A regression model can identify important patterns in the data and highlight improvements the Superstore can make to increase its sales. Linear regression works by fitting a line to data whose target can be predicted linearly (Kahwachi, 2020). Hence, it is a great choice both as a baseline model and as a straightforward method to predict the target.
Data preparation
To work with a regression model, several aspects of the final data frame should be improved (Sapre and Vartak, 2020). The first part is the removal of unnecessary features that do not contribute to the outcome of profit. This includes the ID columns, dates, Country (a common value in all rows), and any type of name. Hence, the columns that will be removed are 'Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Customer ID', 'Customer Name', 'Country', 'Product ID', and 'Product Name'.
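The removal step can be sketched as follows; only a few of the listed columns appear in this toy frame:

```python
import pandas as pd

# Toy frame with a subset of the columns in question (hypothetical rows).
df = pd.DataFrame({
    "Row ID": [1, 2],
    "Order ID": ["CA-2016-1", "CA-2016-2"],
    "Country": ["United States", "United States"],
    "Sales": [261.96, 731.94],
    "Profit": [41.91, 219.58],
})

# Drop identifier-like columns that do not help predict profit.
drop_cols = ["Row ID", "Order ID", "Country"]
df = df.drop(columns=drop_cols)
```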
Then the encoding of the categorical columns should be conducted.
It can be seen that the columns have been successfully removed from the dataset.
Now, the encoding is performed on the categorical columns, which are Ship Mode, Segment, City, State, Region, Category, and Sub-Category. The data is passed through one-hot and label encoding so that it becomes usable for the regression model. Here, LabelEncoder is imported from the sklearn library to make the function available.
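The label-encoding part can be sketched as below. This shows only LabelEncoder applied per column on toy data; the report's exact split between one-hot and label encoding is not shown, so the choice of columns here is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Ship Mode": ["Second Class", "Standard Class", "Second Class"],
    "Region": ["West", "South", "West"],
    "Sales": [261.96, 731.94, 14.62],
})

# Map each category to an integer label, column by column.
cat_cols = ["Ship Mode", "Region"]
encoder = LabelEncoder()
for col in cat_cols:
    df[col] = encoder.fit_transform(df[col])
```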
Now, the columns can be seen appropriately encoded to be usable.
In total, there are 46 features after the encoding is performed.
Training the Linear Regression model
Now the model can be trained on the data. This step requires first splitting the data with the train_test_split() function and then fitting the loaded regression model on the training set (Muraina, 2022).
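The split-fit-evaluate loop can be sketched on synthetic features; the metric values below will differ from the report's, since the data is artificial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the encoded feature matrix and profit target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
```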
The model achieves a Mean Squared Error of approximately 4657.72 and an R-squared value of around 0.381. This indicates that the model explains about 38.1% of the variance in the target variable, profit. However, the MSE is relatively high. The actual and predicted values can be plotted together to illustrate the outcome of model training.
The plot shows the linear nature of the predicted values compared to the actual values, which are more spread out. This spread explains the high MSE, since the metric measures the average squared distance of the points from the fitted line. Now, the features can be examined to understand the model better.
Most important features in the model
The outcome of the model shows that the most influential feature is the discount, followed by sub-categories such as Copiers, Tables, Machines, and other product types. It can be stated that for the company to profit more, it should focus on providing discounts and selling products such as Copiers, Tables, and Machines.
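Ranking features by the magnitude of the fitted coefficients can be sketched as below. The feature names and the true coefficients in this toy example are hypothetical; they merely mimic the report's finding that Discount dominates.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data where "Discount" has the strongest (assumed) effect on the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(100, 3)),
    columns=["Discount", "Sub-Category_Copiers", "Sub-Category_Tables"],
)
y = 5.0 * X["Discount"] + 2.0 * X["Sub-Category_Copiers"] + rng.normal(
    scale=0.1, size=100
)

model = LinearRegression().fit(X, y)

# Absolute coefficient size as a rough measure of feature influence.
importance = pd.Series(np.abs(model.coef_), index=X.columns)
importance = importance.sort_values(ascending=False)
```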
References
Kahwachi, W. (2020). A Comparison of WK4 and MSE for Regression Model Fitting. Journal of Al-Rafidain University College For Sciences (Print ISSN: 1681-6870, Online ISSN: 2790-2293), (1), pp.530-535.
Muraina, I. (2022). Ideal dataset splitting ratios in machine learning algorithms: general concerns for data scientists and data analysts. In 7th International Mardin Artuklu Scientific Research Conference (pp. 496-504).
Sapre, A. and Vartak, S. (2020). Scientific Computing and Data Analysis using NumPy and Pandas. International Research Journal of Engineering and Technology, 7, pp.1334-1346.