Installing openpyxl, the dependency Pandas requires to load the data with the read_excel() function.
Importing the data
Here the necessary imports are done that will be used throughout the program. They include essential libraries such as Pandas, Matplotlib, and NumPy, along with the LinearRegression model and several metrics for evaluating the regression model.
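The imports described above can be sketched as follows. The workbook filename is a hypothetical placeholder, since the actual path is not shown in this report.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical filename; pd.read_excel() uses openpyxl to read .xlsx files.
# df = pd.read_excel("Superstore.xlsx")
```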
Cleaning the data
Checking for missing values
Now, the data needs to be cleaned and prepared into a workable dataset. First, a check for missing values is conducted. This reveals that there are no missing values in the dataset.
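The missing-value check can be sketched as below. The toy frame here is illustrative only; the actual Superstore columns and values are not reproduced.

```python
import pandas as pd

# Toy stand-in for the Superstore data (hypothetical values).
df = pd.DataFrame({
    "Sales": [120.0, 45.5, 300.0],
    "Profit": [20.0, -5.0, 60.0],
})

# Count missing values per column; the report finds zero in every column.
missing = df.isnull().sum()
print(missing)
```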
The next step is to remove the outliers, an important part of the cleaning process that leads to a more reliable dataset capable of producing useful results with a model.
Removing outliers
To remove outliers, the columns to check for such values must first be identified. This can be done using the describe() method from Pandas. The following output describes the categorical columns.
This describes the numerical columns.
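Both summaries come from describe(); a minimal sketch with an illustrative frame (the column names mirror the dataset, the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Ship Mode": ["Second Class", "Standard Class", "Second Class"],
    "Sales": [261.96, 731.94, 14.62],
    "Quantity": [2, 3, 2],
})

# Categorical columns: count, unique, top, freq.
cat_summary = df.describe(include="object")

# Numerical columns: count, mean, std, min, quartiles, max.
num_summary = df.describe()
```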
The columns Sales, Profit, and Quantity all contain outliers, highlighted by a comparatively small mean and a much larger max value. Hence, the outliers can be removed from these columns. Here, the implementation uses the z-score method to remove outliers.
This method identified 168 outliers and removed them from the dataset, leaving a much more usable dataset.
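A minimal sketch of z-score filtering on synthetic data (the threshold of 3 is a common convention; the report does not state which threshold was used, so treat it as an assumption):

```python
import numpy as np
import pandas as pd

# Synthetic data: 99 typical rows plus one extreme row per column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Sales": np.append(rng.normal(100, 10, 99), 10_000.0),
    "Profit": np.append(rng.normal(20, 5, 99), -5_000.0),
    "Quantity": np.append(rng.normal(3, 1, 99), 500.0),
})

cols = ["Sales", "Profit", "Quantity"]
# z-score: distance from the column mean in units of standard deviation.
z = (df[cols] - df[cols].mean()) / df[cols].std()
# Keep rows where every checked column is within 3 standard deviations.
clean = df[(z.abs() < 3).all(axis=1)]
```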
Summary statistics
Now, summary statistics of the cleaned data can be produced to check the quality of the data the data frame now contains.
The data has become much more coherent in nature now that the outliers have been removed.
Plotting the data
Bar plot of total sales across different states
Here the bar plot of sales in different states is shown. The totals are calculated using the groupby() function from Pandas, which provides a flexible way to aggregate values.
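The groupby-and-plot step can be sketched as below, again with illustrative rows rather than the real data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative rows; the real dataset has many more states and orders.
df = pd.DataFrame({
    "State": ["California", "California", "North Dakota", "Texas"],
    "Sales": [200.0, 150.0, 10.0, 80.0],
})

# Total sales per state, largest first.
state_sales = df.groupby("State")["Sales"].sum().sort_values(ascending=False)

state_sales.plot(kind="bar", figsize=(10, 5), title="Total sales by state")
plt.ylabel("Sales")
plt.tight_layout()
```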
The visualization shows that California has the most sales and North Dakota has the least sales. Afterward, the trends of profit can be analysed through a continuous plot. This plot shows the profit obtained by the Superstore over the recorded days.
Continuous plot of daily profit trends
As time went on, the profit seemed to increase, suggesting that the Superstore had increased its sales.
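A sketch of the daily-profit line plot, assuming profits are summed per order date (the dates and values here are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2015-01-01", "2015-01-01", "2015-01-02", "2015-01-03"]
    ),
    "Profit": [5.0, 7.0, 3.0, 9.0],
})

# Sum profit per day, then draw a continuous line over time.
daily_profit = df.groupby("Order Date")["Profit"].sum()
daily_profit.plot(figsize=(12, 4), title="Daily profit over time")
plt.ylabel("Profit")
```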
Regression Model
Justification
A regression model can identify important patterns in the data and highlight improvements the Superstore can make to increase its sales. Linear regression works by fitting a line to data whose target can be predicted linearly (Kahwachi, 2020). Hence, it is a great choice both as a baseline model and as a straightforward method to predict the target.
Data preparation
To work with a regression model, several aspects of the final data frame should be improved (Sapre and Vartak, 2020). The first part is the removal of unnecessary features that do not contribute to the outcome of profit. This includes the ID columns, dates, Country (a common value in all rows), and any type of name. Hence, the columns that will be removed are 'Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Customer ID', 'Customer Name', 'Country', 'Product ID', and 'Product Name'.
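The removal step can be sketched as follows; only a few of the listed columns appear in this toy frame:

```python
import pandas as pd

# Toy frame with a subset of the columns in question (hypothetical rows).
df = pd.DataFrame({
    "Row ID": [1, 2],
    "Order ID": ["CA-2016-1", "CA-2016-2"],
    "Country": ["United States", "United States"],
    "Sales": [261.96, 731.94],
    "Profit": [41.91, 219.58],
})

# Drop identifier-like columns that do not help predict profit.
drop_cols = ["Row ID", "Order ID", "Country"]
df = df.drop(columns=drop_cols)
```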
Then the encoding of the categorical columns should be conducted.
It can be seen that the columns have been successfully removed from the dataset.
Now, the encoding is performed on the categorical columns, which are Ship Mode, Segment, City, State, Region, Category, and Sub-Category. The data is passed through one-hot and label encoding so that it becomes usable for the regression model. Here, LabelEncoder is imported from the sklearn library to make the function available.
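The label-encoding part can be sketched as below. This shows only LabelEncoder applied per column on toy data; the report's exact split between one-hot and label encoding is not shown, so the choice of columns here is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Ship Mode": ["Second Class", "Standard Class", "Second Class"],
    "Region": ["West", "South", "West"],
    "Sales": [261.96, 731.94, 14.62],
})

# Map each category to an integer label, column by column.
cat_cols = ["Ship Mode", "Region"]
encoder = LabelEncoder()
for col in cat_cols:
    df[col] = encoder.fit_transform(df[col])
```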
Now, the columns can be seen appropriately encoded to be usable.
In total, there are 46 features after the encoding is performed.
Training the Linear Regression model
Now the model can be trained on the data. This step requires first splitting the data with the train_test_split() function and then fitting the loaded regression model on the training set (Muraina, 2022).
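The split-fit-evaluate loop can be sketched on synthetic features; the metric values below will differ from the report's, since the data is artificial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the encoded feature matrix and profit target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
```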
The model achieves a Mean Squared Error of approximately 4657.72 and an R-squared value of around 0.381. This indicates that the model explains about 38.1% of the variance in the target variable, profit. However, the MSE is relatively high. The actual and predicted values can be plotted together to illustrate the outcome of model training.
The plot shows the linear nature of the predicted values compared to the actual values, which are more spread out. This spread explains the high MSE, since the metric measures the average squared distance of the points from the fitted line. Now, the features can be examined to understand the model better.
Most important features in the model
The outcome of the model shows that the most influential feature is the discount, followed by sub-categories such as Copiers, Tables, Machines, and other product types. It can be stated that for the company to profit more, it should focus on providing discounts and selling products such as Copiers, Tables, and Machines.
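Ranking features by the magnitude of the fitted coefficients can be sketched as below. The feature names and the true coefficients in this toy example are hypothetical; they merely mimic the report's finding that Discount dominates.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data where "Discount" has the strongest (assumed) effect on the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(100, 3)),
    columns=["Discount", "Sub-Category_Copiers", "Sub-Category_Tables"],
)
y = 5.0 * X["Discount"] + 2.0 * X["Sub-Category_Copiers"] + rng.normal(
    scale=0.1, size=100
)

model = LinearRegression().fit(X, y)

# Absolute coefficient size as a rough measure of feature influence.
importance = pd.Series(np.abs(model.coef_), index=X.columns)
importance = importance.sort_values(ascending=False)
```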
References
Kahwachi, W. (2020). A Comparison of WK4 and MSE for Regression Model Fitting. Journal of Al-Rafidain University College For Sciences (Print ISSN: 1681-6870, Online ISSN: 2790-2293), (1), pp.530-535.
Muraina, I. (2022). Ideal dataset splitting ratios in machine learning algorithms: general concerns for data scientists and data analysts. In 7th International Mardin Artuklu Scientific Research Conference (pp. 496-504).
Sapre, A. and Vartak, S. (2020). Scientific Computing and Data Analysis using NumPy and Pandas. International Research Journal of Engineering and Technology, 7, pp.1334-1346.