Installing the openpyxl dependency, which Pandas requires in order to load Excel data with the read_excel() function.

```
Collecting openpyxl
Downloading openpyxl-3.1.2-py2.py3-none-any.whl (249 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 250.0/250.0 kB 31.8 MB/s eta 0:00:00
Collecting et-xmlfile (from openpyxl)
Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2
[notice] A new release of pip is available: 23.1.2 -> 24.0
[notice] To update, run: pip install --upgrade pip
```

# Importing the data

Here, the libraries used throughout the program are imported. They include essential packages such as Pandas, Matplotlib, and NumPy, as well as the LinearRegression model and the metrics used to evaluate the regression model.
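The import block likely resembles the following sketch; the exact aliases and the set of metrics imported are assumptions based on what the rest of the analysis uses:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
```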

```
   Row ID        Order ID
0    3783  CA-2017-165204
1    7322  CA-2017-167626
2    1709  CA-2017-123491
3    2586  CA-2015-121041
4     356  CA-2016-138520
```

# Cleaning the data

## Checking for missing values

Now, the data needs to be cleaned and prepared into a workable dataset. First, a check for missing values should be conducted. It reveals that there are no missing values in the dataset.
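The check can be done with isnull().sum(), along these lines; the frame name `data` and the two stand-in columns are assumptions for illustration:

```python
import pandas as pd

# Hypothetical two-column stand-in for the Superstore frame
data = pd.DataFrame({"Sales": [100.0, 250.0], "Profit": [10.0, 25.0]})

# Count missing values per column; all zeros means no missing data
print(data.isnull().sum())
```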

```
Row ID 0
Order ID 0
Order Date 0
Ship Date 0
Ship Mode 0
Customer ID 0
Customer Name 0
Customer_no 0
Segment 0
Segment_no 0
Country 0
City 0
State 0
State_no 0
Postal Code 0
Region 0
Region_no 0
Product ID 0
Category 0
Category_no 0
Sub-Category 0
Sub-Category_no 0
Product Name 0
Product Name_no 0
Sales 0
Quantity 0
Discount 0
Profit 0
Returned 0
dtype: int64
```

The next step is to remove outliers, an important part of the cleaning process that leads to a more reliable dataset and more useful model results.

## Removing outliers

To remove outliers, the columns that should be checked for such values must first be identified. This can be done with the describe() method from Pandas. The following describes the categorical columns.

```
              Order ID       Ship Mode
count             9994            9994
unique            5009               4
top     CA-2017-100111  Standard Class
freq                14            5968
```

This describes the numerical columns.

```
            Row ID                     Order Date
count         9994                           9994
mean        4997.5  2016-04-30 00:07:12.259355392
min              1            2014-01-03 00:00:00
25%        2499.25            2015-05-23 00:00:00
50%         4997.5            2016-06-26 00:00:00
75%        7495.75            2017-05-14 00:00:00
max           9994            2017-12-30 00:00:00
std    2885.163629                            NaN
```

The columns Sales, Profit, and Quantity all contain outliers, highlighted by a small mean relative to a much larger maximum value. Hence, the outliers can be removed from these columns. Here, the implementation uses the z-score method.
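A minimal sketch of z-score filtering, assuming a cutoff of 3 standard deviations (the exact threshold used in the notebook is not stated); the helper name and the synthetic frame are illustrative:

```python
import pandas as pd

def remove_outliers_zscore(df, columns, threshold=3.0):
    """Drop rows whose z-score exceeds the threshold in any of the given columns."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        z = (df[col] - df[col].mean()) / df[col].std()
        mask &= z.abs() <= threshold
    print(f"Removed {(~mask).sum()} outliers")
    return df[mask]

# Synthetic example: twenty typical rows and one extreme Sales value
data = pd.DataFrame({"Sales": [10.0] * 20 + [500.0]})
clean = remove_outliers_zscore(data, ["Sales"])
```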

```
Removed 168 outliers
```

This method found 168 outliers and removed them from the dataset, leaving a much more usable dataset.

## Summary statistics

Now, summary statistics of the cleaned data can be computed to check the quality of the data the data frame now contains.

```
            Row ID                     Order Date
count         9826                           9826
mean   4992.430999  2016-04-29 22:02:28.015469312
min              1            2014-01-03 00:00:00
25%         2492.5            2015-05-23 06:00:00
50%         4987.5            2016-06-26 00:00:00
75%        7484.75            2017-05-14 00:00:00
max           9994            2017-12-30 00:00:00
std    2883.071762                            NaN
```

The data has become much more coherent in nature now that the outliers have been removed.

# Plotting the data

## Bar plot of total sales across different states

Here, the bar plot of total sales in different states is shown. The totals are calculated using the groupby() function from Pandas, which provides a flexible way to aggregate values.
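The aggregation and plot can be sketched as follows; the small frame here stands in for the cleaned Superstore data, and the plot styling is an assumption:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for illustration
import matplotlib.pyplot as plt
import pandas as pd

# Small stand-in for the cleaned Superstore frame
df = pd.DataFrame({
    "State": ["California", "California", "North Dakota", "Texas"],
    "Sales": [100.0, 250.0, 20.0, 90.0],
})

# Total sales per state, sorted for a readable bar plot
state_sales = df.groupby("State")["Sales"].sum().sort_values(ascending=False)
state_sales.plot(kind="bar", ylabel="Total Sales", title="Total Sales by State")
plt.tight_layout()
```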

The visualization shows that California has the highest sales and North Dakota the lowest. Next, profit trends can be analysed through a continuous plot showing the profit obtained by the Superstore over the recorded days.

## Continuous plot of daily profit trends

```
/tmp/ipykernel_59/1036106629.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
final_data['Order Date'] = pd.to_datetime(final_data['Order Date'])
```
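The warning comes from assigning into a filtered slice of a DataFrame; taking an explicit copy before the assignment is the usual fix, sketched here with an illustrative frame:

```python
import pandas as pd

orders = pd.DataFrame({
    "Order Date": ["2016-01-05", "2017-03-14"],
    "Profit": [12.0, 30.0],
})

# A filtered slice may be a view; copy() makes the later assignment unambiguous
final_data = orders[orders["Profit"] > 0].copy()
final_data["Order Date"] = pd.to_datetime(final_data["Order Date"])
```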

As time went on, the profit seemed to increase, suggesting that the Superstore had increased its sales.

# Regression Model

## Justification

A regression model can identify important patterns in the data and reveal improvements the Superstore can make to increase its sales. It works by fitting a line to data that has a linear relationship with the target (Kahwachi, 2020). Hence, it is a great choice both as a baseline model and as a straightforward method to predict the target.

## Data preparation

To work with a regression model, several aspects of the final data frame should be improved (Sapre and Vartak, 2020). The first step is the removal of unnecessary features that do not contribute to the target, profit. This includes ID columns, date columns, Country (a single value shared by all rows), and any type of name. Hence, the columns to be removed are 'Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Customer ID', 'Customer Name', 'Country', 'Product ID', and 'Product Name'.
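The removal can be sketched as follows; the frame name `final_data` and the tiny stand-in frame are assumptions for illustration:

```python
import pandas as pd

drop_cols = ["Row ID", "Order ID", "Order Date", "Ship Date", "Customer ID",
             "Customer Name", "Country", "Product ID", "Product Name"]

# Tiny stand-in frame with two of the columns above plus one kept feature
final_data = pd.DataFrame({"Row ID": [1], "Country": ["United States"], "Sales": [100.0]})

# Drop only the columns actually present, keeping the feature columns
features = final_data.drop(columns=[c for c in drop_cols if c in final_data.columns])
```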

It can be seen that the columns have been successfully removed from the dataset.

Then, the encoding of the categorical columns should be conducted.

Now, the encoding needs to be done on the categorical columns, which are Ship Mode, Segment, City, State, Region, Category, and Sub-Category. The data is passed through one-hot and label encoders to make it usable by the regression model. Here, LabelEncoder is imported from the sklearn library so that the function can be used.
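A sketch of the two-step encoding; which columns are label-encoded versus one-hot encoded is an assumption inferred from the output below, and the small frame stands in for the real one:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in frame; the real one also carries Segment, Category, Sub-Category, etc.
df = pd.DataFrame({
    "City": ["Memphis", "Chicago"],
    "State": ["Tennessee", "Illinois"],
    "Region": ["South", "Central"],
})

# High-cardinality columns get integer labels in *_no columns...
le = LabelEncoder()
for col in ["City", "State"]:
    df[col + "_no"] = le.fit_transform(df[col])

# ...while low-cardinality columns are one-hot encoded
df = pd.get_dummies(df, columns=["Region"])
```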

```
Customer_no Segment_no City State State_no Postal Code Region_no \
0 1 1 289 40 41 38109 1
1 2 2 80 11 12 60623 4
2 3 1 438 3 4 94122 2
3 4 2 186 41 42 76117 4
4 5 1 329 30 31 10035 3
Category_no Sub-Category_no Product Name_no ... Sub-Category_Machines \
0 2 1 1 ... 0
1 2 1 1 ... 0
2 2 1 1 ... 0
3 2 2 1 ... 0
4 2 2 1 ... 0
Sub-Category_Paper Sub-Category_Phones Sub-Category_Storage \
0 1 0 0
1 1 0 0
2 1 0 0
3 0 0 0
4 0 0 0
Sub-Category_Supplies Sub-Category_Tables Region_Central Region_East \
0 0 0 0 0
1 0 0 1 0
2 0 0 0 0
3 0 0 1 0
4 0 0 0 1
Region_South Region_West
0 1 0
1 0 0
2 0 1
3 0 0
4 0 0
[5 rows x 46 columns]
```

Now, the columns can be seen appropriately encoded to be usable.

In total, there are 46 features after the encoding is performed.

## Training the Linear Regression model

Now, the model can be trained on the data. This step requires first splitting the data using the train_test_split() function and then fitting the loaded regression model on the training set (Muraina, 2022).
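The training step can be sketched as follows, with synthetic linear data standing in for the 46 encoded features and the profit target; the test size and random state are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic, noiseless linear data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([150.0, 90.0, 20.0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, pred))
print("R-squared:", r2_score(y_test, pred))
```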

```
Mean Squared Error: 4657.717874059397
R-squared: 0.38099036425489474
```

The model achieves a Mean Squared Error of approximately 4657.72 and an R-squared value of around 0.381. This indicates that the model explains about 38.1% of the variance in the target variable, profit. However, the MSE is relatively high. The actual and predicted values can be plotted in a graph to highlight the outcome of model training.

The plot shows the linear nature of the predicted values compared to the actual values, which are more spread out. This explains the high MSE, since it measures the average squared distance of the points from the line. Now, the features can be extracted to understand the model better.
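Extracting and ranking the coefficients can be sketched like this; the synthetic frame, the three column names, and the planted coefficients are illustrative stand-ins for the real encoded features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients, so the ranking is predictable
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)),
                 columns=["Discount", "Sub-Category_Copiers", "Category_no"])
y = 150 * X["Discount"] + 90 * X["Sub-Category_Copiers"] + 20 * X["Category_no"]

model = LinearRegression().fit(X, y)
coef = pd.DataFrame({"Coefficient": model.coef_}, index=X.columns)

# Rank features by absolute coefficient magnitude
top = coef.reindex(coef["Coefficient"].abs().sort_values(ascending=False).index)
print("Top Contributing Features:")
print(top)
```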

## Most important features in the model

```
Top Contributing Features:
Coefficient
Discount 156.961539
Sub-Category_Copiers 97.090491
Sub-Category_Tables 72.499914
Sub-Category_Machines 55.798237
Sub-Category_Furnishings 50.049010
Sub-Category_Binders 39.988426
Sub-Category_Supplies 25.222659
Sub-Category_Storage 24.891543
Sub-Category_Phones 24.681700
Category_no 21.991924
```

The outcome of the model shows that the most contributing feature is the discount, followed by different sub-categories such as copiers, tables, machines, and other types of products. It can be stated that for the company to increase its profit, it should focus on providing discounts and on selling products such as copiers, tables, and machines.

# References

Kahwachi, W. (2020). A Comparison of WK4 and MSE for Regression Model Fitting. Journal of Al-Rafidain University College For Sciences (Print ISSN: 1681-6870, Online ISSN: 2790-2293), (1), pp.530-535.

Muraina, I. (2022). Ideal dataset splitting ratios in machine learning algorithms: general concerns for data scientists and data analysts. In 7th International Mardin Artuklu Scientific Research Conference (pp. 496-504).

Sapre, A. and Vartak, S. (2020). Scientific Computing and Data Analysis using NumPy and Pandas. International Research Journal of Engineering and Technology, 7, pp.1334-1346.