# C11BD Big Data Analytics 2023-2024: Individual Coursework 2

Submitted by: Faiqa Niaz (H00458048)

# Introduction

The Superstore company's operational trends and key performance metrics are analyzed in this report. Metrics pertaining to revenue, profitability, merchandise, and customers are examined comprehensively. The study aims to identify opportunities for development and areas for enhancement to support profitable strategic decision-making. The concise explanations and clear illustrations presented enable the company's stakeholders to understand and act on the insights. The report encourages data-driven decision-making and methods to boost earnings, keeping the business agile and responsive to changing market conditions.

# 3.1: Importing data

## Importing Libraries

The first step is to install the Wooldridge package using pip. Then, the key Python libraries are imported. These libraries facilitate data manipulation, data visualization, plotting, numerical calculations, and statistical functions.
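A minimal sketch of this step (the install command is shown as a comment because its output appears below; the import aliases are the conventional ones and are an assumption about the original notebook):

```python
# Install the course package first, e.g.:
# %pip install wooldridge statsmodels

import pandas as pd              # data frames and manipulation
import numpy as np               # numerical calculations
import matplotlib.pyplot as plt  # plotting and visualization

print(pd.__version__)  # confirm the environment is ready
```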

```
Requirement already satisfied: wooldridge in /usr/local/lib/python3.9/site-packages (0.4.4)
Requirement already satisfied: pandas in /shared-libs/python3.9/py/lib/python3.9/site-packages (from wooldridge) (2.1.4)
Requirement already satisfied: tzdata>=2022.1 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from pandas->wooldridge) (2022.5)
Requirement already satisfied: python-dateutil>=2.8.2 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from pandas->wooldridge) (2.8.2)
Requirement already satisfied: numpy<2,>=1.22.4 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from pandas->wooldridge) (1.23.4)
Requirement already satisfied: pytz>=2020.1 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from pandas->wooldridge) (2022.5)
Requirement already satisfied: six>=1.5 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas->wooldridge) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 23.0.1 -> 24.0
[notice] To update, run: python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
```

```
Requirement already satisfied: statsmodels==0.14.1 in /root/venv/lib/python3.9/site-packages (0.14.1)
Requirement already satisfied: numpy<2,>=1.18 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from statsmodels==0.14.1) (1.23.4)
Requirement already satisfied: patsy>=0.5.4 in /root/venv/lib/python3.9/site-packages (from statsmodels==0.14.1) (0.5.6)
Requirement already satisfied: scipy!=1.9.2,>=1.4 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from statsmodels==0.14.1) (1.9.3)
Requirement already satisfied: packaging>=21.3 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from statsmodels==0.14.1) (21.3)
Requirement already satisfied: pandas!=2.1.0,>=1.0 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from statsmodels==0.14.1) (2.1.4)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from packaging>=21.3->statsmodels==0.14.1) (3.0.9)
Requirement already satisfied: tzdata>=2022.1 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from pandas!=2.1.0,>=1.0->statsmodels==0.14.1) (2022.5)
Requirement already satisfied: python-dateutil>=2.8.2 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from pandas!=2.1.0,>=1.0->statsmodels==0.14.1) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from pandas!=2.1.0,>=1.0->statsmodels==0.14.1) (2022.5)
Requirement already satisfied: six in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from patsy>=0.5.4->statsmodels==0.14.1) (1.16.0)
[notice] A new release of pip is available: 23.0.1 -> 24.0
[notice] To update, run: pip install --upgrade pip
```

## Importing Data Set

First of all, the data set 'dataset_Superstore' is loaded into a new data frame.

After this we will be able to clean and analyze the data.
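The loading step can be sketched as follows. The file name and CSV format are assumptions based on the report; a tiny illustrative frame with a few of the real columns is built so the snippet runs on its own:

```python
import pandas as pd

# In the actual notebook the upload would be read directly, e.g.:
# df = pd.read_csv("dataset_Superstore.csv")

# Illustrative stand-in with a subset of the real columns:
df = pd.DataFrame({
    "Order Date": ["2017-11-13", "2017-09-03"],
    "Ship Date": ["2017-11-16", "2017-09-07"],
    "Sales": [8.904, 8.904],
    "Profit": [3.339, 3.339],
})
print(df.shape)  # (rows, columns)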

## Data Set Examination

To start the cleaning process, the data set is examined.
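A sketch of the examination step, which produces the `df.info()` output shown below (the small frame here is illustrative; in the report `df` is the full Superstore DataFrame):

```python
import pandas as pd

# Illustrative stand-in for the Superstore DataFrame.
df = pd.DataFrame({"Sales": [8.9, 7.4],
                   "Order Date": ["2017-11-13", "2017-10-30"]})

df.info()                 # column dtypes, non-null counts, memory usage
print(df.isnull().sum())  # missing values per column
```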

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row ID 9994 non-null int64
1 Order ID 9994 non-null object
2 Order Date 9994 non-null object
3 Ship Date 9994 non-null object
4 Ship Mode 9994 non-null object
5 Customer ID 9994 non-null object
6 Customer Name 9994 non-null object
7 Customer_no 9994 non-null int64
8 Segment 9994 non-null object
9 Segment_no 9994 non-null int64
10 Country 9994 non-null object
11 City 9994 non-null object
12 State 9994 non-null object
13 State_no 9994 non-null int64
14 Postal Code 9994 non-null int64
15 Region 9994 non-null object
16 Region_no 9994 non-null int64
17 Product ID 9994 non-null object
18 Category 9994 non-null object
19 Category_no 9994 non-null int64
20 Sub-Category 9994 non-null object
21 Sub-Category_no 9994 non-null int64
22 Product Name 9994 non-null object
23 Product Name_no 9994 non-null int64
24 Sales 9994 non-null float64
25 Quantity 9994 non-null int64
26 Discount 9994 non-null float64
27 Profit 9994 non-null float64
28 Returned 9994 non-null bool
dtypes: bool(1), float64(3), int64(10), object(15)
memory usage: 2.1+ MB
```

The data has no missing values. However, the date format of the two columns Order Date and Ship Date is not correct: they are stored as strings rather than dates.

# 3.2 Cleaning the data

## Converting Data types

In the following code, the data types of the two columns (Order Date and Ship Date) are converted to datetime.

Dates can be easily manipulated and analyzed in datetime format, making it possible to perform operations such as grouping by time periods.
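The conversion can be sketched as follows (illustrative data; the real frame has the same column names):

```python
import pandas as pd

df = pd.DataFrame({"Order Date": ["2017-11-13", "2017-09-03"],
                   "Ship Date": ["2017-11-16", "2017-09-07"]})  # illustrative

# Convert the string columns to datetime64 so date arithmetic and
# time-based grouping (e.g. df["Order Date"].dt.year) become possible.
df["Order Date"] = pd.to_datetime(df["Order Date"])
df["Ship Date"] = pd.to_datetime(df["Ship Date"])
print(df.dtypes)
```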

## Check for Duplicates

This code checks for and removes duplicate rows from the DataFrame `df`.

Eliminating duplicates guarantees unique observations and mitigates skewed analytical outcomes, enhancing the precision of analyses.
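A minimal sketch of the de-duplication step:

```python
import pandas as pd

df = pd.DataFrame({"Order ID": ["A", "A", "B"],
                   "Sales": [1.0, 1.0, 2.0]})  # illustrative; row 1 duplicates row 0

print("Duplicate rows:", df.duplicated().sum())  # rows identical to an earlier row
df = df.drop_duplicates()                        # keep only the first occurrence
print("Shape after removal:", df.shape)
```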

## Removing Outliers

Outliers distort analytical results, so it is very important to identify and remove them for accurate conclusions (Kwak et al., 2017). Outliers are removed from the Sales, Profit, Quantity and Discount data using the IQR method. The code blocks that follow locate and eliminate outliers from these four columns of the dataset. A function `remove_outliers` is defined, which accepts a DataFrame (`df`) and a column name (`col_name`) as inputs. For the given column, it computes the first quartile (Q1), the third quartile (Q3), and the interquartile range (IQR), and keeps only rows between Q1 - 1.5*IQR and Q3 + 1.5*IQR. Finally, the filtered DataFrame is assigned back to `df` after each call, effectively removing outliers from each specified column.

The code eliminates outliers effectively, improving the data's dependability and integrity for later examination. This systematic approach ensures that the dataset remains robust and conducive to meaningful analysis, ultimately contributing to more accurate and insightful conclusions regarding the factors influencing profitability within the company.
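The function described above can be sketched as follows (the six-row frame is illustrative, with one obviously extreme row):

```python
import pandas as pd

def remove_outliers(df, col_name):
    """Keep only rows whose col_name value lies within Q1 - 1.5*IQR .. Q3 + 1.5*IQR."""
    q1 = df[col_name].quantile(0.25)
    q3 = df[col_name].quantile(0.75)
    iqr = q3 - q1
    return df[(df[col_name] >= q1 - 1.5 * iqr) & (df[col_name] <= q3 + 1.5 * iqr)]

# Illustrative frame with one extreme row (the last one).
df = pd.DataFrame({"Sales": [10, 12, 11, 13, 9, 500],
                   "Profit": [2, 3, 2, 4, 1, 90],
                   "Quantity": [1, 2, 2, 3, 1, 9],
                   "Discount": [0.0, 0.2, 0.0, 0.2, 0.0, 0.8]})

for col in ["Sales", "Profit", "Quantity", "Discount"]:
    df = remove_outliers(df, col)  # reassign so each pass filters the previous result
print("Shape without outliers:", df.shape)
```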

# 3.3: Statistical summary of the cleaned data:

A statistical summary for DataFrame df with the following columns can be generated as follows:
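A sketch of the step that produces the summary table below (illustrative values; the real call runs on the cleaned frame):

```python
import pandas as pd

df = pd.DataFrame({"Sales": [14.8, 34.4, 82.3],
                   "Quantity": [2, 3, 4],
                   "Discount": [0.0, 0.0, 0.2],
                   "Profit": [3.2, 7.8, 17.2]})  # illustrative

# describe() reports count, mean, std, min, the quartiles and max per column.
summary = df[["Sales", "Quantity", "Discount", "Profit"]].describe()
print(summary)
```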

```
Sales Quantity Discount Profit
count 6703.000000 6703.000000 6703.000000 6703.000000
mean 66.781959 3.367597 0.096958 11.232683
std 84.472853 1.864167 0.105484 13.012873
min 0.990000 1.000000 0.000000 -27.715800
25% 14.835000 2.000000 0.000000 3.210000
50% 34.384000 3.000000 0.000000 7.780000
75% 82.260000 4.000000 0.200000 17.172600
max 496.860000 9.000000 0.500000 50.584800
```

The count shows that each column contains 6703 non-null values. The mean denotes each variable's average value; the average profit, for instance, is 11.23.

The standard deviation of each variable indicates how dispersed the data points are from the mean; a higher standard deviation indicates greater variety. Here, sales (84.47) and profit (13.01) have the largest standard deviations, indicating a wider range of values than quantity (1.86) and discount (0.10). Min and max show the minimum and maximum values for each variable. For instance, the minimum sales amount was 0.99, while the maximum was 496.86.

## Description of New Data:

The df.info() function provides a concise overview of the DataFrame df, including information on data types, non-null values, and memory usage. It is useful for understanding the DataFrame.

```
<class 'pandas.core.frame.DataFrame'>
Index: 6703 entries, 0 to 9993
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row ID 6703 non-null int64
1 Order ID 6703 non-null object
2 Order Date 6703 non-null datetime64[ns]
3 Ship Date 6703 non-null datetime64[ns]
4 Ship Mode 6703 non-null object
5 Customer ID 6703 non-null object
6 Customer Name 6703 non-null object
7 Customer_no 6703 non-null int64
8 Segment 6703 non-null object
9 Segment_no 6703 non-null int64
10 Country 6703 non-null object
11 City 6703 non-null object
12 State 6703 non-null object
13 State_no 6703 non-null int64
14 Postal Code 6703 non-null int64
15 Region 6703 non-null object
16 Region_no 6703 non-null int64
17 Product ID 6703 non-null object
18 Category 6703 non-null object
19 Category_no 6703 non-null int64
20 Sub-Category 6703 non-null object
21 Sub-Category_no 6703 non-null int64
22 Product Name 6703 non-null object
23 Product Name_no 6703 non-null int64
24 Sales 6703 non-null float64
25 Quantity 6703 non-null int64
26 Discount 6703 non-null float64
27 Profit 6703 non-null float64
28 Returned 6703 non-null bool
dtypes: bool(1), datetime64[ns](2), float64(3), int64(10), object(13)
memory usage: 1.5+ MB
None
```

Additionally, the print(df.head()) command outputs the DataFrame df's first five rows, making it possible to quickly examine the dataset.

```
Row ID Order ID Order Date Ship Date Ship Mode Customer ID \
0 3783 CA-2017-165204 2017-11-13 2017-11-16 Second Class MN-17935
1 7322 CA-2017-167626 2017-09-03 2017-09-07 Standard Class MY-18295
2 1709 CA-2017-123491 2017-10-30 2017-11-05 Standard Class JK-15205
3 2586 CA-2015-121041 2015-11-03 2015-11-10 Standard Class CS-12250
4 356 CA-2016-138520 2016-04-08 2016-04-13 Standard Class JL-15505
Customer Name Customer_no Segment Segment_no ... Category_no \
0 Michael Nguyen 1 Consumer 1 ... 2
1 Muhammed Yedwab 2 Corporate 2 ... 2
2 Jamie Kunitz 3 Consumer 1 ... 2
3 Chris Selesnick 4 Corporate 2 ... 2
4 Jeremy Lonsdale 5 Consumer 1 ... 2
Sub-Category Sub-Category_no \
0 Paper 1
1 Paper 1
2 Paper 1
3 Envelopes 2
4 Envelopes 2
Product Name Product Name_no Sales \
0 "While you Were Out" Message Book, One Form pe... 1 8.904
1 "While you Were Out" Message Book, One Form pe... 1 8.904
2 "While you Were Out" Message Book, One Form pe... 1 7.420
3 #10 Gummed Flap White Envelopes, 100/Box 1 6.608
4 #10 Gummed Flap White Envelopes, 100/Box 1 8.260
Quantity Discount Profit Returned
0 3 0.2 3.3390 False
1 3 0.2 3.3390 False
2 2 0.0 3.7100 True
3 2 0.2 2.1476 False
4 2 0.0 3.7996 False
[5 rows x 29 columns]
```

Non-null counts show that there are no missing values in the data, and all data types are suitable for the columns in which they belong.

## Size of New Data Frame

Let's check the shape of the data after eliminating outliers. This will give us the new row count.

```
Shape without outliers: (6703, 29)
```

After processing the data to eliminate outliers, 6703 rows and 29 columns make up the dataset.

# 3.4: Plotting

## Bar Charts

To visualize the relationship of profit with region and category, this code creates two bar charts.
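A sketch of the two charts, assuming the real column names `Category`, `Region` and `Profit` (the data and file name here are illustrative):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Illustrative data; the real frame has the same column names.
df = pd.DataFrame({
    "Category": ["Technology", "Furniture", "Office Supplies", "Technology"],
    "Region":   ["West", "East", "West", "South"],
    "Profit":   [30.0, 5.0, 8.0, 25.0],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df.groupby("Category")["Profit"].mean().plot(kind="bar", ax=axes[0],
                                             title="Average profit by category")
df.groupby("Region")["Profit"].mean().plot(kind="bar", ax=axes[1],
                                           title="Average profit by region")
fig.tight_layout()
fig.savefig("profit_bars.png")  # hypothetical output file name
```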

The bar chart on the left displays the average profit across product categories; technology products generate the largest average profit compared to office supplies and furniture. This suggests that the company might consider focusing more on products in the "Technology" category in order to increase sales. Likewise, the greater profit margin in the West region indicates advantageous market circumstances, robust customer demand, and viable business strategies that may be pursued there.

## Scatter Plot:

The scatter plot produced by this code shows the link between "Sales" and "Profit".
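A minimal sketch of the plot (illustrative data; axis labels match the report's variables):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

df = pd.DataFrame({"Sales": [10, 50, 100, 200],
                   "Profit": [2, 9, 18, 35]})  # illustrative

fig, ax = plt.subplots()
ax.scatter(df["Sales"], df["Profit"], alpha=0.5)  # one point per order
ax.set_xlabel("Sales")
ax.set_ylabel("Profit")
ax.set_title("Sales vs Profit")
fig.savefig("sales_profit_scatter.png")  # hypothetical output file name
```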

The scatter plot indicates that "Profit" and "Sales" have a significant association. There is a positive correlation between these two variables, indicating that profit growth tends to follow sales growth. This suggests that strategies to increase sales could have a large effect on profitability.

## Correlation Matrix
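The matrix interpreted below can be computed as follows (illustrative data; the real call uses the cleaned frame's numeric columns):

```python
import pandas as pd

df = pd.DataFrame({"Sales": [10, 50, 100, 200],
                   "Quantity": [1, 2, 3, 5],
                   "Discount": [0.2, 0.0, 0.2, 0.0],
                   "Profit": [2, 9, 18, 40]})  # illustrative

# Pairwise Pearson correlations between the numeric columns.
corr = df[["Sales", "Quantity", "Discount", "Profit"]].corr()
print(corr.round(2))
```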

Profit and sales are positively correlated (about 0.37), indicating that profit typically rises with an increase in sales. A positive association can also be seen between "Quantity" and "Profit", but it is weaker than the relation between "Sales" and "Profit". There is a negative correlation (about -0.28) between "Discount" and "Profit", indicating that bigger discounts may result in lower profits. Overall, "Profit" and "Sales" have the most significant association, suggesting that strategies to increase sales could significantly affect profitability.

# 3.5: Modelling

## k-means clustering

A popular machine learning technique for splitting data into discrete groups, or clusters, is K-means clustering. The objective is to group data points into K clusters, where the user specifies K. K-means clustering is useful when working with unlabeled data because it can identify inherent groupings without the requirement for pre-established categories. Furthermore, because K-means is scalable, it can handle big datasets effectively once the ideal number of clusters is established. K-means clustering is frequently utilized for activities such as customer segmentation (Wu, 2021).
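A sketch of the clustering step that produces the summaries below, assuming scikit-learn's `KMeans` on scaled Sales, Profit and Discount with k = 3 (matching the three clusters reported; the small frame here is illustrative, not the real data):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative data; the real analysis clusters the cleaned Superstore frame.
df = pd.DataFrame({"Sales": [40, 35, 200, 45, 30, 190],
                   "Profit": [4, 10, 30, 3, 11, 29],
                   "Discount": [0.2, 0.0, 0.1, 0.2, 0.0, 0.1]})

features = ["Sales", "Profit", "Discount"]
X = StandardScaler().fit_transform(df[features])  # scale so no feature dominates

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(X)

# Per-cluster means, in the same shape as the output below.
for label, group in df.groupby("Cluster"):
    print(f"Cluster {label} Summary:")
    print(group[features].mean())
```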

```
Cluster 0 Summary:
Sales 47.030518
Profit 3.595414
Discount 0.207335
dtype: float64
Cluster 1 Summary:
Sales 33.685037
Profit 10.492859
Discount 0.000335
dtype: float64
Cluster 2 Summary:
Sales 198.606342
Profit 30.553544
Discount 0.099428
dtype: float64
```

Cluster 0 represents a segment with a higher discount rate but comparatively low sales and profit. This shows that even with the larger discounts being given, sales and profit margins are not greatly increased, so discount techniques in this segment might need to be re-evaluated.

Out of the three clusters, Cluster 1 has the lowest sales but, surprisingly, the largest profit, along with a negligible discount rate. This segment can be described as high-efficiency: sales generate more profit, possibly due to high-margin goods or efficient cost management. It implies that keeping discount rates low does not hurt profitability.

With a moderate discount rate, Cluster 2 exhibits the highest sales and profit figures. This implies a sensible balance between preserving profit margins and encouraging sales through discounts, and can mean that there is strong demand for the products in this segment.

### Recommendations

Given the different discount impacts in Clusters 0 and 2, it appears that applying discounts wisely is important. Identifying the items or categories in Cluster 0 that do not benefit from discounts is essential, and the discount plan should be modified accordingly.

One advantage of Cluster 1 is its strong profitability at low discount rates. Therefore, it is critical to concentrate on the factors that drive profitability in this segment, for example by finding the best product mix.

The profitable, high-volume sales of Cluster 2 show a successful matching of product and market. It is therefore necessary to analyze the features of the items that make up this cluster.

Using cluster information to tailor marketing strategies is advised. Customers in the high-profit, low-discount cluster (Cluster 1), for instance, might be targeted with loyalty programs or premium products, while those in Cluster 2 might be the focus of strategically timed discounts to increase volume sales.

## Regression Analysis

In the following code, an OLS regression model is used to link the outcome variable, "Profit", with three other variables. The method used is Ordinary Least Squares (OLS), which estimates coefficients by minimizing the sum of squared errors. It requires little computing power and is frequently used because of its simplicity, and it offers a range of statistical tests, such as the F-test and R-squared. OLS regression coefficients are a useful analytical tool because they provide information on the relationships between the independent and dependent variables (Rout, 2020).

```
OLS Regression Results
==============================================================================
Dep. Variable: Profit R-squared: 0.286
Model: OLS Adj. R-squared: 0.286
Method: Least Squares F-statistic: 895.8
Date: Mon, 18 Mar 2024 Prob (F-statistic): 0.00
Time: 10:51:59 Log-Likelihood: -25580.
No. Observations: 6703 AIC: 5.117e+04
Df Residuals: 6699 BIC: 5.119e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 6.8629 0.312 22.015 0.000 6.252 7.474
Sales 0.0613 0.002 37.876 0.000 0.058 0.065
Quantity 1.2841 0.073 17.692 0.000 1.142 1.426
Discount -41.7831 1.289 -32.418 0.000 -44.310 -39.256
==============================================================================
Omnibus: 387.679 Durbin-Watson: 1.019
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1642.626
Skew: -0.046 Prob(JB): 0.00
Kurtosis: 5.423 Cond. No. 1.04e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.04e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```

An estimator's mean squared error (MSE) measures the average of the squared errors, that is, the average squared difference between the estimated values and the true values.
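The exact scoring and test used in the notebook are not shown; a plausible sketch uses scikit-learn's `neg_mean_squared_error` (which explains the negative sign in the output below) and a two-sample t-test comparing profit on discounted versus undiscounted orders. All data here is synthetic:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data in the spirit of the Superstore variables.
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({"Sales": rng.uniform(1, 500, n),
                   "Quantity": rng.integers(1, 10, n),
                   "Discount": rng.choice([0.0, 0.2, 0.5], n)})
df["Profit"] = (0.06 * df["Sales"] + 1.3 * df["Quantity"]
                - 40 * df["Discount"] + rng.normal(0, 5, n))

X, y = df[["Sales", "Quantity", "Discount"]], df["Profit"]

# scikit-learn reports MSE as a negative score ("neg_mean_squared_error"),
# which is why the printed value is below zero.
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print("Cross-validated MSE scores:", scores.mean())

# Two-sample t-test: do discounted and undiscounted orders differ in profit?
t_stat, p_value = stats.ttest_ind(df.loc[df["Discount"] > 0, "Profit"],
                                  df.loc[df["Discount"] == 0, "Profit"])
print(f"T-statistic: {t_stat}, P-value: {p_value}")
```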

```
Cross-validated MSE scores: -124.08605214350754
T-statistic: -20.51875544601299, P-value: 1.0972079793238714e-90
```

The predictive performance of the model is measured by the cross-validated Mean Squared Error (MSE); scikit-learn reports MSE as a negative score, so -124.086 corresponds to an MSE of about 124.086. This indicates that although the model can make predictions, its accuracy could be improved. In future work, predictive performance can be improved by minimizing the MSE.

Regarding the effect of discounts on profit, the model's conclusion is supported by the significant T-statistic (-20.518) and P-value (approximately 0). Discounts therefore have a substantial impact on profitability and should be carefully managed.

### Multicollinearity Checking

It is important to check for multicollinearity amongst the predictor variables before drawing any conclusions or predictions from the regression model.

```
feature VIF
0 const 5.387798
1 Sales 1.037473
2 Quantity 1.014792
3 Discount 1.024611
```

All VIF values for the predictors are close to 1 (the constant's VIF is not interpreted), which means that multicollinearity is not a significant concern here. This implies that the independent variables justify their inclusion in the model.

### Segment Analysis

Segment analysis is done using the following code. It will examine how various segments influence profit.

```
Category: Office Supplies
OLS Regression Results
==============================================================================
Dep. Variable: Profit R-squared: 0.304
Model: OLS Adj. R-squared: 0.303
Method: Least Squares F-statistic: 651.7
Date: Mon, 18 Mar 2024 Prob (F-statistic): 0.00
Time: 10:52:00 Log-Likelihood: -16405.
No. Observations: 4483 AIC: 3.282e+04
Df Residuals: 4479 BIC: 3.284e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4.3405 0.322 13.473 0.000 3.709 4.972
Sales 0.0795 0.002 32.969 0.000 0.075 0.084
Quantity 1.3352 0.075 17.895 0.000 1.189 1.482
Discount -20.9474 1.421 -14.744 0.000 -23.733 -18.162
==============================================================================
Omnibus: 454.214 Durbin-Watson: 0.759
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2735.416
Skew: 0.282 Prob(JB): 0.00
Kurtosis: 6.785 Cond. No. 747.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Category: Furniture
OLS Regression Results
==============================================================================
Dep. Variable: Profit R-squared: 0.379
Model: OLS Adj. R-squared: 0.377
Method: Least Squares F-statistic: 231.5
Date: Mon, 18 Mar 2024 Prob (F-statistic): 2.95e-117
Time: 10:52:00 Log-Likelihood: -4487.0
No. Observations: 1143 AIC: 8982.
Df Residuals: 1139 BIC: 9002.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 11.9927 0.860 13.946 0.000 10.306 13.680
Sales 0.0441 0.003 12.815 0.000 0.037 0.051
Quantity 0.9224 0.212 4.343 0.000 0.506 1.339
Discount -86.6221 3.465 -25.002 0.000 -93.420 -79.824
==============================================================================
Omnibus: 3.965 Durbin-Watson: 1.457
Prob(Omnibus): 0.138 Jarque-Bera (JB): 4.243
Skew: -0.071 Prob(JB): 0.120
Kurtosis: 3.263 Cond. No. 1.57e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.57e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Category: Technology
OLS Regression Results
==============================================================================
Dep. Variable: Profit R-squared: 0.478
Model: OLS Adj. R-squared: 0.477
Method: Least Squares F-statistic: 328.1
Date: Mon, 18 Mar 2024 Prob (F-statistic): 3.81e-151
Time: 10:52:00 Log-Likelihood: -4167.4
No. Observations: 1077 AIC: 8343.
Df Residuals: 1073 BIC: 8363.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 10.4490 0.865 12.085 0.000 8.752 12.146
Sales 0.1010 0.004 26.321 0.000 0.093 0.109
Quantity 0.1947 0.210 0.926 0.354 -0.218 0.607
Discount -60.4550 3.089 -19.571 0.000 -66.516 -54.394
==============================================================================
Omnibus: 59.738 Durbin-Watson: 1.106
Prob(Omnibus): 0.000 Jarque-Bera (JB): 178.982
Skew: -0.210 Prob(JB): 1.36e-39
Kurtosis: 4.952 Cond. No. 1.28e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.28e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```

### Conclusions

With an R-squared of 0.304, the model explains about 30.4% of the variation in profit within the Office Supplies category. Profit is positively correlated with sales and quantity, meaning that if these factors rise, profit will also rise. However, the discount's negative coefficient (-20.9474) suggests that larger discounts lower profit margins in this category.

With an R-squared of 0.379, the model accounts for 37.9% of the variation in profit across the Furniture category, a moderate link. Sales and quantity have a positive association with profit, but discounts have a strong negative influence, with a coefficient of -86.6221. This shows that in the furniture sector discounts have a negative effect on profit, so decreasing discounts could increase profitability.

The Technology model explains 47.8% of the variability in profit (R-squared of 0.478), a stronger link than the other two categories. The positive coefficients of sales (0.1010) and quantity (0.1947) suggest that these elements can raise profit in the Technology area, although the quantity coefficient is not statistically significant (p = 0.354). The discount coefficient is negative (-60.455), indicating that discounts reduce profits.

### Recommendations

In order to strike a balance between increasing sales and preserving profit margins, the discount approach needs to be re-evaluated, particularly in the furniture sector. Given that sales and quantity sold have the strongest association with profit in the Technology sector, the company should concentrate on growing these metrics there. It is advisable to sustain sales volume and quantity in office supplies.

According to the OLS regression, discounts have a negative coefficient, meaning that larger discounts are linked to poorer profitability. So discounts may not always result in higher profitability, even when they improve sales volume. Targeted discount tactics can be used; for example, offering discounts on items that have greater margins when sold in bulk. Similarly, profit margins can be protected by setting threshold discounts, where clients must spend a specific amount before receiving a discount.

Since the return on investment for technology appears to be higher, devoting more resources to it, including marketing and sales initiatives, would be a good strategy.

Bundling products together can boost sales volume without always raising discounts. Bundling can increase business margins and provide value for customers. Additionally, data can be used to determine which products are frequently purchased in combination.

To customize regional strategies, it will be beneficial to analyze the variations in consumer behavior, product choices, and market penetration among regions. To capitalize on current market advantages, investment can be expanded in high-profit areas.

To reduce errors, it is necessary to check data entry and collection procedures on a regular basis. Using efficient processes for identifying and handling outliers is essential to preserving the analysis's integrity. As a result, analysis is more reliable and resistant to anomalies in the data.

## References

- Brownlee, J., 2020. Data preparation for machine learning: data cleaning, feature selection, and data transforms in Python. Machine Learning Mastery.
- Hassan, A.A.H., Shah, W., Husein, A.M., Talib, M.S., Mohammed, A.A.J. and Iskandar, M., 2019. Clustering approach in wireless sensor networks based on K-means: limitations and recommendations. Int. J. Recent Technol. Eng., 7(6), pp.119-126.
- Haruyama, T. (n.d.). wooldridge: Data sets from Introductory Econometrics: A Modern Approach (6th ed., J.M. Wooldridge). [online] PyPI. Available at: https://pypi.org/project/wooldridge/ [Accessed 17 Mar. 2024].
- Kwak, S.K. and Kim, J.H., 2017. Statistical data preparation: management of missing values and outliers. Korean Journal of Anesthesiology, 70(4), p.407.
- Rout, A.R. (2020). ML - Advantages and Disadvantages of Linear Regression. [online] GeeksforGeeks. Available at: https://www.geeksforgeeks.org/ml-advantages-and-disadvantages-of-linear-regression/.
- Wu, B. (2021). K-means clustering algorithm and Python implementation. [online] IEEE Xplore. doi: https://doi.org/10.1109/CSAIEE54046.2021.9543260.