C11BD Big Data Analytics 2023-2024: Individual Coursework 2

submitted by: Faiqa Niaz :H00458048

Intorduction

The Super store company's operational trends and key performance metrics are analyzed in this research. Comprehensive examination of metrics pertaining to revenue, profitability, merchandise, and customers are evaulated. The study aims to offer opportunities for development and areas for enhancement to support profitable strategic decision-making processes. The company's stakeholders have the authority to comprehend and take action on the insights given by the concise explanation and lucid illustrations that are displayed. The report encourages data-driven decision-making and methods to boost earnings in order to maintain the business agile and responsive to changing market conditions.

3.1: Importing data

Importing Libraries

The installation of the Wooldridge package to use pip will be the first step in the procedure. Then, important libraries will be imported from pythons. These libraries facilitate in data manipulation, data visualization, plotting data, doing numerical calculations and statistical function.

Importing Data Set

So, First of all data set 'dataset_Superstore' is uploaded in new data fram

After this we will be able to clean and analyze the data

Data Set Examination

To start the cleaning data set is examined

Tha data has no missing values. However the date format for two columns Shaip Date and Ship Order are not correct.

3.2 Cleaning the data

Converting Data types

In the following code data types of two columns({Order Date)} and ({Ship Date)} are converted

Dates may be easily manipulated and analyzed using datetime format, making it possible to perform activities like grouping by time periods.

Check for Duplicates

This code checks for and removes duplicate rows from the DataFrame "df"

Eliminating duplicates guarantees unique observations and mitigates skewed analytical outcomes, hence enhancing the precision of analyses.

Removing Outliers

Outliers impact the result of data. So, it is very importonat to indentify and remove outliers for accurate results (Kwak et al.2017). Outliers are removed from Sale, profit and quantity data using IQR method. The code blocks that follow locate and eliminate outliers from the dataset's four columns. It defines the `remove_outliers` function, which accepts as inputs a DataFrame {df}) and a column name ({col_name}). For the given column, it computes the interquartile range (IQR), first quartile (Q1), and third quartile (Q3) . Finally it assigns the filtered DataFrame back to `df` each time, effectively removing outliers from each specified column.

The code effectively eliminates outliers, thus improving the data's dependability and integrity for later examination.. This systematic approach ensures that the dataset remains robust and conducive to meaningful analysis, ultimately contributing to more accurate and insightful conclusions regarding the factors influencing profitability within the company.

3.3: Statistical summary of the cleaned data:

A statistical summary for DataFrame df with the following columns can be generated as follows:

The count shows that each column contains 6703 non-null data. The mean denotes each variable's average value. The average profit, for instance, is 11.23.

The standard deviation of each variable indicates how dispersed the data points are from the mean. Greater data variety is indicated by a higher standard deviation. Here, the profit (13.01) and sales (84.47) have the biggest standard deviations, indicating a larger range of values than quantity (1.86) and discount (0.10). The lowest and maximum values for each variable are shown, respectively, by min and max. For instance, the minimum sales amount was 0.99, while the maximum sales were 496.86.

Description of New Data:

The df.info() function provides a concise overview of the DataFrame df, including information on data types, non-null values, and memory usage. It is useful to understanding the DataFrame.

Additionally, the DataFrame df's first five rows are output via the print(df.head()) command, making it possible to quickly examine the dataset

Non-null counts show that there are no missing values in the data, and all data types are suitable for the columns in which they belong.

Size of New Data Frame

Let's Check the shape of data after eleiminating outliers. This will give us new count of Rows

After processing the data to eliminate outliers, 6703 rows and 29 columns make up the dataset.

3.4: Plotting

Bar Charts

To see the relationship between profit and region and category, this code creates two bar charts.

The average profit across product categories is displayed in the bar chart on the left, where technology products generate the largest average profit when compared to office supplies and furnishings. This suggests that the might consider focusing more on products in the "Technology" category in order to increase sales. Likewise, the greater profit margin in the "West region" indicates that there are advantageous market circumstances, robust client demand, and viable business plans that may be pursued there.

Scatter Plot:

The scatter plot produced by this code shows the link between "Sales" and "Profit.

The scatter plot indicates that "Profit" and "Sales" have significant association. There is a substantial positive correlation between these two variables, indicating that profit growth tends to follow sales increase. This suggests that increasing sales strategies could have a big effect on profitability.

Correlation Matrix

Profit and sales are positively correlated (about 0.37), indicating that profit typically rises with increase in sales. A stronger positive association can be shown between "Quantity" and "Profit" but is is less strong as compared to relation between "Sales" and "Profit.". There is a negative correlation (about -0.28) between "Discount" and "Profit," indicating that bigger discounts may result in fewer profits. The scatter plot indicates that "Profit" and "Sales" have the most significant association. This suggests that increasing sales strategies could significantly affect profitability.

3.5: Modelling

k-means clustering

A popular machine learning technique for splitting data into discrete groups or clusters is K-means clustering. The objective is to group data points into K number of clusters, where the user specifies K. When working with unlabeled data, K-means clustering is useful because it can identify inherent groupings without the requirement for pre-established categories. Furthermore, because K-means is scalable, it can handle big datasets effectively after the ideal number of clusters is established. For activities like consumer segmentation, K-means clustering is frequently utilized (Wu, 2021).

A cluser with a higher discount rate but comparatively low sales and profit is represented by Cluster 0. This shows that sales and profit margins are not greatly increased even with the larger discounts being given. So, discount techniques in this area might need to be reevaluated.

Out of the three clusters, Cluster 1 has the lowest sales but, surprisingly, the largest profit, along with a very negligible discount rate. This segment can be defined as high-efficiency if sales generate more profit due to high-margin goods or efficient cost management. It implies that keeping discount rates low doesn't hurt profitability

With a reasonable discount rate, Cluster 2 exhibits the highest sales and profit figures. This implies striking a sensible balance between preserving profit margins and encouraging sales through discounts. It can mean that there is a great demand for the products in this segment.

Recommendations

Given the different discount impacts in Clusters 0 and 2, it appears that applying discounts wisely is important. Finding the items or categories in Cluster 0 that don't profit from discounts is essential, and the plan should be modified accordingly.

One advantage of Cluster 1 is its strong profitability at low discount rates. Therefore, it's critical to concentrate on the factors that influence profitability in this segment. For example it can be finding best product mix.

The profitable and large sales of Cluster 2 show a successful matching of product and market. Thus it is i necessary to analyze the features of the items that make up this cluster.

Using cluster information to tailor marketing strategies is advised. Customers in the high-profit, low-discount cluster (Cluster 1), for instance, might be targeted with loyalty programs or premium products, while those in Cluster 2 might be the focus of strategically timed discounts to increase volume sales

Regression Analysis

In the following codes OLS Regression analysis model is used to link between the outcome variable, "Profit," and other three variables. Method used in linear regression is the Ordinary Least Squares (OLS) approach, which minimizes the sum of squared errors to estimate coefficients. It requires little computer power and is a frequently used because of its simplicity. It offers a range of statistical tests, such as F-test and R-squared. OLS regression coefficients are a useful analytical tool because they provide information on the relationships between independent and dependent variables (Rout, 2020).

The average of error squares, or the average squared difference between the estimated values and true value, is measured by an estimator's mean squared error (MSE).

The predictive performance of the model is measured by the cross-validated Mean Squared Error (MSE). A mean mean square error of -124.086 indicates that although the model can make predictions, its accuracy may be enhanced. In the future, predictive performance can be improved by minimizing MSE in models.

Regarding the effect of discounts on profit, the model's conclusion is supported by the significant T-statistic (-20.518) and P-value (around 0). So, discounts have a substantial impact on profitability and should thus be carefully managed

Multicollinearity Checking

It is importabt to look for multicollinearity amongst the predictor variables before drawing any conclusions or predictions from the regression model.

All VIF values are close to 1 which means that multicollinearity is not a significant concern here. This implies that the independent variables justify their inclusion in the model.

Segment Analysis

Segment analysis is done using the following code. It will examine how various segments influence profit.

Conclusions

With an R-squared of 0.304, the model appears to be able to explain about 30.4% of the variation in profit within the Office Supplies category. Profit is positively correlated with sales and quantity, meaning that if these factors rise, profit will aslo rise. However, the discount's negative coefficient (-20.9474) suggests that lower profit margins in this category are due to larger discounts.

With an R-squared of 0.379, the model is able to account for 37.9% of the variation in profit across the Furniture category. This is a moderate link. Sales and quantity have a positive association with profit, however discounts have negative influence, with a coefficient of -86.6221. This shows that in furniture sector discounts have negative affect on profit. So, decrease in discounts could increase profitability.

This model explains 47.8% of the variability in profit, which is a stronger link than the other two categories. Its R-squared value is 0.478. The positive coefficients of quantity (0.1947) and sales (0.11010) , demonstrate that these elements can raise profit in the Technology area. The discount coeffiecient is negative(-60.455), indicating that it is affecting profits.

Recomendations

In order to make a balance between increasing sales and preserving profit margins, the discount approach needs to be reevaluated, particularly in the furniture sector. Given that sales and quantity sold have the strongest association with profit, the Technology sector should concentrate on growing these metrics. It is advisable to sustain sales volume and quantity of office supplies.

Discounts have a negative coefficient, according to the OLS regression, meaning that larger discounts are linked to poorer profitability. So, discount discounts could not always result in higher profitability, even though they are improving sales volume. Discount tactics can be used ; for example,offering discount on items that have greater margins when sold in bulk. similarly profit margins can be protected by setting discounts, where clients must spend a specific amount before receiving a discount.

Since the return on investment for technology appears to be higher, devoting more resources to it, including marketing and sales initiatives would be a good strategy.

Bundling products together can boost sales volume without always raising discounts. Bundling can increase business margins and provide value for customers. Additionally, data can be used to determine which products are frequently purchased in combination.

To customize regional strategies, it will be beneficial to analyze the variations in consumer behavior, product choices, and market penetration among regions.To capitalize on current market advantages, investment can be expanded in high-profit areas

To reduce errors, it is necessary to check data entry and collection procedures on a regular basis. Using efficient processes for identifying and handling outliers is essential to preserving the analysis's integrity. As a result, analysis is more reliable and resistant to anomalies in the data.

Limitations of the Methods used:

OLS regression is extremely sensitive to data outliers. The predicted coefficients can be greatly impacted by outliers, which can also skew the overall model fit. (Rout, 2020). The relationship between the dependent and independent variables is assumed to have a particular functional shape in OLS regression. The OLS estimations could be skewed if this presumption is broken.

The requirement to predetermine the number of clusters (k) is one of the primary disadvantages of K-means clustering. This can be difficult since selecting the incorrect value can produce unsatisfactory outcomes(Hassan, 2019). Data with different cluster densities and sizes are difficult for k-means to cluster. (Hassan, 2019)

The IQR's drawback is that it ignores the distribution of the remaining observations and only takes into account the middle 50% of the data (Brownlee, 2020)

• Brownlee, J., 2020. Data preparation for machine learning: data cleaning, feature selection, and data transforms in Python. Machine Learning Mastery. • Hassan, A.A.H., Shah, W., Husein, A.M., Talib, M.S., Mohammed, A.A.J. and Iskandar, M., 2019. Clustering approach in wireless sensor networks based on K-means: Limitations and recommendations. Int. J. Recent Technol. Eng, 7(6), pp.119-126. • Haruyama, T. (n.d.). wooldridge: Data sets from Introductory Econometrics: A Modern Approach (6th ed, J.M. Wooldridge). [online] PyPI. Available at: https://pypi.org/project/wooldridge/ [Accessed 17 Mar. 2024]. • Kwak, S.K. and Kim, J.H., 2017. Statistical data preparation: management of missing values and outliers. Korean journal of anesthesiology, 70(4), p.407. • Rout, A.R. (2020). ML - Advantages and Disadvantages of Linear Regression. [online] GeeksforGeeks. Available at: https://www.geeksforgeeks.org/ml-advantages-and-disadvantages-of-linear-regression/. • Wu, B. (2021). K-means clustering algorithm and Python implementation. [online] IEEE Xplore. doi:https://doi.org/10.1109/CSAIEE54046.2021.9543260.

.css-15w88e5{color:var(--chakra-colors-fg-neutral-primary);font-weight:inherit;letter-spacing:-0.09px;}C11BD Big Data Analytics 2023-2024: Individual Coursework 2