# C11BD Big Data Analytics: Coursework 2

Esme Irving H00321992

Word Count: 2064

18 March 2024

## Introduction

In the current business landscape, leveraging data analytics has become vital for companies striving to maintain competitiveness across diverse markets and industries (Cui et al., 2022). This report undertakes an in-depth analysis of the dataset 'Superstore.csv', applying data analytics to identify ways of enhancing profitability. The overarching goal is to unearth valuable insights that can guide strategic decision-making to improve the company's financial performance.

The analysis begins by importing the dataset and re-labelling columns to ensure clarity and coherence. It then moves to data cleaning, sifting through the records to identify outliers and amend any skewed data that could distort the analysis. Addressing data gaps is prioritised, as inaccurate or incomplete information can obscure comprehension and lead to inconsistent conclusions. These gaps are filled through comprehensive exploration techniques, establishing a robust analytical framework.

Exploration continues with summary statistics, delving into the dataset's central tendencies, distributions, and variability. The report then transitions to visualising the data, employing various plotting techniques to highlight patterns, trends, and relationships.

Advanced analytical models, namely simple linear regression, multiple linear regression and K-means clustering, are then applied to uncover insights and predictive relationships within the dataset. Identifying the drivers of profitability and recognising areas with potential for improvement will present the company with actionable insight and guide informed decision-making to maximise its potential.

Ultimately, this report aims to provide the client with the insights necessary to navigate the intricacies of the modern business landscape and drive greater profitability.

## Importing the Data

The first step in exploring the company's dataset is to ensure it meets the company's expectations and requirements. The '.describe()' function provided a summary of the dataset, revealing key statistical metrics, including the mean, standard deviation, and quartiles, across the data frame's numeric columns.
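
The import and summary step might look like the following minimal sketch. It assumes pandas, and the four rows and their column names are made up to stand in for 'Superstore.csv'; they are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'Superstore.csv' (illustrative rows only).
# With the real file this would be: df = pd.read_csv("Superstore.csv")
sample_csv = io.StringIO(
    "Region,Category,Sales,Quantity,Discount,Profit\n"
    "West,Technology,200.0,2,0.0,50.0\n"
    "East,Furniture,100.0,3,0.2,-10.0\n"
    "South,Office Supplies,50.0,5,0.0,20.0\n"
    "Central,Office Supplies,150.0,2,0.2,40.0\n"
)
df = pd.read_csv(sample_csv)

# By default describe() summarises only the numeric columns:
# count, mean, standard deviation, minimum, quartiles and maximum.
summary = df.describe()
print(summary)
```

Passing `include="all"` to `describe()` would also summarise the text columns, which can be useful before re-labelling.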

## Re-labelling the Data

Ensuring clarity, consistency, and relevance in the dataset was vital. Hence, the data columns within the 'Superstore.csv' dataset were re-labelled to align with the overarching objective. Leveraging the '.info()' function provided a comprehensive overview of the dataset, revealing any inconsistencies in column names that could affect subsequent analysis. Following this, employing the 'pd.unique()' function enabled the exploration of the unique values within each column, identifying inconsistencies and providing insights into the categorical variables in the dataset. This process enabled a comprehensive understanding of each column. Specifically, the focus was directed towards the 'Category', 'Sub-category', 'Segment', and 'State' columns, which were deemed important in conveying critical company information.
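
A sketch of how the re-labelling and inspection could be done with pandas. The raw column labels and values below are hypothetical, chosen only to show inconsistent naming; the report does not list the original labels.

```python
import pandas as pd

# Hypothetical raw frame with inconsistent column labels (assumed for illustration).
raw = pd.DataFrame(
    {
        "category": ["Furniture", "Technology", "Furniture"],
        "Sub-Category": ["Chairs", "Phones", "Tables"],
        "segment ": ["Consumer", "Corporate", "Consumer"],
        "State": ["Texas", "Ohio", "Texas"],
    }
)

# Re-label the columns so naming is consistent across the data frame.
relabelled = raw.rename(
    columns={
        "category": "Category",
        "Sub-Category": "Sub-category",
        "segment ": "Segment",
    }
)

# info() prints the column names, non-null counts and dtypes.
relabelled.info()

# pd.unique() returns the distinct values of a column in order of appearance.
unique_values = {col: list(pd.unique(relabelled[col])) for col in relabelled.columns}
print(unique_values)
```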

## Cleaning the Data

To ensure data reliability, the dataset was cleaned by identifying and rectifying outliers that could skew the analysis. Following data cleaning, summary statistics were generated to gather insights into the characteristics of the refined dataset. The next step involved filtering out all discounts outside the range 0 to 1 to allow a more focused examination of business practices, as discounts are percentages expressed as proportions between 0 and 1, so values outside this range could signify outliers or data entry errors.
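
The discount filter can be sketched as below. The values are invented, and treating the boundaries 0 and 1 as valid is an assumption, since the report does not say whether they were kept.

```python
import pandas as pd

# Hypothetical orders; two rows carry impossible discount values.
orders = pd.DataFrame(
    {
        "Discount": [0.0, 0.2, 1.5, -0.1, 0.8],
        "Profit": [50.0, -10.0, 5.0, 12.0, -30.0],
    }
)

# Keep only rows whose discount lies between 0 and 1 (both ends inclusive).
valid = orders[orders["Discount"].between(0, 1)]

removed = len(orders) - len(valid)
print(f"Removed {removed} rows with discounts outside [0, 1]")
```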

## Gap Values

Examining the distribution and spread of the data was imperative to understand the dataset's characteristics. This involved calculating the "gap" values between the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. Examining the gaps in profit values helped to identify the range of profit margins and to evaluate the variability of profit across products or regions, identifying prospective areas for cost optimisation.
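
One reading of the "gap" calculation is the distance between consecutive points of the five-number summary. The sketch below follows that reading with made-up profit values; the report does not show its exact formula, so the dictionary keys are hypothetical names.

```python
import pandas as pd

# Hypothetical profit values (illustrative only).
profit = pd.Series([-40.0, -5.0, 10.0, 20.0, 35.0, 60.0, 120.0], name="Profit")

# Quartiles using pandas' default linear interpolation.
q1, median, q3 = profit.quantile([0.25, 0.5, 0.75])

# Gaps between consecutive points of the five-number summary.
gaps = {
    "min_to_q1": q1 - profit.min(),
    "q1_to_median": median - q1,
    "median_to_q3": q3 - median,
    "q3_to_max": profit.max() - q3,
}
iqr = q3 - q1  # interquartile range: spread of the middle half of the data

print(gaps, iqr)
```

A `q3_to_max` gap much larger than the others, as in this toy series, points to a long upper tail and possible outliers.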

## Summary statistics

Summary statistics are pivotal in exploratory data analysis and decision-making, although Anscombe's quartet shows that they are best read alongside plots of the data; they facilitate extracting meaningful insights and provide a deeper understanding of the data (Skiena, 2017). Sorting the data by profit in ascending order aimed to identify the orders with the lowest profit, locating potential areas where the business might be experiencing losses or where profit margins are slim. This analysis offers insights for implementing cost-reduction measures, adjusting pricing strategies, or optimising products for improved profitability.

Following this, sorting the data frame by quantity in descending order identified the orders with the highest quantities sold. This step targeted popular products or categories with high demand, helping to recognise best-selling items, improve inventory management practices, and capitalise on high-demand products to drive sales and profitability. Furthermore, filtering the data frame to include only rows where the quantity column has a value of 15 or more provided a focused analysis of orders with large quantities, aiming to identify bulk purchases or high-volume orders and reveal customer behaviour patterns.
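
The sorting and filtering steps described above can be sketched with pandas as follows. The rows and the column names 'Sub-category', 'Quantity' and 'Profit' are assumptions for illustration.

```python
import pandas as pd

# Hypothetical orders (illustrative values only).
orders = pd.DataFrame(
    {
        "Sub-category": ["Tables", "Binders", "Copiers", "Phones"],
        "Quantity": [3, 16, 2, 15],
        "Profit": [-120.0, 8.0, 400.0, 60.0],
    }
)

# Lowest profit first, to surface loss-making orders.
lowest_profit = orders.sort_values("Profit", ascending=True)

# Largest quantity first, to surface high-demand items.
highest_quantity = orders.sort_values("Quantity", ascending=False)

# Orders of 15 units or more.
bulk_orders = orders[orders["Quantity"] >= 15]

print(lowest_profit.head(), highest_quantity.head(), bulk_orders, sep="\n\n")
```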

## Plotting the data

### Profitability Analysis by Discount Level

Visualising data through scatter graphs reveals underlying patterns, trends, and relationships between variables. Scatter graphs are especially effective in representing the relationship between two continuous variables (Sharda et al., 2020). The graph presented shows a positive correlation between discount and profit, with higher discounts generally coinciding with higher profits. That said, notable variation exists, indicating that factors beyond this dataset also likely influence profit.
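
The direction and strength of a relationship seen in a scatter graph can be checked numerically with the Pearson correlation coefficient. The sketch below uses five invented points, so its value says nothing about the real Superstore data.

```python
import pandas as pd

# Hypothetical (discount, profit) pairs, for illustration only.
points = pd.DataFrame(
    {
        "Discount": [0.0, 0.1, 0.2, 0.3, 0.4],
        "Profit": [10.0, 14.0, 9.0, 20.0, 18.0],
    }
)

# Series.corr() computes the Pearson correlation by default:
# +1 is a perfect positive linear relationship, -1 a perfect negative one.
r = points["Discount"].corr(points["Profit"])
print(f"Pearson r = {r:.3f}")
```

A value of r well below 1 in magnitude is consistent with the "notable variation" described above: the points follow a trend but scatter widely around it.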

### Profitability, Losses and Order Quantity

Bar charts are the most basic yet effective method of visualising data (Sharda et al., 2020). Upon analysing all charts simultaneously, insights into regional performance regarding profitability, losses, and order quantity are gathered. Due to the extensive database, which included a substantial number of states, grouping by region was preferred over grouping by state. This decision simplified the analysis by focusing on broader geographical trends.

The primary bar chart, titled 'Profitable Sales Overview by Region,' illustrates the four regions—East, West, South, and Central—on the X-axis, with the Y-axis depicting the total number of profitable sales within each region. The West region has the highest number of profitable sales, closely followed by the East, with Central and South falling behind.

The second chart focuses on the 'Negative Sales Overview by Region.' Similarly, the X-axis represents the four regions, while the Y-axis presents the total number of negative sales (those resulting in a loss) within each region. This chart highlights regions with the highest number of negative sales, helping to identify areas requiring improvement. Analysing this chart alongside the first reveals that the East region is highly profitable yet also records a significant number of negative sales. Possible explanations include marketing strategies such as frequent promotions driving sales, or poor inventory management leading to negative sales due to overstocking or understocking.

The third chart, 'High Order Volume by Region,' shows regions with a higher concentration of large orders. This information is useful for understanding regional buying patterns or potential bulk order trends. As with the other charts, the X-axis represents the four regions, while the Y-axis displays the total quantity of orders exceeding a certain number of units per region. This chart spotlights regions with the highest order volume.
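
The counts behind the three regional charts could be produced as in this sketch. The rows are invented, and the 15-unit threshold for a "high volume" order is an assumption carried over from the earlier quantity filter.

```python
import pandas as pd

# Hypothetical orders (illustrative values only).
orders = pd.DataFrame(
    {
        "Region": ["West", "West", "East", "East", "South", "Central", "East"],
        "Profit": [30.0, 12.0, 25.0, -8.0, 5.0, -3.0, -15.0],
        "Quantity": [2, 15, 4, 16, 1, 3, 20],
    }
)

# Count the orders per region that satisfy each condition.
profitable = orders[orders["Profit"] > 0].groupby("Region").size()
negative = orders[orders["Profit"] < 0].groupby("Region").size()
high_volume = orders[orders["Quantity"] >= 15].groupby("Region").size()

# Align the three counts on region; a region absent from a group counts as 0.
overview = (
    pd.DataFrame(
        {"profitable": profitable, "negative": negative, "high_volume": high_volume}
    )
    .fillna(0)
    .astype(int)
)
print(overview)
```

With matplotlib installed, each column can be drawn as a bar chart, for example with `overview["profitable"].plot.bar()`.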

### Profitable Sales Breakdown by Product Category

The bar chart illustrates the total number of profitable sales categorised into Office Supplies, Furniture, and Technology. The X-axis represents these three categories, while the Y-axis denotes the total number of profitable sales within each category. Office Supplies emerges as the category with the highest number of profitable sales, followed by Technology in a mid-range position, with Furniture recording the fewest. Several factors could explain why Office Supplies leads in profitable sales: competitive pricing strategies driving increased sales, a broader product range, or higher profit margins associated with office supplies.

Analysing these visualisations collectively allows a comprehensive understanding of the dynamics between discounting strategies, regional performance, and product category preferences that influence profitability. The positive correlation between discount and profit suggests the significance of discounting strategies in driving profitability. However, the substantial variation implies that factors beyond discounts influence profit levels. The regional performance analysis from the bar charts reveals profitability variations across different regions, with the West region exhibiting the best performance and the Central and East regions facing challenges with negative sales. Furthermore, analysing the profitable sales by product category demonstrates that the 'Office Supplies' category leads in profitable sales, followed by 'Technology' and 'Furniture.' This indicates that specific product categories contribute more to overall profitability than others. Understanding regional variations in order volume, as indicated by 'High Order Volume by Region,' can further inform marketing and inventory management strategies to optimise profitability.

## Modelling the Data

### Simple Linear Regression Model: Predicting Profit based on Discount

### Multiple Linear Regression Model: Actual vs Predicted Profit Comparison by Discount

### K-Means Clustering Model

Three modelling techniques were employed to analyse the data: simple linear regression, multiple linear regression, and K-means clustering. Simple linear regression was chosen to model the relationship between a continuous target variable and a single predictor variable, assuming a linear relationship (Skiena, 2017). Following this, multiple linear regression, an extension of simple linear regression, was utilised to model the relationship between a continuous target variable and multiple predictor variables. This technique allows for exploring how various factors simultaneously influence the target variable, providing insights into their interactions (Uyanik and Güler, 2013). Lastly, K-means clustering, an unsupervised learning algorithm, was employed for exploratory data analysis and pattern discovery. Its scalability and efficiency make it particularly appealing for data analytics (Han et al., 2012).

The simple linear regression model produced a coefficient of 0, suggesting no linear effect of discount on profit; this may be due to a more complex relationship or data issues, such as missing values. The multiple regression analysis aimed to capture the combined effects of discount, sales, and quantity on profit, providing insights into how these variables influence profit prediction. Despite accounting for these factors, the model showed a discrepancy between predicted and actual profit and revealed an unexpected, notable decline in profitability. This highlights the challenges in maintaining profit margins under discounting strategies.

The regression models were evaluated using the R-squared and adjusted R-squared values. The results suggested a positive but weak linear relationship between discount, sales, quantity, and profit, demonstrating that while these variables can somewhat predict profit, other unaccounted factors likely influence profit variability.
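
The sketch below shows how a simple and a multiple linear regression can be fitted and scored with R-squared and adjusted R-squared. It uses ordinary least squares via NumPy and six invented observations; the report does not state which library was used, and `fit_ols` is a hypothetical helper, not the report's code.

```python
import numpy as np

# Hypothetical observations (assumed for illustration only).
discount = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
sales = np.array([120.0, 80.0, 200.0, 150.0, 90.0, 60.0])
quantity = np.array([2.0, 1.0, 5.0, 4.0, 3.0, 2.0])
profit = np.array([30.0, 18.0, 40.0, 20.0, 5.0, -10.0])


def fit_ols(predictors, target):
    """Least-squares fit with an intercept; returns (coefficients, R^2, adjusted R^2)."""
    # Prepend a column of ones so the first coefficient is the intercept.
    X = np.column_stack([np.ones(len(target)), predictors])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    ss_res = float(residuals @ residuals)                    # unexplained variation
    ss_tot = float(((target - target.mean()) ** 2).sum())    # total variation
    r2 = 1.0 - ss_res / ss_tot
    n, p = len(target), X.shape[1] - 1                       # observations, predictors
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)        # penalises extra predictors
    return coef, r2, adj_r2


# Simple regression: profit on discount only.
simple_coef, simple_r2, simple_adj = fit_ols(discount, profit)

# Multiple regression: profit on discount, sales and quantity.
multi_coef, multi_r2, multi_adj = fit_ols(
    np.column_stack([discount, sales, quantity]), profit
)

print(f"simple:   R^2={simple_r2:.3f}, adjusted={simple_adj:.3f}")
print(f"multiple: R^2={multi_r2:.3f}, adjusted={multi_adj:.3f}")
```

R-squared never decreases when predictors are added, which is why the adjusted value is the fairer basis for comparing the simple and multiple models.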

Lastly, the K-means clustering model grouped the dataset by sales and profit. The scatter plot displays products sold, with sales represented on the X-axis and profit on the Y-axis. The clustering algorithm has grouped these products into distinct clusters, with each centroid (cluster centre) shown by a black point and the separation between clusters indicating differences in performance.

Four clusters stand out. The low-profit, low-sales cluster (blue), which includes machines, exhibits low sales and low profitability; these may be underperforming or niche products that contribute minimally to overall profitability. The mid-range-profit, mid-range-sales cluster (green), which includes binders and phones, represents products with moderate sales and profitability and stable, consistent performance; these products could be made more efficient to generate higher profit. The moderate-profit, low-sales cluster (purple), which includes tables, shows mid-range profit despite lower sales volumes, which could be attributed to factors such as seasonal trends or new products in the early stages of gaining traction. Finally, the high-profit, high-sales cluster (pink), which includes copiers and chairs, is characterised by both high profit and high sales and represents the most successful products. These products drive significant revenue and are likely to be key contributors to the company's overall profitability; capitalising on them to sustain this growth could therefore be valuable.
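
To show how K-means arrives at clusters and centroids, here is a minimal NumPy sketch of the standard Lloyd's algorithm. It is an illustration on six invented (sales, profit) points with two clusters, not the report's model; in practice a library implementation such as scikit-learn's `KMeans` would normally be used, and `kmeans` below is a hypothetical helper.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if its cluster is empty.
        new_centroids = np.array(
            [
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ]
        )
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilised
        centroids = new_centroids
    return centroids, labels


# Hypothetical (sales, profit) pairs forming two obvious groups.
sales_profit = np.array(
    [
        [100.0, 10.0],
        [120.0, 12.0],
        [110.0, 8.0],
        [900.0, 300.0],
        [950.0, 320.0],
        [1000.0, 310.0],
    ]
)
centroids, labels = kmeans(sales_profit, k=2)
print(centroids)
print(labels)
```

Because K-means relies on Euclidean distance, features on very different scales (such as sales and profit) are often standardised first so that neither dominates the grouping.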

## Conclusion

To conclude, valuable insights have been gathered through a comprehensive analysis of the 'Superstore.csv' dataset to guide strategic decision-making and improve profitability. This report has provided a deeper understanding of the relationships between discounts, sales, quantity, and profit by employing various analytical techniques, including linear regression and K-means clustering. However, it must be recognised that while these analyses offer valuable insights, other unaccounted factors may influence results. Therefore, further exploration and analysis of additional variables are necessary for a more holistic understanding and prediction of profitability in this dataset.

### Recommendations

- Invest in further data collection and analysis to identify additional factors influencing profitability, such as customer demographics and market trends.
- Adopt advanced analytics techniques, including sentiment analysis and predictive modelling, to forecast market trends and changes in consumer attitudes accurately.
- Enhance inventory management practices by leveraging clustering insights to identify high-demand products and seasonal trends, and utilise inventory management tools to streamline procurement and distribution processes.
- Apply insights from the regression analysis to implement dynamic pricing strategies, benefiting from machine learning algorithms to maximise revenue and profitability whilst remaining competitive in the market.

## References

Cui, Y. et al. (2022) 'The influence of big data analytic capabilities building and education on business innovation', Frontiers in Psychology. DOI: 10.3389/fpsyg.2022.999944.

Han, J., Kamber, M. and Pei, J. (2012) Data Mining: Concepts and Techniques. 3rd Edition, Morgan Kaufmann.

Sharda, R. et al. (2020) Systems for Analytics, Data Science, and Artificial Intelligence: Systems for Decision Support, Global Edition. 11th Edition, Pearson.

Skiena, S.S. (2017) The Data Science Design Manual. Cham: Springer Nature.

Uyanik, K.G. and Güler, N. (2013) 'A Study on Multiple Linear Regression Analysis', Procedia – Social and Behavioral Sciences, 106, pp. 234-240.