C11BD Big Data Analytics: Coursework 2

Esme Irving H00321992

Word Count: 2064

18 March 2024

Introduction

In the current business landscape, data analytics has become vital for companies striving to remain competitive across diverse markets and industries (Cui et al., 2022). This report analyses the 'Superstore.csv' dataset in depth to identify ways of enhancing profitability. The overarching goal is to unearth insights that can guide strategic decision-making and improve the company's financial performance.

The analysis begins by importing the dataset and re-labelling columns to ensure clarity and coherence. Data cleaning follows, in which the data is checked for outliers and implausible values that could distort the analysis. Addressing data gaps is prioritised, as inaccurate information can obscure comprehension and lead to inconsistent conclusions. These gaps are addressed through exploratory techniques, establishing a robust analytical framework.

Exploration continues with summary statistics, delving into the dataset's central tendencies, distributions, and variability. The report then transitions to visualising the data, employing various plotting techniques to highlight patterns, trends, and relationships.

Advanced analytical models, namely simple linear regression, multiple linear regression, and K-means clustering, are then applied to uncover insights and predictive relationships within the dataset. Identifying the drivers of profitability and recognising areas with room for improvement will present the company with actionable insight and guide informed decision-making to maximise its potential.

Ultimately, this report aims to provide the client with the insights necessary to navigate the intricacies of the modern business landscape and drive greater profitability.

Importing the Data

# Importing the pandas library as pd for data analysis
import pandas as pd
superstore = pd.read_csv("dataset_Superstore.csv")
# Checking the type of the variable 'superstore' to ensure data has been read successfully into the DataFrame
type(superstore)

# Displaying the contents of the DataFrame
superstore

# Generating summary statistics for insights
superstore.describe()

The first step in exploring the company's dataset is to confirm that it meets the company's expectations and requirements. The '.describe()' function provides a summary of the dataset's numeric columns, revealing key statistical metrics including the count, mean, standard deviation, minimum, maximum, and quartiles.
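As a complement to the default summary, the sketch below shows how text columns and missing values could be inspected as well. The small `toy` frame and its values are invented for illustration and are not taken from the Superstore data.

```python
import pandas as pd

# Invented rows for illustration only; None marks a missing entry
toy = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", None],
    "Sales": [10.0, None, 30.0],
})

# By default, describe() summarises only the numeric columns of a mixed frame
numeric_summary = toy.describe()

# include="all" adds count, unique, top and freq for the text columns
full_summary = toy.describe(include="all")

# Counting the missing entries in each column
missing = toy.isna().sum()

print(numeric_summary)
print(full_summary)
print(missing)
```

Checking the missing-value counts at this stage makes it easier to decide later whether rows should be dropped or filled.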

Re-labelling the Data

# Re-labelling the columns of the DataFrame for accuracy
superstore.columns = [
    "Row ID", "Order ID", "Order Date", "Ship Date", "Ship Mode",
    "Customer ID", "Customer Name", "Customer_no", "Segment", "Segment_no",
    "Country", "City", "State", "State_no", "Postal Code",
    "Region", "Region_no", "Product ID", "Category", "Category_no",
    "Sub-Category", "Sub-Category_no", "Product Name", "Product Name_no",
    "Sales", "Quantity", "Discount", "Profit", "Returned",
]

# Displaying concise summary information
superstore.info()

# Extracting unique values from the "Category" column
pd.unique(superstore["Category"])

# Extracting unique values from the "Sub-Category" column
pd.unique(superstore["Sub-Category"])

# Extracting unique values from the "Segment" column
pd.unique(superstore["Segment"])

# Extracting unique values from the "State" column
pd.unique(superstore["State"])

# Extracting unique values from the "Region" column
pd.unique(superstore["Region"])

Ensuring clarity, consistency, and relevance in the dataset was vital. Hence, the columns of the 'Superstore.csv' dataset were re-labelled to align with the overarching objective. The '.info()' function then provided an overview of the dataset (column names, data types, and non-null counts), revealing any inconsistencies that could affect subsequent analysis. Following this, the 'pd.unique()' function was used to list the distinct values within selected columns, identifying inconsistencies and providing insight into the categorical variables in the dataset. The focus was directed towards the 'Category', 'Sub-Category', 'Segment', 'State', and 'Region' columns, which were deemed important in conveying critical company information.
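The check on categorical values described above can be sketched as follows. This is a minimal illustration that assumes inconsistencies take the form of stray whitespace or mixed capitalisation, which the report does not state; the `toy` frame and the `tidy_labels` helper are hypothetical.

```python
import pandas as pd

# Invented labels for illustration only; the messy entries are deliberate
toy = pd.DataFrame({
    "Category": ["Furniture", "furniture ", "Technology", "Office Supplies"],
    "Region": ["West", "East", "East", " west"],
})

def tidy_labels(series: pd.Series) -> pd.Series:
    """Strip surrounding whitespace and apply title case to text labels."""
    return series.str.strip().str.title()

for column in ["Category", "Region"]:
    toy[column] = tidy_labels(toy[column])

# value_counts() lists each distinct label with its frequency,
# which makes near-duplicate spellings easy to spot
category_counts = toy["Category"].value_counts()
print(category_counts)
print(sorted(toy["Region"].unique()))
```

Compared with 'pd.unique()', 'value_counts()' also shows how often each label occurs, so a rare misspelling stands out.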

Cleaning the Data

# Filtering the 'superstore' DataFrame to include only rows where the 'Sales' column has values greater than or equal to 1,
# removing data entry errors or outliers
Sales_Cleaned = superstore[superstore['Sales'] >= 1]
# Generating summary statistics for cleaned 'Sales' data insights
Sales_Cleaned.describe()

# Filtering the 'Sales_Cleaned' DataFrame to include only rows where the discount column has values between 0 and 1,
# to ensure discounts are within an appropriate range
Clean_superstore = Sales_Cleaned[(Sales_Cleaned['Discount'] >= 0) & (Sales_Cleaned['Discount'] <= 1)]
# Generating summary statistics for cleaned 'Sales' data insights
Clean_superstore.describe()

To ensure data reliability, the dataset was first cleaned by removing rows with a 'Sales' value below 1, which were treated as data entry errors or outliers that could skew the analysis. Summary statistics were then generated to examine the characteristics of the refined dataset. The next step filtered out all discounts outside the range 0 to 1. Discounts are expressed here as a proportion of the price (a percentage divided by 100), so values outside this range are likely to be outliers or data entry errors.
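The two filters described above can be combined into a single step that also reports how many rows were removed. The sketch below uses invented values in place of the real data, and the variable names are illustrative.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the Superstore data
toy = pd.DataFrame({
    "Sales": [0.5, 25.0, 310.0, 12.0, 48.0],
    "Discount": [0.0, 0.2, 1.5, -0.1, 0.8],
})

# Boolean masks for each rule; between() is inclusive at both ends by default
valid_sales = toy["Sales"] >= 1
valid_discount = toy["Discount"].between(0, 1)

# Keeping only the rows that satisfy both rules
cleaned = toy[valid_sales & valid_discount]
removed = len(toy) - len(cleaned)

print(f"Kept {len(cleaned)} of {len(toy)} rows; removed {removed}")
```

Reporting the number of removed rows helps to judge whether a filter is trimming a few errors or discarding a meaningful share of the data.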

Gap Values

# Specifying required column
Column = 'Sales'
# Using describe() to find summary statistics
summary_statistics = Clean_superstore[Column].describe()
# Extracting specific summary statistics
median_value = summary_statistics['50%']
min_value = summary_statistics['min']
max_value = summary_statistics['max']
q1_value = summary_statistics['25%']
q3_value = summary_statistics['75%']
# Computing the gaps between quartiles and other values
gap_q1_min = q1_value - min_value
gap_median_q1 = median_value - q1_value
gap_max_q3 = max_value - q3_value
gap_q3_median = q3_value - median_value
# Print summary statistics
print(f"Median: {median_value}")
print(f"Minimum: {min_value}")
print(f"Maximum: {max_value}")
print(f"1st Quartile (Q1): {q1_value}")
print(f"3rd Quartile (Q3): {q3_value}")
# Print the gaps
print(f"Gap between 1st Quartile and min value: {gap_q1_min}")
print(f"Gap between Median and 1st Quartile (Q1) value: {gap_median_q1}")
print(f"Gap between 3rd Quartile (Q3) and Median: {gap_q3_median}")
print(f"Gap between Max and 3rd Quartile (Q3): {gap_max_q3}")

# Specifying required column
Column = 'Quantity'
# Using describe() to find summary statistics
summary_statistics = Clean_superstore[Column].describe()
# Extracting specific summary statistics
median_value = summary_statistics['50%']
min_value = summary_statistics['min']
max_value = summary_statistics['max']
q1_value = summary_statistics['25%']
q3_value = summary_statistics['75%']
# Computing the gaps between quartiles and other values
gap_q1_min = q1_value - min_value
gap_median_q1 = median_value - q1_value
gap_max_q3 = max_value - q3_value
gap_q3_median = q3_value - median_value
# Print summary statistics
print(f"Median: {median_value}")
print(f"Minimum: {min_value}")
print(f"Maximum: {max_value}")
print(f"1st Quartile (Q1): {q1_value}")
print(f"3rd Quartile (Q3): {q3_value}")
# Print the gaps
print(f"Gap between 1st Quartile and min value: {gap_q1_min}")
print(f"Gap between Median and 1st Quartile (Q1) value: {gap_median_q1}")
print(f"Gap between 3rd Quartile (Q3) and Median: {gap_q3_median}")
print(f"Gap between Max and 3rd Quartile (Q3): {gap_max_q3}")

# Specifying required column
Column = 'Discount'
# Using describe() to find summary statistics
summary_statistics = Clean_superstore[Column].describe()
# Extracting specific summary statistics
median_value = summary_statistics['50%']
min_value = summary_statistics['min']
max_value = summary_statistics['max']
q1_value = summary_statistics['25%']
q3_value = summary_statistics['75%']
# Computing the gaps between quartiles and other values
gap_q1_min = q1_value - min_value
gap_median_q1 = median_value - q1_value
gap_max_q3 = max_value - q3_value
gap_q3_median = q3_value - median_value
# Print summary statistics
print(f"Median: {median_value}")
print(f"Minimum: {min_value}")
print(f"Maximum: {max_value}")
print(f"1st Quartile (Q1): {q1_value}")
print(f"3rd Quartile (Q3): {q3_value}")
# Print the gaps
print(f"Gap between 1st Quartile and min value: {gap_q1_min}")
print(f"Gap between Median and 1st Quartile (Q1) value: {gap_median_q1}")
print(f"Gap between 3rd Quartile (Q3) and Median: {gap_q3_median}")
print(f"Gap between Max and 3rd Quartile (Q3): {gap_max_q3}")

# Specifying required column
Column = 'Profit'
# Using describe() to find summary statistics
summary_statistics = Clean_superstore[Column].describe()
# Extracting specific summary statistics
median_value = summary_statistics['50%']
min_value = summary_statistics['min']
max_value = summary_statistics['max']
q1_value = summary_statistics['25%']
q3_value = summary_statistics['75%']
# Computing the gaps between quartiles and other values
gap_q1_min = q1_value - min_value
gap_median_q1 = median_value - q1_value
gap_max_q3 = max_value - q3_value
gap_q3_median = q3_value - median_value
# Print summary statistics
print(f"Median: {median_value}")
print(f"Minimum: {min_value}")
print(f"Maximum: {max_value}")
print(f"1st Quartile (Q1): {q1_value}")
print(f"3rd Quartile (Q3): {q3_value}")
# Print the gaps
print(f"Gap between 1st Quartile and min value: {gap_q1_min}")
print(f"Gap between Median and 1st Quartile (Q1) value: {gap_median_q1}")
print(f"Gap between 3rd Quartile (Q3) and Median: {gap_q3_median}")
print(f"Gap between Max and 3rd Quartile (Q3): {gap_max_q3}")

Examining the distribution and spread of the data was imperative to understanding the dataset's characteristics. This involved calculating the "gap" values between the minimum, first quartile (Q1), median, third quartile (Q3), and maximum of each variable. A gap that is much larger at one end of the distribution than at the other signals skew or extreme values. Examining the gaps in profit made it possible to evaluate how widely profit varies across orders, identifying prospective areas for cost optimisation.
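Because the same gap calculation is applied to 'Sales', 'Quantity', 'Discount', and 'Profit', it could be expressed once as a function and reused. The sketch below is one possible form; the name `quartile_gaps` and the toy values are illustrative and do not come from the report.

```python
import pandas as pd

def quartile_gaps(frame: pd.DataFrame, column: str) -> dict:
    """Return the gaps between min, Q1, median, Q3 and max for one column."""
    stats = frame[column].describe()
    return {
        "q1_minus_min": stats["25%"] - stats["min"],
        "median_minus_q1": stats["50%"] - stats["25%"],
        "q3_minus_median": stats["75%"] - stats["50%"],
        "max_minus_q3": stats["max"] - stats["75%"],
    }

# Invented values for illustration only; the 100.0 is a deliberate extreme value
toy = pd.DataFrame({
    "Sales": [1.0, 2.0, 3.0, 4.0, 100.0],
    "Profit": [-10.0, 0.0, 5.0, 10.0, 15.0],
})

# One call per column replaces four near-identical code cells
gaps = {column: quartile_gaps(toy, column) for column in ["Sales", "Profit"]}
for column, values in gaps.items():
    print(column, values)
```

In the toy 'Sales' column, the gap between the maximum and Q3 is far larger than the other three gaps, which is the pattern that points to an extreme value in the upper tail.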

Summary Statistics

# Extracting a subset of columns from the 'Clean_superstore' DataFrame
Summary_Figures = Clean_superstore[["Order ID", "Sales", "Quantity", "Discount", "Profit"]]
# Sorting the 'Summary_Figures' based on the 'Profit' column in ascending order
NewProfit = Summary_Figures.sort_values(by="Profit", ascending=True)
# Print the first 50 rows
print(NewProfit.head(50))

# Extracting a subset of columns from the 'Clean_superstore' DataFrame
Summary_Figures = Clean_superstore[["Order ID", "Sales", "Quantity", "Discount", "Profit"]]
# Sorting the 'Summary_Figures' based on the 'Quantity' column in descending order
NewProfit = Summary_Figures.sort_values(by="Quantity", ascending=False)
# Print the first 50 rows
print(NewProfit.head(50))

# Filtering the 'Clean_superstore' DataFrame to include only rows where 'Quantity' has values greater than,
# or equal to 15 and selecting the desired columns in the same step
LargeQuantities = Clean_superstore.loc[Clean_superstore["Quantity"] >= 15, ["Order ID", "Product Name", "Category", "Sales", "Quantity", "Discount", "Profit"]]
# Print the large quantities
print("LargeQuantities", LargeQuantities)

Summary statistics are pivotal in exploratory data analysis and decision-making; they facilitate extracting meaningful insights and provide a deeper understanding of the data, although Anscombe’s quartet illustrates that they are best read alongside visualisation (Skiena, 2017). Sorting the data by profit in ascending order aimed to identify the orders with the lowest profit, locating potential areas where the business might be experiencing losses or where profit margins are slim. This analysis offers insights for implementing cost-reduction measures, adjusting pricing strategies, or optimising the product range for improved profitability.

Next, sorting the data frame by quantity in descending order identified the orders with the highest quantities sold. This step targeted popular products or categories with high demand, helping to recognise best-selling items, improve inventory management practices, and capitalise on high-demand products to drive sales and profitability. Furthermore, filtering the data frame to include only rows where the quantity is 15 or more provided a focused analysis of large orders, aiming to identify bulk purchases and reveal customer behaviour patterns.
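The sorting and filtering steps above can also be written more compactly. The sketch below uses 'nsmallest()' in place of a full sort followed by 'head()', on an invented five-row frame; the order IDs and values are hypothetical.

```python
import pandas as pd

# Invented orders for illustration only
toy = pd.DataFrame({
    "Order ID": ["A-1", "A-2", "A-3", "A-4", "A-5"],
    "Quantity": [3, 20, 15, 1, 7],
    "Profit": [12.5, -40.0, 8.0, -2.5, 30.0],
})

# The two least profitable orders, returned in ascending order of profit
lowest_profit = toy.nsmallest(2, "Profit")

# Orders of 15 units or more, largest first
bulk_orders = toy[toy["Quantity"] >= 15].sort_values("Quantity", ascending=False)

print(lowest_profit)
print(bulk_orders)
```

Both approaches give the same rows; 'nsmallest()' simply avoids sorting the whole frame when only the extremes are needed.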

Plotting the Data

Profitability Analysis by Discount Level

# Importing libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Creating figure
fig, ax = plt.subplots()
# Creating scatter plot with axes colour points
ax.scatter(Clean_superstore['Discount'], Clean_superstore['Profit'], color='orange')
# Setting colour for x and y axes
ax.spines['bottom'].set_color('blue')
ax.spines['left'].set_color('green')
# Set labels and titles
plt.xlabel('Discount(%)', color='blue')
plt.ylabel('Profit($)', color='green')
plt.title("Profitability Analysis by Discount Level")
# Show plot
plt.show()

Visualising data through scatter graphs reveals underlying patterns, trends, and relationships between variables. Scatter graphs are especially effective in representing the relationship between two continuous variables (Sharda et al., 2020). The graph shows a positive correlation between discount and profit, with higher discounts generally coinciding with higher profits. That said, notable variation exists, indicating that factors other than discount also likely influence profit.
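A visual impression of correlation can be checked numerically with the Pearson coefficient. The sketch below shows the call on invented values only; it does not reproduce or confirm the result for the Superstore data.

```python
import pandas as pd

# Invented values for illustration only; they are not the Superstore figures
toy = pd.DataFrame({
    "Discount": [0.0, 0.1, 0.2, 0.3, 0.4],
    "Profit": [5.0, 9.0, 14.0, 22.0, 24.0],
})

# Series.corr() returns the Pearson correlation coefficient by default,
# a value between -1 (perfect negative) and +1 (perfect positive)
r = toy["Discount"].corr(toy["Profit"])
print(f"Pearson r between Discount and Profit: {r:.3f}")
```

A value close to zero would indicate little or no linear relationship, even where a scatter graph appears to suggest a trend.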

Profitability, Losses and Order Quantity

# Selecting columns using .loc
GraphData = Clean_superstore.loc[:, ["State", "Region", "Category", "Sub-Category", "Sales", "Quantity", "Discount", "Profit"]]
# Print the data
print(GraphData)

# Filtering the DataFrame to only include rows where Profit is greater than 0
PositiveProfit = GraphData[(GraphData['Profit'] > 0)]
# Print the first 20 rows
print(PositiveProfit.head(20))

# Grouping by 'Region' and counting the number of profitable sales
RegionPositive = PositiveProfit.groupby('Region', sort=False, as_index=False)
RegionPositiveSum = RegionPositive['Sales'].count()
# Creating bar plot
plt.bar(RegionPositiveSum['Region'], RegionPositiveSum['Sales'], color='orange')
# Setting axes and title (the Y-axis is a count of sales, not a dollar amount)
plt.xlabel('Region', color='blue')
plt.ylabel('Number of Sales', color='green')
plt.title('Profitable Sales Overview by Region')
# Setting colour of X and Y axes
plt.gca().spines['bottom'].set_color('blue')
plt.gca().spines['left'].set_color('green')
# Show plot
plt.show()

# Selecting rows where profit is negative
NegativeProfit = GraphData[(GraphData['Profit'] < 0)]
# Print first 10 rows
print(NegativeProfit.head(10))

# Grouping by 'Region' and counting the number of loss-making sales
RegionNegative = NegativeProfit.groupby('Region', sort=False, as_index=False)
RegionNegativeSum = RegionNegative['Sales'].count()
# Creating bar plot
plt.bar(RegionNegativeSum['Region'], RegionNegativeSum['Sales'], color='orange')
# Setting axes and title (the Y-axis is a count of sales, not a dollar amount)
plt.xlabel('Region', color='blue')
plt.ylabel('Number of Sales', color='green')
plt.title('Negative Sales Overview by Region')
# Setting colour of X and Y axes
plt.gca().spines['bottom'].set_color('blue')
plt.gca().spines['left'].set_color('green')
# Show plot
plt.show()

# Selecting rows where quantity is greater than 999
LargeQuantity = GraphData[(GraphData['Quantity'] > 999)]
# Print the first 10 rows
print(LargeQuantity.head(10))

# Grouping by 'Region' and summing the quantities
RegionQuantity = LargeQuantity.groupby('Region', sort=False, as_index=False)
RegionQuantitySum = RegionQuantity['Quantity'].sum()
# Creating a bar plot
plt.bar(RegionQuantitySum['Region'], RegionQuantitySum['Quantity'], color='orange')
# Setting labels and titles
plt.xlabel('Region', color='blue')
plt.ylabel('Quantity(Qty)', color='green')
plt.title('High Order Volume by Region')
# Setting the colour of X and Y axes
plt.gca().spines['bottom'].set_color('blue')
plt.gca().spines['left'].set_color('green')
# Show plot
plt.show()

Bar charts are the most basic yet effective method of visualising data (Sharda et al., 2020). Analysing the three charts together provides insight into regional performance regarding profitability, losses, and order quantity. Because the dataset is extensive and includes a substantial number of states, grouping by region was preferred over grouping by state. This decision simplified the analysis by focusing on broader geographical trends.

The first bar chart, titled 'Profitable Sales Overview by Region', shows the four regions (East, West, South, and Central) on the X-axis, with the Y-axis depicting the total number of profitable sales within each region. The West region has the highest number of profitable sales, closely followed by the East, with Central and South falling behind.

The second chart is the 'Negative Sales Overview by Region'. Again, the X-axis represents the four regions, while the Y-axis presents the total number of negative sales (those resulting in a loss) within each region. This chart highlights the regions with the highest number of negative sales, helping to identify areas requiring improvement. Analysing this chart alongside the first reveals that the East region is highly profitable yet also records a significant number of negative sales. Possible explanations include marketing strategies such as frequent promotions driving sales, or poor inventory management leading to losses through overstocking or understocking.

The third chart, 'High Order Volume by Region', shows the regions with a higher concentration of large orders. This information is useful for understanding regional buying patterns or potential bulk order trends. As with the other charts, the X-axis represents the four regions, while the Y-axis displays the total quantity across orders of more than 999 units in each region. This chart spotlights the regions with the highest order volume.
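The profitable and loss-making counts discussed above can also be placed side by side in one table, which makes a region such as the East easier to assess. The sketch below is a minimal illustration on invented rows; the 'Outcome' and 'Loss share' column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only; not taken from the Superstore data
toy = pd.DataFrame({
    "Region": ["West", "West", "East", "East", "East", "South"],
    "Profit": [10.0, -5.0, 20.0, -1.0, -3.0, 7.0],
})

# Labelling each sale by the sign of its profit; zero-profit rows get their own label
toy["Outcome"] = np.sign(toy["Profit"]).map(
    {1.0: "Profitable", -1.0: "Loss-making", 0.0: "Break-even"}
)

# Counting sales per region and outcome in one table
counts = pd.crosstab(toy["Region"], toy["Outcome"])

# Share of each region's sales that lost money
counts["Loss share"] = counts["Loss-making"] / counts.sum(axis=1)
print(counts)
```

Comparing the loss share rather than the raw counts avoids penalising a region simply for handling more orders.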

Profitable Sales Breakdown by Product Category

# Grouping by 'Category' and counting the number of profitable sales
CategoryPositive = PositiveProfit.groupby('Category', sort=False, as_index=False)
CategoryPositiveSum = CategoryPositive['Sales'].count()
# Creating a bar plot
bars = plt.bar(CategoryPositiveSum['Category'], CategoryPositiveSum['Sales'])
# Define colours for each bar
bars[0].set_color('red')
bars[1].set_color('yellow')
bars[2].set_color('grey')
# Setting labels and title (the Y-axis is a count of sales, not a dollar amount)
plt.xlabel('Category', color='blue')
plt.ylabel('Number of Sales', color='green')
plt.title('Profitable Sales Breakdown by Product Category')
# Setting the colour of X and Y axes
plt.gca().spines['bottom'].set_color('blue')
plt.gca().spines['left'].set_color('green')
# Show plot
plt.show()

The bar chart illustrates the total number of profitable sales in each of three categories: Office Supplies, Furniture, and Technology. The X-axis represents these three categories, while the Y-axis denotes the total number of profitable sales within each category. Office Supplies emerges as the category with the highest number of profitable sales, followed by Technology in a mid-range position, with Furniture presenting the fewest. Several explanations could account for the lead of Office Supplies: competitive pricing strategies driving increased sales, a broader product range, or higher profit margins associated with office supplies.

Analysing these visualisations collectively allows a comprehensive understanding of how discounting strategies, regional performance, and product category preferences influence profitability. The positive correlation between discount and profit suggests that discounting strategies matter in driving profitability. However, the substantial variation implies that factors beyond discounts influence profit levels. The regional analysis from the bar charts reveals that profitability varies across regions, with the West region exhibiting the best performance and the Central and East regions facing challenges with negative sales. Furthermore, the breakdown of profitable sales by product category shows that 'Office Supplies' leads, followed by 'Technology' and 'Furniture', indicating that some product categories contribute more to overall profitability than others. Understanding regional variations in order volume, as indicated by 'High Order Volume by Region', can further inform marketing and inventory management strategies to optimise profitability.

Modelling the Data

Simple Linear Regression Model: Predicting Profit based on Discount

# Importing libraries
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Extracting features and variables
X = Clean_superstore[['Discount']]
Y = Clean_superstore['Profit']
# Creating and fitting linear regression model
model = LinearRegression()
model.fit(X, Y)
# Predicting the target variable
Y_pred = model.predict(X)
# Plotting the data points and the regression line
plt.scatter(X, Y, label='Data')
plt.plot(X, Y_pred, color='red', label='Regression Line')
plt.xlabel('Discount', color='blue')
plt.ylabel('Profit($)', color='green')
plt.title('Linear Regression: Predicting Profit based on Discount')
plt.legend()
# Setting the colour of the X and Y axis
plt.gca().spines['bottom'].set_color('blue')
plt.gca().spines['left'].set_color('green')
# Show plot
plt.show()

# Importing libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Extracting features and target variable
X = Clean_superstore[['Discount']]
Y = Clean_superstore[['Profit']]
# Creating Linear Regression model
model = LinearRegression()
# Fitting the model
model.fit(X, Y)
# Making predictions
Y_pred = model.predict(X)
# Calculating the R-squared to evaluate the model's performance
r_squared = r2_score(Y, Y_pred)
# Print R-squared value
print("R-squared:", r_squared)

Multiple Linear Regression Model: Actual vs Predicted Profit Comparison by Discount

# Importing from libraries
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Defining independent variables (X) and dependent variable (Y)
X = Clean_superstore[['Discount', 'Sales', 'Quantity']]
Y = Clean_superstore['Profit']
# Creating model
model = LinearRegression()
model.fit(X, Y)
# Predicting the dependent variable based on the independent variables
Y_pred = model.predict(X)
# Scatter plot of actual and predicted profit against discount
plt.scatter(Clean_superstore['Discount'], Y, label='Actual Profit')
plt.scatter(Clean_superstore['Discount'], Y_pred, color='red', label='Predicted Profit')
# Setting labels and title
plt.xlabel('Discount', color='blue')
plt.ylabel('Profit($)', color='green')
plt.title('Actual vs Predicted Profit Comparison by Discount')
# Setting colours for axes
plt.gca().spines['bottom'].set_color('blue')
plt.gca().spines['left'].set_color('green')
plt.legend()
# Showing plot
plt.show()

# Importing libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Extracting features and variables
X = Clean_superstore[['Discount', 'Sales', 'Quantity']]
Y = Clean_superstore[['Profit']]
# Creating model
model = LinearRegression()
model.fit(X, Y)
# Making predictions on the dependent variable based on the independent variables
Y_pred = model.predict(X)
# Calculate R-squared
r_squared = r2_score(Y, Y_pred)
# Calculate the adjusted R-squared
n = X.shape[0]
p = X.shape[1]
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
# Print R-squared and adjusted R-squared
print("R-squared:", r_squared)
print("Adjusted R-squared:", adjusted_r_squared)

K-Means Clustering Model

# Assigning the clustering features by name so that they match the 'Sales' and 'Profit' axes plotted afterwards
x = Clean_superstore[['Sales', 'Profit']].values
# Importing libraries
from sklearn.cluster import KMeans
# Creating an empty list to store the sum of squares within the cluster
wcss = []
# Specifying range
cluster_range = range(1, 11)
# Iterating over each cluster count
for i in cluster_range:
    # Creating a KMeans instance
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    # Fitting the model
    kmeans.fit(x)
    # Adding element (WCSS) to the end of the list with append()
    wcss.append(kmeans.inertia_)

# Importing libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Creating grid
sns.set_style("white")
g = sns.FacetGrid(Clean_superstore, hue="Sub-Category", height=6)
# Mapping scatter plot and cluster centres
g.map(plt.scatter, 'Sales', 'Profit')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c='black', label='Cluster Center')
plt.title('K-Means Clustering Model: Profit vs Sales')
plt.legend()
# Show plot
plt.show()

Three modelling techniques were employed to analyse the data: simple linear regression, multiple linear regression, and K-means clustering. Simple linear regression was chosen to model the relationship between a continuous target variable and a single predictor variable, assuming a linear relationship (Skiena, 2017). Following this, multiple linear regression, an extension of simple linear regression, was utilised to model the relationship between a continuous target variable and several predictor variables. This technique allows for exploring how various factors simultaneously influence the target variable (Uyanik and Güler, 2013). Lastly, K-means clustering, an unsupervised learning algorithm, was employed for exploratory data analysis and pattern discovery. Its scalability and efficiency make it particularly appealing for data analytics (Han et al., 2012).

The simple linear regression model produced a coefficient of 0, suggesting no linear relationship between discount and profit; this may reflect a more complex, non-linear relationship or data issues such as missing values. The multiple regression analysis aimed to capture the combined effects of discount, sales, and quantity on profit, providing insights into how these variables influence profit prediction. Despite accounting for these factors, the model showed a discrepancy between predicted and actual profit, revealing a concerning and unexpected trend of notably declining profitability. This highlights the challenges in maintaining profit margins under discounting strategies.
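To interpret a fitted regression, the coefficients can be read directly from the model. The sketch below fits a multiple regression to synthetic data generated from a known, invented formula, so the recovered coefficients can be checked against it; none of the numbers relate to the Superstore results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic predictors for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Discount": rng.uniform(0, 0.8, size=200),
    "Sales": rng.uniform(1, 500, size=200),
    "Quantity": rng.integers(1, 15, size=200),
})

# Invented, noise-free relationship, used only so the fit has a known answer
toy["Profit"] = 0.3 * toy["Sales"] - 100 * toy["Discount"] + 2 * toy["Quantity"]

features = ["Discount", "Sales", "Quantity"]
model = LinearRegression().fit(toy[features], toy["Profit"])

# Each coefficient is the change in predicted profit for a one-unit
# increase in that predictor, holding the other predictors fixed
coefficients = pd.Series(model.coef_, index=features)
print(coefficients)
print("Intercept:", model.intercept_)
```

Because discount is stored on a 0 to 1 scale, its coefficient describes the effect of moving from no discount to a 100% discount; dividing by 100 gives the effect of one percentage point.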

The regression models were evaluated using the R-squared and adjusted R-squared values. The results suggested a weak linear relationship between the predictors (discount, sales, and quantity) and profit, demonstrating that while these variables can somewhat predict profit, other unaccounted factors likely influence profit variability.
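The adjusted R-squared used above penalises R-squared for the number of predictors: with n observations and p predictors it equals 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below wraps this formula in a hypothetical helper and evaluates it on invented figures, not on the report's results.

```python
def adjusted_r_squared(r_squared: float, n_samples: int, n_predictors: int) -> float:
    """Adjust R-squared for the number of predictors in the model."""
    degrees_of_freedom = n_samples - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("n_samples must exceed n_predictors + 1")
    return 1 - (1 - r_squared) * (n_samples - 1) / degrees_of_freedom

# Invented figures: R-squared of 0.25 from 100 observations and 3 predictors
example = adjusted_r_squared(0.25, n_samples=100, n_predictors=3)
print(example)  # 1 - 0.75 * 99 / 96 = 0.2265625

# With no explanatory power at all, the adjusted value falls below zero
no_fit = adjusted_r_squared(0.0, n_samples=100, n_predictors=3)
print(no_fit)  # 1 - 99 / 96 = -0.03125
```

The adjusted value is always at or below the unadjusted one, and the difference shrinks as the number of observations grows relative to the number of predictors.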

Lastly, the K-Means Clustering Model visualised the dataset and categorised it by sales and profit. The scatter plot displays products sold, with sales on the X-axis and profit on the Y-axis. The clustering algorithm has grouped these products into distinct clusters, with each centroid (Cluster Centre) shown by a black point and the separation between clusters indicating differences in performance. The low-profit, low-sales cluster (blue), which includes machines, exhibits low sales and low profitability. These may be underperforming or niche products that contribute minimally to overall profitability. The mid-range profit, mid-range sales cluster (green), which includes binders and phones, represents products with moderate sales and profitability and stable, consistent performance. Products within this category could be made more efficient to generate higher profit. A further cluster, with moderate profit and low sales (purple), includes tables and shows a mid-range profit despite lower sales volumes. This could be attributed to factors such as seasonal trends or new products in the early stages of gaining traction. The final noticeable cluster, with high profit and high sales (pink), includes copiers and chairs and represents the most successful and profitable products. These products drive significant revenue and are likely to be key contributors to the company's overall profitability; capitalising on them to sustain this growth could therefore be valuable.
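In the clustering code, the loop over cluster counts leaves the last fitted model in 'kmeans', which has ten clusters. A common follow-up is to choose the number of clusters from the within-cluster sum of squares (the elbow method) and refit with that number. The sketch below assumes this workflow and adds feature scaling, which the report does not use; the data points are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic Sales/Profit points in three loose groups; not the Superstore data
rng = np.random.default_rng(0)
centres = np.array([[50.0, 5.0], [500.0, 100.0], [2000.0, -300.0]])
points = np.vstack(
    [c + rng.normal(scale=[10.0, 5.0], size=(40, 2)) for c in centres]
)

# Standardising so that neither feature dominates the distance calculation
scaled = StandardScaler().fit_transform(points)

# Elbow method: record the within-cluster sum of squares for each k
wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(scaled)
    wcss.append(km.inertia_)
print(wcss)

# Refitting with the k chosen from the elbow (3 for this toy data, by construction)
final = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(scaled)
labels = final.labels_
print(final.cluster_centers_)
```

Plotting 'wcss' against k would show the curve flattening after the chosen value; the centres of 'final' can then be drawn on the scatter plot so that the number of black points matches the number of clusters discussed.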

Conclusion

To conclude, a comprehensive analysis of the 'Superstore.csv' dataset has yielded insights to guide strategic decision-making and improve profitability. By employing various analytical techniques, including linear regression and K-means clustering, this report has provided a deeper understanding of the relationships between discounts, sales, quantity, and profit. However, it must be recognised that other unaccounted factors may influence the results. Further exploration and analysis of additional variables are therefore necessary for a more holistic understanding and prediction of profitability in this dataset.

Recommendations

• Invest in further data collection and analysis to identify additional factors influencing profitability, such as customer demographics and market trends.
• Adopt advanced analytics techniques, including sentiment analysis and predictive modelling, to forecast market trends and changes in consumer attitudes more accurately.
• Enhance inventory management practices by leveraging clustering insights to identify high-demand products and seasonal trends, and utilise inventory management tools to streamline procurement and distribution processes.
• Use insights from the regression analysis to implement dynamic pricing strategies, drawing on machine learning algorithms to maximise revenue and profitability whilst remaining competitive in the market.

References:

Cui, Y. et al. (2022) ‘The influence of big data analytic capabilities building and education on business innovation’, Frontiers in Psychology. DOI: 10.3389/fpsyg.2022.999944.

Han, J., Kamber, M. and Pei, J. (2012) Data Mining: Concepts and Techniques. 3rd Edition, Morgan Kaufmann.

Sharda, R. et al. (2020) Systems for Analytics, Data Science, and Artificial Intelligence: Systems for Decision Support, Global Edition. 11th Edition, Pearson.

Skiena, S.S. (2017) The Data Science Design Manual. Cham: Springer Nature.

Uyanik, K.G. and Güler, N. (2013) ‘A Study on Multiple Linear Regression Analysis’, Procedia – Social and Behavioral Sciences, 106, pp. 234-240.
