# Big Data Analytics Coursework 2

Radhika Patil - 13-03-2024

## Introduction

This consultancy report aims to improve the profitability of the given Superstore. The analysis examines the superstore's historical sales records, combining analytical techniques with domain understanding to turn complex data patterns into actionable insights and to identify the factors that substantially influence profitability. The data has been cleaned and examined to highlight the elements that drive profit and to identify underperforming areas where focused improvements can support growth.

## Data Preparation

## Data Cleaning

There are no missing values in the data. The data types seem to be accurate.

The 'Country' column can be ignored because it has only one unique value. There do not appear to be any data entry errors.
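
The checks described above can be sketched in pandas. This is a minimal illustration on a hypothetical four-row frame, not the report's actual data or code; the column names are assumed to follow the usual Superstore layout.

```python
import pandas as pd

# Hypothetical miniature of the Superstore data; column names are assumptions.
df = pd.DataFrame({
    "Country": ["United States"] * 4,
    "Region": ["West", "East", "West", "South"],
    "Sales": [261.96, 731.94, 14.62, 957.58],
    "Quantity": [2, 3, 2, 5],
})

# Missing values per column (all zeros means there is nothing to impute).
missing_per_column = df.isna().sum()

# Columns with a single unique value carry no information for the analysis.
constant_columns = [col for col in df.columns if df[col].nunique(dropna=False) == 1]

# Drop the constant columns and inspect the remaining data types.
df_clean = df.drop(columns=constant_columns)

print(missing_per_column)
print(constant_columns)
print(df_clean.dtypes)
```

On the real data, the same three checks would confirm the absence of missing values, the single-valued 'Country' column, and the inferred data types.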

### Outliers

We move on to outlier detection and removal as this step is crucial for maintaining data integrity and model accuracy, making the analyses more reliable.

There are clear outliers where quantity is 10000.

The plot's density appears to be uniform across different quantities, but there are some quantities with notably more transactions, as indicated by denser clusters of points (e.g., quantities 2, 3, 5, and 7).

The presence of outliers is evident from the graph above.

The data appear to be approximately normally distributed with a very high peak. Therefore, a mean-based rule should be used to detect outliers.
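
One common mean-based rule is the z-score: flag any value more than a chosen number of standard deviations from the mean. The sketch below assumes a threshold of 3 and uses made-up quantities; the report does not state which threshold or implementation was used.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Boolean mask of values lying more than `threshold` standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values identical: nothing can be an outlier.
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold

# Hypothetical quantities with one implausible entry, mirroring the 10000 case.
quantity = np.array([2, 3, 5, 7, 2, 3, 4, 1, 6, 3, 2, 5, 3, 4, 2, 7, 3, 5, 2, 10000])
outlier_mask = zscore_outliers(quantity)
print(quantity[outlier_mask])
```

A caveat of this rule is that extreme values inflate both the mean and the standard deviation, so in very small samples even an obvious outlier may fall under the threshold.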

There's a dense concentration of points around the profit range slightly above and below zero units. This suggests that most transactions result in a small profit or loss.

The data appear noisy, with some discounts exceeding 100%.

The scatter of points at higher discount levels, such as 0.5 (50%) and above, suggests that these substantial discounts are not rare occurrences. Such heavy discounting could have a large impact on profitability. There is also a significant number of sales with no discount applied, as indicated by the dense vertical band at the 0.0 discount level.

Outliers appear to be present above 5000 units of sales.

As the data distribution is skewed, the median should be used for outlier handling.
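
A median-based rule that is often used for skewed data is the modified z-score, built on the median absolute deviation (MAD). The sketch below is one possible reading of "median-based" handling, with the conventional 0.6745 scaling constant and a 3.5 threshold assumed; the values are illustrative only.

```python
import numpy as np

def mad_outliers(values, threshold=3.5):
    """Boolean mask of values whose modified z-score (median/MAD based) exceeds `threshold`."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # More than half of the values equal the median; the score is undefined.
        return np.zeros(values.shape, dtype=bool)
    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > threshold

# Hypothetical right-skewed sales amounts with one extreme transaction.
sales = np.array([12, 25, 40, 18, 60, 33, 22, 75, 48, 30, 8000])
outlier_mask = mad_outliers(sales)
print(sales[outlier_mask])
```

Because the median and MAD are barely affected by the extreme value itself, this rule stays stable where a mean-based rule would be distorted.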

The concentration of points at the lower end of the sales range might suggest focusing on strategies to either increase the frequency of higher-value sales or enhance the profitability of the more common, lower-value transactions.

## Understanding Data

## Summary Statistics

Sales: A relatively high standard deviation of 424.50 compared to the mean of 209.43 suggests significant variability in sales amounts, with some transactions being much higher or lower than the average. The 25th, 50th (median), and 75th percentiles suggest a right-skewed distribution of sales, with most transactions being on the lower end of the scale.

Quantity: On average, each transaction includes about 3.78 items with a small standard deviation of 2.22 relative to the mean which suggests most transactions include a small number of items.

Discount: Discounts range from 0% (no discount) to 80% (a significant discount), which could indicate clearance sales or special promotions.

Profit: The large standard deviation of 132.10 compared to the mean profit of 25.50 suggests high variability in profitability, from losses to significant gains. The 50th percentile (median) is lower than the mean, which again indicates a right-skewed distribution with a few high-profit transactions pulling the average above the median.
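
The skewness reasoning used above (standard deviation large relative to the mean, mean above the median) can be checked directly from `describe()`. The series below is hypothetical and only illustrates the comparison.

```python
import pandas as pd

# Hypothetical right-skewed sales values, not the report's data.
sales = pd.Series(
    [12.0, 18.5, 25.0, 30.0, 45.0, 60.0, 95.0, 150.0, 420.0, 1800.0],
    name="Sales",
)

summary = sales.describe()  # count, mean, std, min, quartiles, max

# Ratio of standard deviation to mean: values well above 1 signal high variability.
variation_ratio = summary["std"] / summary["mean"]

# A mean above the median is a quick indicator of a right-skewed distribution.
mean_above_median = bool(summary["mean"] > summary["50%"])

# Sample skewness: positive values indicate a longer right tail.
skewness = sales.skew()

print(summary)
print(variation_ratio, mean_above_median, skewness)
```

For the reported sales figures the same ratio is 424.50 / 209.43, roughly 2.03, which is consistent with the high variability described.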

The summary suggests a focus on ‘Standard Class’ shipping and the ‘Consumer’ segment to maximise profits, as they are the most frequent. Leveraging ‘Office Supplies’ in high-volume locations like ‘California’ and ‘New York City’ could be key. Prioritising popular product offerings such as ‘Binders’ and ‘Staple envelopes’ may improve profitability.

## Exploratory Data Analysis

### Evaluating Product Performance

Most sub-categories are profitable, with two showing exceptionally high profits. Phones and Accessories stand out with the highest profits; they appear to be key drivers of profitability and should be a focus for sales and marketing efforts.

Conversely, some sub-categories are significantly underperforming and incur large losses. These products require immediate attention to identify issues related to cost, pricing, or demand.

Strategic actions could include promoting high-profit sub-categories, re-evaluating the pricing strategy, and possibly discontinuing or revamping the loss-making sub-categories to optimise overall profit.

### Highlighting Loss making locations

It is important to recognise loss-making areas so that campaigns or promotional activities can be carried out there to expand the business and sustain profits.

### Studying Categorical variables against profit

The data reveals the Consumer segment, West region, and Technology category as the most profitable areas, suggesting targeted investment and expansion there.

The Corporate segment and East region show potential for growth, warranting further analysis and tailored strategies.

The Home Office segment, Central and South regions, and Furniture and Office Supplies categories lag behind, necessitating a review of operations, cost structures, and market strategies.

### Region wise Category analysis for Profits

The West region consistently shows strength across all categories, hinting at a successful regional strategy.

The disparity in profits, especially in office supplies, suggests opportunities for cross-regional learning and strategy adaptation. The furniture category may need a comprehensive review of cost, sales strategy, or customer preference to enhance profitability.

Technology's balanced profit suggests stable demand but also hints at the potential for targeted growth strategies in specific regions.

The Furniture category in the Central region appears to be a problem area, as it is loss-making and demands review.

### Region wise Segment analysis for Profits

The West region is a strong performer across all segments, indicating effective regional strategies.

Consumer and Corporate segments show potential for growth in the Central and South regions.

The more balanced distribution in the Home Office segment suggests different market dynamics or competitive advantages that may be unique to that segment. These insights could guide regional strategy optimisations and resource allocations.

## Derived Features

Derived features, engineered from existing data, are crucial for enhancing machine learning model performance. They can unveil hidden insights by capturing additional information not represented by the raw features, thereby improving model accuracy and aiding in the discovery of more complex patterns in data.
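
As an illustration, the derived features discussed in this report (profit margin, shipping time, product popularity, order size, and sales per order) could be built as follows. The exact definitions used in the analysis are not given, so the formulas, column names, and sample rows below are assumptions.

```python
import pandas as pd

# Hypothetical transactions; column names follow the usual Superstore layout.
df = pd.DataFrame({
    "Order ID": ["A1", "A1", "B2", "C3"],
    "Product ID": ["P1", "P2", "P1", "P3"],
    "Order Date": pd.to_datetime(["2017-01-03", "2017-01-03", "2017-02-10", "2017-03-15"]),
    "Ship Date": pd.to_datetime(["2017-01-07", "2017-01-07", "2017-02-12", "2017-03-15"]),
    "Sales": [100.0, 50.0, 200.0, 80.0],
    "Quantity": [2, 1, 4, 3],
    "Profit": [20.0, -5.0, 50.0, 8.0],
})

# Profit earned per unit of sales (assumed definition; undefined when Sales is 0).
df["Profit Margin"] = df["Profit"] / df["Sales"]

# Days between ordering and shipping.
df["Shipping Time"] = (df["Ship Date"] - df["Order Date"]).dt.days

# Number of distinct orders that contain each product (assumed popularity measure).
df["Product Popularity"] = df.groupby("Product ID")["Order ID"].transform("nunique")

# Total items and total sales value of the order each line belongs to.
df["Order Size"] = df.groupby("Order ID")["Quantity"].transform("sum")
df["Sales per Order"] = df.groupby("Order ID")["Sales"].transform("sum")

print(df[["Profit Margin", "Shipping Time", "Product Popularity", "Order Size", "Sales per Order"]])
```

Using `transform` keeps every derived value aligned with the original transaction rows, so the new columns can be fed straight into correlation analysis or a model.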

## Correlation

### Raw Continuous Variable Correlation with Profit

Sales vs Profit: Higher sales seem to correlate with increased profit, but there are instances of high sales with low or negative profit, suggesting that higher sales do not always guarantee higher profits.

Discount vs Profit: There is no clear trend suggesting that higher discounts lead to higher profits. In fact, larger discounts seem to occasionally result in significant losses.

Quantity vs Profit: Similar to discounts, there isn’t a straightforward correlation between quantity sold and profit. While higher quantity sales often have positive profit, some high-quantity sales result in losses.

### Derived Continuous Variable Correlation with Profit

- There is a strong positive correlation between profit and profit margin, which is expected since higher margins typically lead to higher profits.
- Product popularity has a moderately negative correlation with profit and profit margin, suggesting that more popular products might not always be the most profitable, possibly due to competitive pricing.
- Sales per order have a noticeable positive correlation with order size, indicating that larger orders tend to generate higher sales.
- Shipping time does not show a strong correlation with profit, suggesting that the efficiency or speed of shipping may not significantly impact profitability.

This information can guide strategies to enhance profit margins and reconsider product mix to boost profitability, potentially focusing on less popular but more profitable items, and targeting larger order sizes.
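
Correlations of this kind are typically read from a pairwise correlation matrix. The sketch below uses synthetic data with an assumed relationship between discount and profit, purely to show the mechanics; it does not reproduce the report's figures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: the relationship below is an assumption for illustration.
sales = rng.uniform(20, 500, size=n)
discount = rng.choice([0.0, 0.2, 0.5, 0.8], size=n)
profit = sales * (0.3 - 0.6 * discount) + rng.normal(0, 5, size=n)

df = pd.DataFrame({"Sales": sales, "Discount": discount, "Profit": profit})
df["Profit Margin"] = df["Profit"] / df["Sales"]

# Pearson correlation between every pair of continuous variables.
corr = df.corr(method="pearson")

# Correlation of each variable with profit, strongest positive first.
profit_corr = corr["Profit"].drop("Profit").sort_values(ascending=False)
print(profit_corr)
```

Pearson correlation only captures linear association; for heavily skewed variables, `method="spearman"` is a rank-based alternative worth comparing.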

### Categorical Variable Correlation with Profit

The spread of the profit data within each category, particularly the inter-quartile range, suggests there is significant variability in profitability across categories. This could imply that the category a product belongs to might influence the variability of its profitability.

Technology is potentially the most profitable category but with a high variability, suggesting some sales are highly profitable while others may not be. Furniture has the lowest median profitability and also a wide range of profit outcomes, indicating inconsistency in profits. Office Supplies, while not achieving the high-profit values of Technology, indicates steady and consistent profits. To improve overall profitability, strategies could focus on increasing sales of high-margin technology products while improving the profit consistency in Furniture and maintaining the steady gains in Office Supplies.

## Modelling

### Customer Analysis using RFM

Customers are clustered to understand how different groups influence profit, and by how much.

The Calinski-Harabasz Index, also known as the Variance Ratio Criterion, is the ratio of between-cluster dispersion to within-cluster dispersion, summed over all clusters. Higher values generally indicate a model with better defined clusters. The index has no fixed scale, so the score of 523.91 is best read relative to scores for other cluster counts on the same data; here it suggests that the clustering model has done a reasonably good job of creating well-defined and separated clusters.
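
A minimal sketch of the RFM workflow follows: build Recency, Frequency, and Monetary values per customer, scale them, cluster with k-means, and compare the Calinski-Harabasz score across cluster counts. The transactions are randomly generated and the choice of k-means is an assumption, since the report does not name the clustering algorithm.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 600

# Hypothetical transactions; column names follow the usual Superstore layout.
orders = pd.DataFrame({
    "Customer ID": rng.integers(0, 60, size=n),
    "Order Date": pd.Timestamp("2017-01-01")
    + pd.to_timedelta(rng.integers(0, 365, size=n), unit="D"),
    "Sales": rng.gamma(2.0, 100.0, size=n),
})

# Reference date for recency: the day after the last recorded order.
snapshot = orders["Order Date"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("Customer ID").agg(
    Recency=("Order Date", lambda d: (snapshot - d.max()).days),
    Frequency=("Order Date", "count"),
    Monetary=("Sales", "sum"),
)

# Scale so that no single RFM dimension dominates the distance calculation.
X = StandardScaler().fit_transform(rfm)

# Compare the index across candidate cluster counts; higher is better.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

print(scores)
```

Because the synthetic customers have no real group structure, the absolute scores here mean little; the point is the comparison across values of `k` on the same data.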

### Studying customer relation with profit

Cluster 0's low profit margin implies customers with infrequent or small purchases, suggesting a need for strategies to boost their spending.

Cluster 1 shows variable profits, with certain customers making occasional high-value purchases, indicating an opportunity for targeted promotions to increase their shopping frequency.

Cluster 2, with the highest median profit, consists of the superstore's most valuable customers, who are likely frequent buyers or spend significant amounts per transaction; these customers are prime candidates for loyalty and retention programs. Outliers in Clusters 0 and 1, where profits sometimes dip into losses or surge, require further analysis to minimise losses and capitalise on high-profit sales.

Tailoring strategies to each segment's behaviour and value can elevate the superstore’s profitability, by enhancing customer engagement and optimising sales tactics.

### Regression Modelling

To understand the key drivers behind the superstore's profitability, several models were employed, including decision trees and linear regression. However, most models gave high Mean Squared Error (MSE) and low R-squared (R^2) values. The low scores from linear regression in particular prompted an investigation of nonlinear dynamics within the data. Applying a Polynomial Features Transformation improved model performance, with the best results at degree 2, which hints that the features influence profitability through relationships beyond purely linear ones.

The metrics indicate that a linear regression model with polynomial features of degree 2 performed well, with an MSE of 1405.54 and an R^2 of 0.919. Taking the square root of the MSE gives a root mean squared error of about 37.5, which is small next to the profit standard deviation of 132.10 reported in the summary statistics, so predictions are generally close to actual values. The R^2 value means that approximately 91.9% of the variance in the dependent variable is explained by the model. Incorporating polynomial features has therefore substantially enhanced predictive power, making the model a useful tool for forecasting and decision-making.
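
The comparison between a plain linear fit and a degree-2 polynomial fit can be sketched with scikit-learn as below. The data is synthetic, with an assumed Sales × Discount interaction driving profit, so the resulting scores are illustrative and unrelated to the figures reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n = 1000

# Synthetic features; the profit formula is an assumption for illustration.
sales = rng.uniform(10, 1000, size=n)
discount = rng.choice([0.0, 0.2, 0.4, 0.8], size=n)
quantity = rng.integers(1, 10, size=n)
profit = sales * (0.25 - 0.5 * discount) + rng.normal(0, 10, size=n)

X = np.column_stack([sales, discount, quantity])
X_train, X_test, y_train, y_test = train_test_split(
    X, profit, test_size=0.2, random_state=0
)

def evaluate(model):
    """Fit on the training split and return (MSE, R^2) on the held-out split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return mean_squared_error(y_test, pred), r2_score(y_test, pred)

linear_mse, linear_r2 = evaluate(LinearRegression())
poly_mse, poly_r2 = evaluate(
    make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
)

print(f"linear: MSE={linear_mse:.1f}, R^2={linear_r2:.3f}")
print(f"degree 2: MSE={poly_mse:.1f}, R^2={poly_r2:.3f}")
```

The degree-2 expansion adds squared and pairwise product terms, which lets an otherwise linear model represent effects such as a discount that matters more on larger sales.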

Model coefficients in linear regression indicate the strength and direction of the relationship between each predictor variable and the target variable. A coefficient shows how much the target variable is expected to change when the predictor variable changes by one unit, holding other variables constant.

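The magnitude of a coefficient indicates the strength of the relationship between the corresponding feature and the target variable, while the sign indicates its direction. Magnitudes are only directly comparable across features when the features are on a common scale.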

## Conclusion

By evaluating the intricate relationships between product categories, discounts, customer demographics, and purchasing behaviours, specific areas that can be leveraged to amplify the store's financial performance have been highlighted. Overall, leveraging strengths in high-profit areas while improving underperforming ones through strategic adjustments could drive comprehensive business growth and enhance profitability.
