Data Analytics Coursework 2

Shivaventhen Poobalan

H00456230

C11BD

Introduction

In this analysis, we will be exploring factors that contribute to the profitablity of the company by using data analysis techniques. The primary goal of this analysis is to provide insight into how the company can increase said profits by studying the various attributes found within the datset provided. For the purpose of this analysis, the attributes being focused on would be the sales, discount, and quantity, as these would be the key attributes that affect profits.

Data preperation would involve cleaning the data to removing outliers and handling missing values so that the data is not skewed in anyway, creating a Linear Regression model to predict the profitability, and a Random Forest Regressor to see if it better fits the dataset compared to the Linear Regression model

Step 1: Importing the Data

import pandas as pd #Load the dataset data = pd.read_csv('dataset_Superstore.csv') num_columns = len(data.columns) print(f'Number of columns: {num_columns}') data

Run to view results

The dataset contains 29 columns which include different types of data. The data types vary across numeric values, dates formatted as strings, and categorical data.

The data can also be classified into 2 main classifications. The first classification of data can be considered as identifiers, such as Row ID, Order ID, and Customer ID. The second classification of data is regarding order details, such as, Category, Sub-Category, Product Name, Sales, Quantity, Discounts, and Profits, as well as a Returned flag to show if an order is returned. Some of these data from the order details however can be classed together as the quantitative fields as it encompasses the financial side of the business, namely, Sales, Quantity, Discounts, and Profits. For this analysis, these are the more crucial attributes from the dataset as the aim of this analysis is to increase profitablity.

Cleaning the Data

Before proceeding with analysing a dataset, it is crucial to identify and clean the dataset. This is to ensure that any analysis done is not skewed by the missing data or outliers.

The cleaning process includes:

1) Checking for missing values - This is to identify and address any missing values within the dataset

2) Data type corrections - This is to make sure that data is input according to the right data types, ie: prices, dates, etc.

3) Detecting and removing outliers - Identifying the outliers in the quantitative fields as classified above (Sales, Quantity, Discount, and Profit). The method used for identifying outliers will be based on the statistical measures, such as Interquartile range (IQR) or Z-Score.

4) Checking for data entry errors - This is to look for any anomalies in the data fields that may display errors.

Checking for Missing Values and Data Type Corrections

import seaborn as sns import matplotlib.pyplot as plt #Step 1: Checking for missing values missing_values = data.isnull().sum() #Step 2: Data type corrections #Converting 'Order Date' and 'Ship Date' to datetime format data['Order Date'] = pd.to_datetime(data['Order Date'], dayfirst=True) data['Ship Date'] = pd.to_datetime(data['Ship Date'], dayfirst=True) #Step 3: Detecting and removing outliers, and checking for data entry errors # Identifying Missing data with Heatmap #Heatmap of missing values plt.figure(figsize=(12,6,)) sns.heatmap(data.isnull(), cbar=False, cmap='viridis') plt.title('Heatmap of Missing data') plt.show() missing_values, data.dtypes

Run to view results

Missing Values: There are no missing values in this dataset across all columns

Data Type Corrections: The 'Order Date' and 'Ship Date' columns have been successfully converted to the datetime format, which would make any time-based analysis easier.

Detecting and Addressing Outliers

#Defining a function for detecting outliers using the IQR method def detect_outliers_iqr(data, column_name): Q1 = data[column_name].quantile(0.25) Q3 = data[column_name].quantile(0.75) IQR = Q3 - Q1 lower_bound = Q1 - 1.5 * IQR upper_bound = Q3 + 1.5 * IQR outliers = data[(data[column_name] < lower_bound) | (data[column_name] > upper_bound)] return outliers #Detect Outliers in 'Sales', 'Quantity', 'Discount', 'Profit' outliers_sales = detect_outliers_iqr(data, 'Sales') outliers_quantity = detect_outliers_iqr(data, 'Quantity') outliers_discount = detect_outliers_iqr(data, 'Discount') outliers_profit = detect_outliers_iqr(data, 'Profit') #Summary of Outliers outliers_summary = { "Sales Outliers": outliers_sales.shape[0], "Quantity Outliers": outliers_quantity.shape[0], "Discount Outliers": outliers_discount.shape[0], "Profit Outliers": outliers_profit.shape[0], } outliers_summary

Run to view results

The outlier detection process shows a breakdown of number of outliers in each column. The number of outliers are as follows:

Sales outliers: 1167 records identified as outliers

Quantity outliers: 173 records identified as outliers

Discount outliers: 858 records identified as outliers

Profit outliers: 1881 records identified as outliers

Approach for handling Outliers

Sales & Profit: These will be carefully handled to ensure that there are no extreme values that may skew the results of the analysis, and to avoid removing any data that may potentially be valuable to gain insights.

Quantity & Discount: These two datasets are more of an operational metrics, such as the number of items in an order and the discounts applied. Outliers in these fields may be more indicative of data entry errors, or plausibly exceptional cases. A closer inspection of these extreme outliers will be cruical to determine if they are reasonable within the business context, or if there needs to be any adjustment.

#Identifying Outliers import seaborn as sns import matplotlib.pyplot as plt #Plot for identifying outliers in Sales by Segment plt.figure(figsize=(10,6)) sns.boxplot(x='Segment', y='Sales', data=data) plt.title('Box Plot of Sales by Segment Before Cleaning') plt.show() #Plot for identifying outliers in Quantity by Segment plt.figure(figsize=(10,6)) sns.boxplot(x='Segment', y='Quantity', data=data) plt.title('Box Plot of Quantity by Segment Before Cleaning') plt.show() #Plot for identifying outliers in Discount by Segment plt.figure(figsize=(10,6)) sns.boxplot(x='Segment', y='Discount', data=data) plt.title('Box Plot of Discount by Segment Before Cleaning') plt.show() #Plot for identifying outliers in Profit by Segment plt.figure(figsize=(10,6)) sns.boxplot(x='Segment', y='Profit', data=data) plt.title('Box Plot of Profit by Segment Before Cleaning') plt.show()

Run to view results

The boxplots above provide a comprehensive visualisation of the distributions of each quantitative field, Sales, Quantity, Discount, and Profit, according to its segments, being consumer purchases, corporate purchases, and home office purchases. The reason the fields are plotted according to segments is to provide a better picture of the distribution of the data.

Before we proceed to remove the outliers for further analysis, it is important to understand why the outliers may have happened in the first place. Based on the boxplots we can see that outliers in profits is most likely to be caused by the outliers in Discount and Quantity. Meaning that when there is a Higher discount given, the quantity of items sold is also abnormally higher, which leads to abnormally higher profits as well.

However, there also is an outlier in the losses incurred by the business in the consumer segment. Considering that normal consumers would not buy in bulk as a corporate customer would, providing them with higher discounts may be what proved to be detrimental to profits in that particular segment.

Data Cleaning

When removing outliers, it is important to remove only those that are necessary in order to maintain the integrity of the data. In the case of this analysis, outliers from Quantity and Sales columns will be removed due to its high impact on the analysis. These outliers can significantly skew the analysis as it fetured directly influence the company's revenue and consequently, its profitability. Extremely large or small values in these fields might not be representative of the typical business operations and could lead to misleading conlusions about trends and patterns. These outliers may also represent special cases, such as bulk purchases or promotional events, which do not reflect the normal purchasing behaviour. By focusing on more typical transactions, the analysis aims to uncover patterns that are broadly applicable to the company's standard operations.

Outliers from Profit and Discount on the other hand will not be removed as outliers in profits incorporates a sense of business relaity. Profits, including losses, or negative profits, are intrinsic to the business's reality. Large variatons in profits are expected due to the diverse nature of products, sales agreements, customer segments, and so on. Signigicant proit or losses might reveal important information about what works well or what doesnt in the business model. High profit outliers may also identify particularly successful products or sales strategies, while significant losses could highlight problem areas or products that may need to be reevaluated.

As for discounts, the range of discounts applied to products can vary widely depending on the company's pricing strategies, promotional activities, and custimer negotiations. Since discounts directly affect sales and profitability, analysing the full range of applied discounts can offer insights into their impact on sales performances and profit margins.

# Cap Discount at 1.0 data['Discount'] = data['Discount'].clip(upper=1.0) # Function to remove outliers based on the IQR method def remove_outliers(df, column_name): Q1 = df[column_name].quantile(0.25) Q3 = df[column_name].quantile(0.75) IQR = Q3 - Q1 lower_bound = Q1 - 1.5 * IQR upper_bound = Q3 + 1.5 * IQR return df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)] data_cleaned = remove_outliers(data, 'Quantity') data_cleaned = remove_outliers(data_cleaned, 'Sales') data_cleaned

Run to view results

original_size = data.shape[0] cleaned_size = data_cleaned.shape[0] removed_records = original_size - cleaned_size original_size, cleaned_size, removed_records

Run to view results

After removing outliers from the Quantity and Sales columns, the changes within the dataset are as follows:

Original Dataset Size: 9994 records

Cleaned Dataset Size: 8685 records

Removed Records: 1309 records

It is crucial to consider the objectives of the analysis when deciding which outliers to remove, and the potential impact of retaining or removing these data points. For this analysis, the approach was chosen to maintain as much relevant information as possible while ensuring the dataset reflects typical transactions that are the most useful for identifying broad trends and patterns affecting probability.

Calculating Summary Statistics

With the cleaned data, we compute summary statistics to gain insights into the central tendency, dispersion, and shape of the dataset's features.

# Calculate summary statistics of the cleaned data print(data_cleaned.describe(include='all'))

Run to view results

Numerical Fields Summary:

Sales: The average sales amount is $91.25, with a standard deviation of $112.26, showing a wide range of sales values. The maximum sales value after cleaning is $489.23, greatly reduced from the original dataset before removing outliers.

Quantity: The average quantity of items sold per transaction is 3.49, with a standard deviation of 1.91, meaning the quantity of items sold per transaction is clustered tightly around the mean and there isnt much deviation. The maximum number of quantity of items sold per transaction is 9, which is a fair bit more than the average.

Discount: The discounts applied average at 15.8% with a standard deviation of 21.2%. The maximum discount given however is 100%, which is an extreme outlier, and could either be an error, or possibly part of a sales strategy, such as a buy-1-free-1, in which case one product would be getting a 100% discount applied.

Profit: The average profit per transaction is $10.66 with a standard deviation of $48.56, which shows a wide range in profit numbers. The maximum profit is $239.99, however the minimum profit, or the highest loss incurred, is $1,181.28, which is a very big loss incurred and could potentially be an outlier, or the result of a sales strategy that went horribly wrong.

Plotting the Data

Visualisations are created to better understand the distribution and relationships within the data. A bar chart shows the distribution of transactions across customer segments, and a scattor plot illustrates the relationship between sales and profit.

For this analysis, visualising and understanding the distribution of transactions across customer segments is crucial as it can provide insights into the customer base and how different segments contribute to the overall transaction volume. This information is important for tailoring marketing strategirs, product offerings, and customer service to the most prevalent segments.

Exploring how the sales figures related to profits on the other hand is fundamental to understanding the company's profitability dynamics. By examining this relationship, we can identify trends such as whether higher sales volumes consistently lead to higher profits, or if there are diminishing returns or even losses associated with certain sales levels. This analysis can guide pricing strategies, cost management, and product focus.

import matplotlib.pyplot as plt import seaborn as sns # Set the aesthetic style of the plots sns.set_style("whitegrid") # Bar chart of 'Segment' distribution plt.figure(figsize=(10, 6)) sns.countplot(data=data_cleaned, x='Segment') plt.title('Distribution of Transactions Across Customer Segments') plt.xlabel('Customer Segment') plt.ylabel('Number of Transactions') plt.show() # Scatter plot of 'Sales' vs. 'Profit' plt.figure(figsize=(10, 6)) sns.scatterplot(data=data_cleaned, x='Sales', y='Profit', hue='Segment', style='Returned', alpha=0.6) plt.title('Relationship Between Sales and Profit') plt.xlabel('Sales') plt.ylabel('Profit') plt.legend(title='Segment / Returned') plt.show()

Run to view results

Insights from the Graphs

Distribution of Transactions Across Customer Segments (Bar Chart)

From the bar chart above, we can confidently say that the consumer segment makes the most transaction among the three customer segments, almost double that of the second highest contributors, the corporate segment, followed by the lhome office segment that has the least distribution of transactions.

Relationship between Sales and Profit (Scatter Plot)

The scatter plot, which explores the relationship between sales and profits, shows a generally positive correlation among most sales and profits, as shown by how the profit increases as the sales increase. However, there is also a wide varience in the profits within the same slaes numbers as well. This shows that other factors, such as product mix, discounts, and other operational anomalies may have led to highly affecting the profit margins.

Choosing and Applying a Modelling Strategy

I have selcted a Linear Regression model to predict profitability, justifying it as a good fit for continuous outcome prediction and its interpretability. The data is prepared by selecting relevant features and splitting it into training and testing sets.

from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error, r2_score # Prepare the data for modeling X = data_cleaned[['Sales', 'Quantity', 'Discount', 'Segment_no']] y = data_cleaned['Profit'] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Train the Linear Regression model lr_model = LinearRegression() lr_model.fit(X_train, y_train) # Make predictions and evaluate the model y_pred = lr_model.predict(X_test) print('MSE:', mean_squared_error(y_test, y_pred)) print('RMSE:', mean_squared_error(y_test, y_pred, squared=False)) print('R-squared:', r2_score(y_test, y_pred))

Run to view results

The results of the Linear Regression model provides insights into its performance in predicting the Profit based on features like sales, quantity, discount, and the segments.

Mean Squared Error (MSE): 1642.2233

MSE measures the average squared difference between the estimated values and the actual value. In this context, and MSE of 1642.233 means that on average, the model's predictions deviate from the actual proftis by the square root of this value (RSME). The MSE gives an idea of the magnitude of errors, but because it is in squared units of the targe variable (profits), it is not immediately intuitive.

Root Mean Squared Error (RSME): 40.5245

RSME is the square root of MSE and provides a measure of the average deviation of the model's predictions from the actual profits in the same unit as the target variable (profits). An RSME of 40.5245 means that, on average, the model's profit predictions are about $40.52 off from the actual profit values. This metric is particularly useful because it gives an error magnitude that can be directly related to the target variable's units, making it easier to interpret the model's accuracy.

R-Squared (R^2): 0.223

R^2 measures the proportin of the variance in the dependent variable that is predictable from the independent variables. With an R^2 of 0.223, this model explains about 22.3% of the variance in profits. This suggests that the features included in the model (Sales, Quantity, Discount, and Segment_no) provide some insight into the profit, byt a significant portion of variability in profit is not captured by these variables. This could imply that there are other factors affecting profit that are not included in the model or that the relationship between the predictors and profit is not entirely linear.

Using Random Forest Regressor for Cross-Validation

To further validate the model's performance and improve upon the insights gained from Linear Regression, we can use the Random Forest Regressor as it is a powerful ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the mean prediction of the individual trees. Random Forests can capture complex nonlinear relationships between features and the target variables, potentially providing better accuracy than linear models. Random Forest is less prone to overfitting compared to indivirual decsion trees because it averages the results of many trees. It can also handle a mix of numerical and categorical features and does not require scalling of data.

Preparing the Data

Data preperation steps do not need to be altered from the ones used for the linear regression model. 'X_train', 'X_test', 'y_train', and 'y_test' datasets can be used again.

from sklearn.ensemble import RandomForestRegressor # Initialize the Random Forest Regressor rf_model = RandomForestRegressor(n_estimators=100, random_state=42) # 100 trees # Fit the model on the training data rf_model.fit(X_train, y_train) # Predict on the test set y_pred_rf = rf_model.predict(X_test) # Evaluate the model mse_rf = mean_squared_error(y_test, y_pred_rf) rmse_rf = mse_rf ** 0.5 r2_rf = r2_score(y_test, y_pred_rf) print(f'Random Forest MSE: {mse_rf}, RMSE: {rmse_rf}, R^2: {r2_rf}')

Run to view results

# Get feature importances importances = rf_model.feature_importances_ # Summarize feature importances for feature, importance in zip(X.columns, importances): print(f'{feature}: {importance}') # Plotting feature importances plt.figure(figsize=(10, 6)) sns.barplot(x=importances, y=X.columns) plt.title('Feature Importance in Random Forest Model') plt.xlabel('Importance') plt.ylabel('Feature') plt.show()

Run to view results

Model Performance Metrics:

Mean Squared Error (MSE): 470.7029

The MSE is significantly lower compared to that of the Linear Regression, which means that this model has a better fit to the data.

Root Mean Squared Error (RMSE): 21.6957

The RSME of 21.6957 suggests that, on average, the model's predictions are within approximately +/- 21.6957 units of the actual profit values.

A lower RSME value relative to the dataset's target variable range indicates a more accurate model.

R-Squared (R^2): 0.7773

An R^2 value of 0.7773 means that approximately 77.73% of the varianve in profit can be explained by the model. This is a substantial improvement over the linear model, indicating a much better fit to the data.

Features Importances

The feature importances provides an insight into which variables are most influential in predicting the target variable, which in the case of this analysis is the Profits.

1) Sales (0.4453): This value suggests that the 'Sales' feature is responsible for about 44.53% of the decision-making in the model. It's a significant driver of profit variability.

2) Quantity (0.0505): 'Quantity' has a much smaller impact, contributing to only about 5.05% of the predicitive power. It indicates that while quantity sold has some influence, it is not as crucial as other factors.

3) Discount (0.4854): 'Discount' is the most influential feature, accounting for 48.54% of the model's decision-making. This highlights the critical role that discounting plays in affecting profitability, suggesting that careful management of discount strategies could significantly impact the company's profits.

4) Segment_no (0.0188): This feature, which represents the customer segment, has a minimal influence on predicting profit, contributing to only about 1.88% of the model's predictions. It suggests that the segment of the customer is not a major determinant of profitability for individual transactions.

How to Proceed

Given the insights from the Random Forest model:

Strategic Implications: The dominant influence of Sales and Discount on profitability suggests that the company should closely monitor its pricing and discounting strategies to optimize profit. While aiming to increase sales, the company should also find the optimal discount level that maximizes profit without negatively impacting sales volume.

Further Analysis: Consider exploring more nuanced relationships between these features and profit. For instance, there might be threshold effects for discounts beyond which profitability significantly drops.

Model Refinement: You could refine the model by tuning hyperparameters or exploring interactions between features. Additionally, considering other models or ensemble methods could further improve predictive performance.

Operational Changes: Based on the model's findings, operational adjustments such as targeted marketing strategies, dynamic pricing models, or customer segmentation approaches could be developed to enhance profitability.

Findings and Recommendations

1. Optimize Discount Strategies

Finding: Discounts have a significant impact on profitability, accounting for about 48.54% of the model's decision-making. Recommendation: Implement a more strategic approach to discounting. Analyze the elasticity of demand for different products and adjust discounts to optimize profit margins. Consider limiting discounts on high-demand products while leveraging them to increase the volume on slower-moving items, ensuring overall profitability is maintained or enhanced.

2. Focus on Sales Enhancement Tactics

Finding: Sales contribute to approximately 44.53% of the decision-making in predicting profit. Recommendation: Develop and implement strategies to boost sales, such as targeted marketing campaigns, loyalty programs, and personalized promotions. Additionally, focus on cross-selling and upselling to increase the average transaction value. Enhancing online and in-store shopping experiences can also lead to increased sales.

3. Re-evaluate Product and Pricing Mix

Finding: The significant influence of sales and discounts on profitability underscores the importance of the product mix and pricing strategies. Recommendation: Conduct a comprehensive review of the product portfolio to identify high-margin products and prioritize them in sales and marketing efforts. Adjust pricing strategies based on market demand, competition, and cost considerations to ensure competitive yet profitable pricing.

4. Strategic Customer Engagement

Finding: Customer segment (Segment_no) has a minimal direct impact on profitability according to the model, but understanding customer behavior and preferences can indirectly influence sales and discount strategies. Recommendation: Deepen customer engagement through personalized communication and tailored offerings. Utilize customer data analytics to understand buying patterns and preferences, enabling more effective segmentation and targeted marketing that can drive sales without overly relying on discounts.

5. Continuous Monitoring and Analysis

Recommendation: Establish a continuous feedback loop where sales data, along with discount and profitability metrics, are regularly analyzed. This approach enables the business to quickly identify trends, adjust strategies in real-time, and capitalize on emerging opportunities to enhance profitability.

6. Invest in Data Analytics Capabilities

Recommendation: Strengthen data analytics and business intelligence capabilities to facilitate more sophisticated analyses of sales, discounts, customer behavior, and profitability. Investing in these areas can provide deeper insights, supporting more informed decision-making and strategy development.

7. Experimentation and Learning

Recommendation: Adopt a culture of experimentation where different sales and discount strategies are tested in controlled environments. Use A/B testing or other testing methodologies to evaluate the effectiveness of different approaches, learning from the outcomes to continuously refine strategies.

Conclusion

The analysis conducted using Linear Regression and Random Forest Regressor models on the Superstore dataset has provided valuable insights into the factors influencing profitability. Our findings highlight the significant impact of sales and discount strategies on profit margins, underscoring the importance of carefully managing these aspects to enhance profitability. While sales volume is crucial, the strategic application of discounts plays a pivotal role in optimizing profits. The minimal direct impact of customer segments on profitability suggests that while understanding customer preferences is important, the key levers for improving profitability lie in how sales and discounts are managed.

Recommendations for the business focus on optimizing discount strategies, enhancing sales through targeted marketing and customer engagement, re-evaluating the product and pricing mix, and investing in data analytics capabilities. By implementing these strategies, the business can expect to see improved profitability, better customer satisfaction, and a stronger competitive position in the market.

This analysis demonstrates the power of data analytics in uncovering actionable business insights and highlights the potential for strategic decision-making based on a deep understanding of data-driven trends and patterns.

.css-15w88e5{color:var(--chakra-colors-fg-neutral-primary);font-weight:inherit;letter-spacing:-0.09px;}Data Analytics Coursework 2

Introduction

Step 1: Importing the Data

Cleaning the Data

Checking for Missing Values and Data Type Corrections

Detecting and Addressing Outliers

Approach for handling Outliers

Data Cleaning

Calculating Summary Statistics

Numerical Fields Summary:

Plotting the Data

Insights from the Graphs

Distribution of Transactions Across Customer Segments (Bar Chart)

Relationship between Sales and Profit (Scatter Plot)

Choosing and Applying a Modelling Strategy

Mean Squared Error (MSE): 1642.2233

Root Mean Squared Error (RSME): 40.5245

R-Squared (R^2): 0.223

Using Random Forest Regressor for Cross-Validation

Preparing the Data

Model Performance Metrics:

Mean Squared Error (MSE): 470.7029

Root Mean Squared Error (RMSE): 21.6957

R-Squared (R^2): 0.7773

Features Importances

How to Proceed

Given the insights from the Random Forest model:

Findings and Recommendations

1. Optimize Discount Strategies

2. Focus on Sales Enhancement Tactics

3. Re-evaluate Product and Pricing Mix

4. Strategic Customer Engagement

5. Continuous Monitoring and Analysis

6. Invest in Data Analytics Capabilities

7. Experimentation and Learning

Conclusion

Data Analytics Coursework 2