C11BD - CW2_Riya Sinha

C11BD-CW2

Submitted by Riya Sinha, H00439236

https://deepnote.com/app/c11bd-1bf7/C11BD-CW2Riya-Sinha-e3c4ac3d-02d6-4048-8b3d-2341ea14dc31

Introduction

This analysis focuses on evaluating sales and profitability trends across various customer segments, product categories, and geographical regions, utilizing a dataset from a superstore. The methodology integrates descriptive statistics, exploratory data analysis, and machine learning models, namely Linear Regression and Random Forest, to predict profitability. Insights derived from this analysis are aimed at informing business strategies for enhancing profitability and identifying key drivers of sales performance.

Methodology

The methodology began with importing and cleaning the dataset, including handling missing values and outliers. Exploratory Data Analysis (EDA) was conducted to understand sales and profit trends, customer segments, product categories, and geographical performance through visualizations like box plots, line charts, and heatmaps. Two predictive models, Linear Regression and Random Forest, were then developed and evaluated based on Mean Squared Error (MSE) and R-squared (R²) to predict profitability. The Random Forest model showed a significant improvement over Linear Regression, indicating its better suitability for capturing the complexity of the dataset.

Importing the dataset

Let's start by loading the dataset and taking a quick look at its first few rows to understand its structure and identify the columns available.

# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset into a pandas DataFrame
df = pd.read_csv('dataset_Superstore.csv')
df

df.shape


# Displaying the first few rows to get an overview of the data structure and the types of data it contains
df.head()

The dataset contains various columns, including Row ID, Order ID, Order Date, Ship Date, Ship Mode, Customer ID, Customer Name, Segment, and several others related to product details, sales, and returns.

# Display a concise summary of the DataFrame
# It contains information like the total number of entries, the total number of columns,
# each column's data type, and the number of non-null values
# It also provides the DataFrame's memory usage, which helps to understand the dataset's size and structure
df.info()

Data Preparation and Cleaning

Here's a brief overview of the steps we'll take for data cleaning:

Identifying Null Values: We'll check for any missing values in the dataset.

Removing Duplicate Values: We'll look for and eliminate any duplicate rows.

Date-Time Formatting: We'll ensure that the Order Date and Ship Date columns are in the correct datetime format.

Summary Statistics Pre-Cleaning: We'll provide a summary of the dataset before cleaning.

Cleaning Process: We'll perform the cleaning steps identified.

Summary Statistics Post-Cleaning: We'll summarize the dataset after cleaning to show what has been changed or improved.

# Identify null values per column
null_values = df.isnull().sum()

# Flag duplicated rows (boolean Series, one entry per row)
duplicate_Val = df.duplicated()

# Count the duplicate rows
duplicate_rows = df.duplicated().sum()

# Check data types, especially for 'Order Date' and 'Ship Date'
data_types = df.dtypes

null_values, duplicate_Val, duplicate_rows, data_types

There are no null values in any of the columns, which means we don't need to perform any imputation for missing data.

No duplicate values or rows were found, so there's no need to remove any duplicates at this stage.

The Order Date and Ship Date columns are currently recognized as objects (strings) instead of datetime objects. We'll need to convert these to datetime format.

# Summary statistics of the dataset before cleaning
df.describe()

# Convert 'Order Date' and 'Ship Date' to datetime format
df['Order Date'] = pd.to_datetime(df['Order Date'], dayfirst=True)
df['Ship Date'] = pd.to_datetime(df['Ship Date'], dayfirst=True)

# Re-check data types to confirm conversion
data_types_after_conversion = df[['Order Date', 'Ship Date']].dtypes
data_types_after_conversion

The conversion of Order Date and Ship Date to datetime format was successful, and the dataset's data types for these columns are now datetime64[ns].
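The `dayfirst=True` argument matters whenever a date string is ambiguous between day/month and month/day order. As a small illustration, using a made-up date string rather than a value from the dataset, the sketch below parses the same text both ways:

```python
import pandas as pd

# Hypothetical date string, ambiguous between 3 April and 4 March
raw = pd.Series(['03/04/2015'])

day_first = pd.to_datetime(raw, dayfirst=True)     # read as day/month/year
month_first = pd.to_datetime(raw, dayfirst=False)  # read as month/day/year

print(day_first.iloc[0].date(), month_first.iloc[0].date())
```

If the source file stores dates as day/month/year, parsing without `dayfirst=True` would silently shift such dates to a different month, which in turn would distort any monthly aggregation.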

# Summary statistics post-cleaning to highlight changes
df.describe()

At this point, the dataset is cleaned with respect to null values, duplicates, and date formatting.

Exploratory Data Analysis

Box Plots

To check for outliers, we'll focus on the numerical columns: Sales, Quantity, Discount, and Profit. We can use box plots to identify outliers in these columns.

# Using box plots to visualize outliers for sales, quantity, discount, and profit
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Titles for plots
titles = ['Sales', 'Quantity', 'Discount', 'Profit']

# Data for plots
df_columns = [df['Sales'], df['Quantity'], df['Discount'], df['Profit']]

for ax, column, title in zip(axes.flatten(), df_columns, titles):
    sns.boxplot(ax=ax, x=column)
    ax.set_title(f'Box Plot of {title}')

plt.tight_layout()
plt.show()

The box plots for Sales, Quantity, Discount, and Profit show a significant number of outliers, especially in the Sales and Profit columns. These outliers are points that fall well outside the upper and lower whiskers of the box plots, indicating that they are significantly higher or lower than the majority of the data points.

Sales and Profit columns exhibit a wide range of outliers, suggesting some transactions are exceptionally high or low compared to the typical range of sales and profits.

Quantity and Discount columns also display outliers, but they are relatively fewer and closer to the upper whisker.

Interquartile Range (IQR)

After visualising the distributions, we use the Interquartile Range (IQR) method to identify outliers numerically, calculating the IQR for each of the specified variables: Sales, Quantity, Discount, and Profit (Tukey, 1977). Outliers are typically defined as data points that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles, respectively, and IQR = Q3 − Q1 (Hubert & Vandervieren, 2008).
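To make the rule concrete, here is a minimal worked sketch on a small made-up series (the values are hypothetical, not taken from the dataset). It computes the two fences and flags the points outside them:

```python
import pandas as pd

# Hypothetical values, chosen so that one point is clearly extreme
values = pd.Series([10, 12, 13, 15, 16, 18, 20, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, outliers.tolist())
```

With pandas' default linear interpolation the quartiles are 12.75 and 18.5, so the fences are 4.125 and 27.125 and only the value 100 is flagged.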

# Calculating IQR for Sales, Quantity, Discount, and Profit
iqr_data = {}
outliers_count = {}

for column in ['Sales', 'Quantity', 'Discount', 'Profit']:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Calculating the number of outliers
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    outliers_count[column] = len(outliers)

    # Storing IQR and bounds
    iqr_data[column] = {'IQR': IQR, 'Lower Bound': lower_bound, 'Upper Bound': upper_bound}

iqr_data, outliers_count

Removing Outliers

We will not remove outliers from 'Sales' and 'Profit', as doing so may result in the loss of key insights about the best-performing categories or the most profitable transactions. These outliers can also show what works best for the business and help us identify its greatest opportunities or challenges.

By only removing outliers from Discount and Quantity, we can clean data that supports accurate analysis and decision-making for typical business operations, while keeping the insightful data that extreme Sales and Profits values can provide.

# Calculating IQR, lower bound, and upper bound for 'Quantity' and 'Discount'
for column in ['Quantity', 'Discount']:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Filtering the dataset to remove outliers
    df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Display the shape of the dataset after removing outliers
df.shape

Sales and Profit Trends Chart

To visualize sales and profit trends over time, we will use line charts (Few, 2009). Line charts are ideal for this purpose because they clearly depict how values change over a period, allowing us to observe trends, patterns, and fluctuations in sales and profits across the dataset's timeframe (Tufte, 2001).

# Aggregate sales and profit by month
# (a list of columns is required here; selecting with a bare tuple fails in recent pandas)
monthly_data = df.resample('M', on='Order Date')[['Sales', 'Profit']].sum()

# Plotting
plt.figure(figsize=(12, 6))

# Sales trend
plt.plot(monthly_data.index, monthly_data['Sales'], label='Sales', color='blue')

# Profit trend
plt.plot(monthly_data.index, monthly_data['Profit'], label='Profit', color='red')

plt.title('Sales and Profit Trends Over Time')
plt.xlabel('Month')
plt.ylabel('Amount')
plt.legend()
plt.xticks(rotation=45)  # Rotate dates for better legibility
plt.tight_layout()
plt.show()

It is observed that Profits remain positive throughout the period, which is a good sign of overall business health.

The lack of a direct proportionate increase between sales and profit indicates that higher sales do not always translate into equivalently higher profits. This could be due to various factors such as higher costs, discounting strategies, or a sales mix skewed towards less profitable products.

There are a few sharp spikes in sales that do not have corresponding spikes in profit. For instance, the spikes in sales around the start of 2015 and in mid-2016 do not result in a noticeable increase in profit. This warrants an investigation into sales activities or promotions that drive volume without enhancing profitability.
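One way to follow up on this observation, which is not part of the analysis above, would be to track the monthly profit margin (profit divided by sales) alongside the raw totals. The sketch below uses a few made-up orders whose column names mirror the dataset:

```python
import pandas as pd

# Hypothetical orders; column names mirror the dataset
orders = pd.DataFrame({
    'Order Date': pd.to_datetime(['2015-01-05', '2015-01-20',
                                  '2015-02-10', '2015-02-25']),
    'Sales': [1000.0, 3000.0, 500.0, 500.0],
    'Profit': [100.0, 100.0, 150.0, 50.0],
})

# Sum sales and profit per calendar month, then derive the margin
month = orders['Order Date'].dt.to_period('M')
monthly = orders.groupby(month)[['Sales', 'Profit']].sum()
monthly['Margin'] = monthly['Profit'] / monthly['Sales']
print(monthly)
```

In this toy data January has four times the sales of February but the same profit, so its margin is 5% against 20%. A month with a sales spike and a low margin would be a natural candidate for the investigation suggested above.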

Profitability Analysis

1. Average Profit by Category - To compare the average profit by product category.

# Group by 'Category' and calculate the average profit for each category
average_profit_by_category = df.groupby('Category')['Profit'].mean().reset_index()
average_profit_by_category

sns.set(style="whitegrid")

# Create a bar chart
plt.figure(figsize=(10, 6))
barplot = sns.barplot(x='Category', y='Profit', data=average_profit_by_category)

# Add labels to the top of the bars
for p in barplot.patches:
    barplot.annotate(format(p.get_height(), '.2f'),
                     (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha='center', va='center',
                     xytext=(0, 9),
                     textcoords='offset points')

# Add a descriptive title
# (no legend call: the bars carry no labelled artists, and the x-axis already names each category)
plt.title('Average Profit by Category')

# Include the axis labels
plt.ylabel('Average Profit ($)')
plt.xlabel('Category')
plt.show()

Technology is the most profitable category, with an average profit of $89.33, suggesting that allocating more resources to it, such as inventory, marketing, or product expansion, could be beneficial.

Office Supplies, with an average profit of $30.27, ranks second, over double Furniture's profit but far less than Technology. Exploring the product mix for higher-margin items could boost profitability.

Furniture has the lowest profit at $13.81, indicating a need to review pricing, costs, sales strategies, and competition.

2. Total Profit by Sub-Category:

# Group by 'Sub-Category' and calculate the total profit for each sub-category
total_profit_by_subcategory = df.groupby('Sub-Category')['Profit'].sum().sort_values(ascending=False).reset_index()
total_profit_by_subcategory

# Plotting the total profit by sub-category as a vertical bar chart
plt.figure(figsize=(15, 10))
barplot = sns.barplot(x='Sub-Category', y='Profit', data=total_profit_by_subcategory, color="skyblue")

# Add labels to each bar with the profit value
for p in barplot.patches:
    height = p.get_height()
    # Adjust the position of the annotation based on the value (positive or negative)
    if height > 0:
        # For positive values, place label above the bar
        vertical_position = height
        vertical_alignment = 'bottom'
        xytext = (0, 3)
    else:
        # For negative values, place label below the top of the bar
        vertical_position = p.get_y()
        vertical_alignment = 'top'
        xytext = (0, -12)

    barplot.annotate(f'${height:,.2f}',
                     (p.get_x() + p.get_width() / 2., vertical_position),
                     ha='center', va=vertical_alignment,
                     fontsize=10,
                     xytext=xytext,
                     textcoords='offset points')

# Set plot title and labels
plt.title('Total Profit by Sub-Category')
plt.xlabel('Sub-Category')
plt.ylabel('Total Profit ($)')

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

The sub-categories 'Binders', 'Copiers', and 'Phones' are the top performers in terms of profit generation. 'Binders' lead with a significant margin, followed by 'Copiers' and 'Phones', which suggests that these areas are the most profitable for the business.

Several sub-categories, such as 'Envelopes', 'Art', 'Labels', 'Fasteners', 'Bookcases', 'Supplies', and notably 'Tables', have low profitability. 'Tables' is the only one of these that incurs a loss, with a total profit of -$15,822.

The profitability gap among sub-categories suggests differences in pricing, costs, demand, and competition. Addressing the loss in 'Tables' is crucial for enhancing overall profitability.

Customer Segment Analysis

1. A pie chart represents the proportion of sales coming from each customer segment.

2. A bar chart identifies which customers contribute the most to sales.

# Group by 'Segment' and calculate the total sales for each segment
total_sales_by_segment = df.groupby('Segment')['Sales'].sum().reset_index()
total_sales_by_segment

# Plotting the total sales by customer segment as a pie chart
plt.figure(figsize=(10, 8))
plt.pie(total_sales_by_segment['Sales'],
        labels=total_sales_by_segment['Segment'],
        autopct='%1.1f%%',
        startangle=140,
        colors=['lightblue', 'lightgreen', 'lightcoral'])

# Add total sales in dollars to the center of the pie
total_sales = total_sales_by_segment['Sales'].sum()
plt.text(0, 0, f'Total Sales\n${total_sales:,.2f}', ha='center', va='center', fontsize=12)

# Set the chart title
plt.title('Total Sales by Customer Segment')

# Display the legend
plt.legend(title='Segments')

# Show the plot
plt.tight_layout()
plt.show()

The Consumer segment accounts for the majority of sales with 50.1%, Corporate comes next with 31.2%, and Home Office has the smallest share at 18.7%. The overall total sales amount to $2,131,828.31. This distribution indicates that marketing and sales efforts might be most effectively targeted at the Consumer segment, given its dominance in sales contribution.

# Group by 'Customer Name' and calculate the total sales for each customer
total_sales_by_customer = df.groupby('Customer Name')['Sales'].sum().sort_values(ascending=False).reset_index()
total_sales_by_customer

# Since the number of customers can be large, for visualization purposes,
# we'll display the top 10 customers by sales
top_customers = total_sales_by_customer.head(10)

# Create a bar chart
plt.figure(figsize=(12, 8))
barplot = sns.barplot(x='Sales', y='Customer Name', data=top_customers, palette='Set3')

# Add labels on each bar
for p in barplot.patches:
    plt.text(p.get_width(), p.get_y() + p.get_height() / 2,
             f'${p.get_width():,.2f}',
             va='center')

# Set the title and labels
plt.title('Total Sales by Customer (Top 10)')
plt.xlabel('Total Sales ($)')
plt.ylabel('Customer Name')
plt.tight_layout()
plt.show()

The bar chart showcases the top 10 customers by total sales, with Sean Miller generating the highest sales at $24,943.46, and Todd Sumrall the least within this group at $11,681.77. The sales values are closely grouped, indicating a relatively consistent spending pattern among these top customers.

Geographical Analysis

1. Average Profit by Region: This part of the analysis would provide insights into which regions are the most and least profitable.

2. Average Profit by City: Similar to the regional analysis, examining average profit by city would highlight specific urban markets where the business is performing well versus those where profitability is lagging.

By comparing average profits across regions and cities, the business can identify patterns and potential causes behind regional performance differences.

# Calculate average profit by region
average_profit_by_region = df.groupby('Region')['Profit'].mean().reset_index()

# Calculate average profit by city
average_profit_by_city = df.groupby('City')['Profit'].mean().reset_index()

# Since cities are numerous, we'll consider sorting the results and viewing the top ones separately
average_profit_by_city = average_profit_by_city.sort_values('Profit', ascending=False)

# Output the results
print(average_profit_by_region)
print(average_profit_by_city.head())  # Display the top cities for brevity

# Plot the bar chart for average profit by region
plt.figure(figsize=(10, 6))
region_barplot = sns.barplot(x='Region', y='Profit', data=average_profit_by_region, palette='Set2')

# Annotate each bar with the value
for p in region_barplot.patches:
    region_barplot.annotate(format(p.get_height(), '.2f'),
                            (p.get_x() + p.get_width() / 2., p.get_height()),
                            ha='center', va='center',
                            xytext=(0, 9),
                            textcoords='offset points')

plt.title('Average Profit by Region')
plt.xlabel('Region')
plt.ylabel('Average Profit ($)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Average profit is relatively consistent across regions, with the East leading slightly at $40.59. The South has the lowest average profit at $36.85, which suggests room for improvement or different market conditions.

# Take the top 10 cities for the bar chart
top_cities = average_profit_by_city.head(10)

# Plot the bar chart for average profit by city
plt.figure(figsize=(12, 8))
city_barplot = sns.barplot(x='Profit', y='City', data=top_cities, palette='Set3')

# Annotate each bar with the value
for p in city_barplot.patches:
    city_barplot.annotate(format(p.get_width(), '.2f'),
                          (p.get_width(), p.get_y() + p.get_height() / 2.),
                          ha='left', va='center',
                          xytext=(5, 0),
                          textcoords='offset points')

plt.title('Average Profit by City (Top 10)')
plt.xlabel('Average Profit ($)')
plt.ylabel('City')
plt.tight_layout()
plt.show()

Among cities, Jamestown stands out with a significantly higher average profit of $642.89, which is remarkably greater than the others. Independence and Lafayette also show substantial average profits. These cities could be key strategic areas due to their high profitability. Minneapolis and Appleton follow with strong average profits as well, indicating they are also important markets for the business.
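A city-level average can be dominated by a single large order when the city has very few transactions. Whether that applies to Jamestown is not checked in this analysis. As a hedged sketch of how one might check, the example below uses made-up orders and a hypothetical `min_orders` threshold to report the order count next to the mean and to rank only cities with enough orders:

```python
import pandas as pd

# Hypothetical orders: city 'A' has a single very profitable order
orders = pd.DataFrame({
    'City': ['A', 'B', 'B', 'B', 'C', 'C'],
    'Profit': [600.0, 40.0, 60.0, 50.0, 20.0, 30.0],
})

# Mean profit and number of orders per city
city_stats = orders.groupby('City')['Profit'].agg(['mean', 'count'])

# Hypothetical threshold: only rank cities with at least this many orders
min_orders = 2
ranked = (city_stats[city_stats['count'] >= min_orders]
          .sort_values('mean', ascending=False))
print(ranked)
```

Here city 'A' has the highest mean but rests on one order, so it drops out of the ranking once the threshold is applied. The right threshold for the real data is a judgment call.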

Category-Geography Heatmap - It depicts average sales by product category across different regions.

# Creating a pivot table for the heatmap
pivot_table = df.pivot_table(
    index='Region',
    columns='Category',
    values='Sales',   # or 'Profit'
    aggfunc='mean'    # or 'sum' for total sales/profit
)

# Creating the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(pivot_table, annot=True, fmt=".1f", cmap='YlGnBu', linewidths=.5)

# Adding titles and labels as necessary
plt.title('Heatmap of Average Sales by Category and Region')
plt.xlabel('Product Category')
plt.ylabel('Region')
plt.show()

The strong sales in the Central region, particularly in Furniture and Technology, may warrant allocating more resources there to capitalize on the high demand.

The lower average sales in the South and West regions present an opportunity for market development strategies to boost performance in these areas.

The high average sales for Technology across regions indicate it is a strong category and may benefit from additional product development, inventory investment, and targeted marketing.

The varied performance across regions suggests the need for tailored regional strategies, considering the unique demand and competition in each area.

Discount Impact Analysis

To analyze the impact of discount levels on sales and profitability, we'll use scatter plots (Wilkinson, 2005). Scatter plots are ideal for this analysis as they visually depict the relationship between two variables, in this case discount levels and either sales or profits. We'll create two scatter plots:

Discount vs. Sales: To understand how different discount levels affect sales volumes.

Discount vs. Profit: To see the impact of discount levels on profitability.

# Setting up the visualization
plt.figure(figsize=(14, 6))

# Discount vs. Sales Scatter Plot
plt.subplot(1, 2, 1)
sns.scatterplot(x='Discount', y='Sales', data=df, alpha=0.5)
plt.title('Discount vs. Sales')
plt.xlabel('Discount')
plt.ylabel('Sales')

# Discount vs. Profit Scatter Plot
plt.subplot(1, 2, 2)
sns.scatterplot(x='Discount', y='Profit', data=df, alpha=0.5)
plt.title('Discount vs. Profit')
plt.xlabel('Discount')
plt.ylabel('Profit')

plt.tight_layout()
plt.show()

Discount vs. Sales: We see a wide distribution of sales values across all discount levels, with no clear upward or downward trend as the discount increases. This suggests that discounts do not have a straightforward effect on increasing sales. High sales values are seen across various discount levels, including lower discounts. Notably, there aren't many instances of very high sales at the highest discount levels (around 0.5), which could imply that larger discounts do not necessarily lead to proportionately larger sales volumes.

Discount vs. Profit: The occurrence of negative profit (losses) seems to be more frequent and more severe as the discount approaches 0.5. This indicates that higher discounts are associated with lower, often negative, profits. The presence of many data points at lower discounts with positive profits suggests that moderate discounting can be sustainable, but there is a threshold beyond which discounts are likely damaging profitability.

The business may benefit from finding the optimal discount rate that maximizes sales while maintaining a healthy profit margin.
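A simple starting point for locating such a threshold, not carried out in this analysis, would be to group orders into discount bands and compare total profit and margin per band. The sketch below uses made-up orders, and the bin edges and labels are hypothetical choices:

```python
import pandas as pd

# Hypothetical orders; column names mirror the dataset
orders = pd.DataFrame({
    'Discount': [0.0, 0.0, 0.1, 0.2, 0.2, 0.3, 0.4, 0.5],
    'Sales': [200.0, 300.0, 250.0, 400.0, 100.0, 150.0, 300.0, 200.0],
    'Profit': [60.0, 90.0, 50.0, 40.0, 10.0, -15.0, -90.0, -100.0],
})

# Hypothetical bin edges; intervals are closed on the right
bins = [-0.01, 0.1, 0.3, 0.5]
labels = ['0-10%', '10-30%', '30-50%']
orders['Discount Band'] = pd.cut(orders['Discount'], bins=bins, labels=labels)

# Total sales and profit per band, plus the resulting margin
summary = orders.groupby('Discount Band', observed=True)[['Sales', 'Profit']].sum()
summary['Margin'] = summary['Profit'] / summary['Sales']
print(summary)
```

In this toy data the margin shrinks from the first band to the second and turns negative in the third. On the real dataset the band where the margin crosses zero would indicate roughly where discounting stops paying for itself.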

Modelling Strategy

Linear Regression

We'll use a linear regression model as our starting point to predict Profit based on the identified variables.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Selecting features and target variable
features = ['Sales', 'Discount', 'Segment', 'Category', 'Sub-Category']
X = df[features]
y = df['Profit']

# Encoding categorical variables
categorical_features = ['Segment', 'Category', 'Sub-Category']
one_hot = OneHotEncoder()
preprocessor = ColumnTransformer(
    transformers=[('cat', one_hot, categorical_features)],
    remainder='passthrough')

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and training the Linear Regression model within a pipeline
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', LinearRegression())])
model.fit(X_train, y_train)

# Predicting on the test set
y_pred = model.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mse, r2

The initial regression model, designed to predict profitability based on sales, discount, segment, and product categories, has been evaluated and shows the following performance metrics on the test set:

Mean Squared Error (MSE): 21,653.28

R-squared (R²): 0.63

The MSE is the average squared difference between the actual and predicted values, indicating the magnitude of the prediction error (Willmott & Matsuura, 2005). A lower MSE indicates a better fit to the data. In our case, an MSE of 21,653.28 corresponds to a root mean squared error (RMSE) of about $147, so the model's predictions deviate from the actual profits by a relatively large margin.

R² provides a measure of how well the variability of the dependent variable is explained by the model. An R² of 0.63 means that approximately 63% of the variability in the dependent variable can be explained by the model. This suggests that the model has a moderate level of predictive power.
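To show how these two metrics are obtained, the sketch below computes them by hand with NumPy on a few made-up actual and predicted profits (the numbers are hypothetical). The formulas are the standard definitions, which is also what `mean_squared_error` and `r2_score` compute:

```python
import numpy as np

# Hypothetical actual and predicted profits
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)   # average squared error
rmse = np.sqrt(mse)             # same units as the target

ss_res = np.sum(residuals ** 2)                  # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
```

For these values the squared residuals sum to 26, giving an MSE of 6.5, and the total variation is 500, giving an R² of 0.948. R² is therefore the share of variation around the mean that the predictions account for.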

While this is a positive indication that some relationship has been captured, the moderate R² value also suggests there is room for improvement in model performance. Thus, we will explore more complex models beyond linear regression, such as random forests, which might capture non-linear relationships more effectively (Breiman, 2001).

Random Forest Regressor

We will now implement and evaluate a Random Forest model using our dataset.

from sklearn.ensemble import RandomForestRegressor

# Creating and training the Random Forest model within a pipeline
rf_model = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))])
rf_model.fit(X_train, y_train)

# Predicting on the test set
y_pred_rf = rf_model.predict(X_test)

# Evaluating the model
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
mse_rf, r2_rf

The Random Forest model shows significantly improved performance metrics on the test set compared to the initial linear regression model:

Mean Squared Error (MSE): 7,999.03

R-squared (R²): 0.86

These results suggest a substantial improvement in model performance. The MSE falls from 21,653.28 to approximately 7,999.03, a reduction of about 63%, and the corresponding RMSE is roughly $89. The lower prediction error indicates that the Random Forest model captures the relationships between the variables and profitability more effectively than the linear model.

An R² value of 0.86 is particularly noteworthy. It indicates that approximately 86% of the variance in profitability can be explained by the model's inputs, showcasing a strong predictive power of the Random Forest model on this dataset.
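The R² reported here comes from a single train/test split. One way to check how stable such a score is, which is not done in this analysis, is k-fold cross-validation. The sketch below runs it on synthetic, made-up data with a non-linear relationship. The data-generating formula and the choice of 5 folds and 50 trees are assumptions for illustration only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic, made-up data with a non-linear relationship plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 2))
y = 100 * X[:, 0] * (1 - 2 * X[:, 1]) + rng.normal(0, 5, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=42)

# 5-fold cross-validated R^2: each fold is held out once for scoring
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(scores.mean(), scores.std())
```

A small spread across folds would suggest that the score does not depend heavily on which rows happened to land in the test set. For the real data, the full pipeline including the preprocessor could be passed to `cross_val_score` in place of the bare regressor.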

Interpretation

To make informed predictions about the profitability of different products and customer segments using the Random Forest model, the following steps were followed:

A few hypothetical scenarios were chosen for which to predict profitability:

1. Consumer Segment, Office Supplies, Sales: $500, Discount: 10%

2. Consumer Segment, Technology, Sales: $1000, Discount: 5%

3. Consumer Segment, Furniture, Sales: $750, Discount: 15%

In future, such scenarios could represent new products under consideration, existing products in new segments, or hypothetical cases for understanding potential profitability impacts.

For each scenario, a data input was created that matches the model's expected format. Because the one-hot encoding of categorical variables such as Segment, Category, and Sub-Category is part of the fitted pipeline, the input only needs the same raw columns that were used in training.

The model was then applied to these inputs to predict the profitability for each scenario.

# Define the scenarios as a DataFrame
# (each sub-category is chosen so that it belongs to its category)
scenarios_df = pd.DataFrame({
    'Segment': ['Consumer', 'Consumer', 'Consumer'],
    'Category': ['Office Supplies', 'Technology', 'Furniture'],
    'Sub-Category': ['Binders', 'Copiers', 'Bookcases'],
    'Sales': [500, 1000, 750],
    'Discount': [0.10, 0.05, 0.15]
})

# Predicting profitability for the defined scenarios,
# passing the columns in the same order used for training
predicted_profit = rf_model.predict(scenarios_df[features])

# Adding predictions to the scenarios DataFrame
scenarios_df['Predicted_Profit'] = predicted_profit
scenarios_df
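Scenario inputs may contain a category value that never appeared in training. With scikit-learn's default `OneHotEncoder` settings this raises an error at prediction time. As a hedged alternative, not the configuration used above, the sketch below sets `handle_unknown='ignore'` on a small made-up pipeline so that unseen categories are encoded as all zeros:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data; column names mirror the dataset
train = pd.DataFrame({
    'Sales': [100.0, 200.0, 300.0, 400.0],
    'Category': ['Furniture', 'Technology', 'Furniture', 'Technology'],
})
profit = [5.0, 40.0, 15.0, 80.0]

# handle_unknown='ignore' encodes unseen categories as all zeros
# instead of raising an error at prediction time
preprocessor = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), ['Category'])],
    remainder='passthrough')
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('regressor', LinearRegression())])
pipe.fit(train, profit)

# Raw scenario rows, including a category not seen in training
scenarios = pd.DataFrame({
    'Sales': [250.0, 250.0],
    'Category': ['Technology', 'Office Supplies'],
})
predictions = pipe.predict(scenarios)
print(predictions)
```

A prediction for an unseen category should be read with caution, since the model has no information about it and falls back on the remaining features.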

Next Steps

Review the Results: Pay attention to the predicted profitability values, especially extremes (very high or very low values) that might indicate particularly profitable or unprofitable products or segments.

Analysis and Insights: Derive insights from these predictions. Are there any patterns or trends in the data? Which customer segments or product categories appear most profitable? Are there any surprises or unexpected results that could lead to new questions or investigations?

Business Strategy: Consider how these insights can influence your business strategy. This might involve adjusting product offerings, reevaluating pricing strategies, focusing marketing efforts on certain segments, or even exploring new market opportunities.

Communicate Findings: Share the insights with relevant stakeholders in your organization. Use the data to support your recommendations for strategic changes or further investigations.

Conclusion

The analysis revealed the Random Forest model as a superior predictor of profitability, achieving an R² of 0.86, suggesting a strong ability to explain the variance in profitability. The exploration highlighted the impact of customer segments, product categories, and discounts on sales and profits. For instance, technology products emerged as highly profitable across regions, and discounts were found to have a nuanced impact on profitability. These insights are crucial for strategic decision-making, guiding the superstore in optimizing product offerings, marketing strategies, and discounting practices to enhance overall profitability.

References

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Few, S. (2009). Now you see it: Simple visualization techniques for quantitative analysis. Analytics Press.

Hubert, M., & Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis, 52(12), 5186-5201.

Tufte, E. R. (2001). The visual display of quantitative information (2nd ed.). Cheshire, CT: Graphics Press.

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Wilkinson, L. (2005). The grammar of graphics. Springer Science & Business Media.

Willmott, C. J., & Matsuura, K. (2005). Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research, 30(1), 79-82.
