C11BD - CW2 - H00456744_Muhammad Ahsan Iqbal

1. Introduction

Big data and data analysis are essential to the retail industry because they offer priceless insights on the behavior, tastes, and buying patterns of customers. Retailers may effectively cater their marketing strategies and product offers to the specific needs of their target market by utilizing these insights. This will improve the entire shopping experience and increase customer satisfaction. Retailers may also save costs and increase efficiency by using data analysis to improve inventory management, predict demand variations, and simplify supply chain processes. Retailers may also recognize new trends, foresee changes in the industry, and maintain an advantage over rivals in a marketplace that is becoming more and more competitive by utilizing data. In general, data-driven decision-making has become essential for retail companies looking to spur expansion, raise earnings, and sustain growth.

The dataset provided contains 9,994 records and 29 attributes covering the retailer's products, customers, and sales. The main purpose of the analysis is to use this dataset to identify the factors that have the biggest effect on the firm's profitability. This entails loading the dataset, cleaning and processing the data, calculating summary statistics, visualizing the data, and selecting an appropriate modeling method. The analysis is carried out in Python on the Deepnote platform, and the accompanying documentation presents the methodology, results, and interpretations.

The report is organized as follows: the first section discusses exploratory data analysis and data preparation, with a focus on managing data quality concerns. The second section describes the modeling approach that was selected and how it was used. The third section presents the modeling strategy's analysis and findings. The last section discusses the outcomes in light of the company's needs and potential limitations. By conducting this data analysis, the retail organization will gain valuable insights into its business operations and identify potential areas for development. Gaining more insight into the factors influencing its profitability will enable the company to make informed decisions and choose the best strategies, which in turn should support growth and profit.

2) Importing the data

import pandas as pd  # data loading and analysis
import numpy as np  # numerical and scientific computing
import matplotlib.pyplot as plt  # data visualization
import seaborn as sns  # statistical data visualization built on matplotlib
from sklearn.model_selection import train_test_split  # split data into training and test sets
from sklearn.linear_model import LinearRegression  # ordinary least squares linear regression
from sklearn.metrics import mean_squared_error  # error metric for regression models

# Use read_csv() to load the provided CSV file dataset_Superstore.csv into a DataFrame
dataset_store = pd.read_csv("dataset_Superstore.csv")

# head() displays the first five rows of the dataset
dataset_store.head()

# info() prints the column names, dtypes, and non-null counts for the whole dataset.
# It prints directly and returns None, so it is not wrapped in print().
dataset_store.info()

We can see from the output above that there are 29 variables (columns) and 9,994 observations (rows), with no missing values.

We use the .rename() function to convert all column names to lowercase for ease of use. Spaces and dashes in the column names are also replaced with underscores.

2.1) Renaming the columns

# Use .rename() to change the column names
dataset_store = dataset_store.rename(columns={
    'Row ID': 'row_id',
    'Order ID': 'order_id',
    'Order Date': 'order_date',
    'Ship Date': 'ship_date',
    'Ship Mode': 'ship_mode',
    'Customer ID': 'customer_id',
    'Customer Name': 'customer_name',
    'Customer_no': 'customer_no',
    'Segment': 'segment',
    'Segment_no': 'segment_no',
    'Country': 'country',
    'City': 'city',
    'State': 'state',
    'State_no': 'state_no',
    'Postal Code': 'postal_code',
    'Region': 'region',
    'Region_no': 'region_no',
    'Product ID': 'product_id',
    'Category': 'category',
    'Category_no': 'category_no',
    'Sub-Category': 'sub_category',
    'Sub-Category_no': 'sub_category_no',
    'Product Name': 'product_name',
    'Product Name_no': 'product_name_no',
    'Sales': 'sales',
    'Quantity': 'quantity',
    'Discount': 'discount',
    'Profit': 'profit',
    'Returned': 'returned',
})
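
The explicit mapping above documents every new name. As an alternative, the same naming rule (lowercase, with spaces and dashes turned into underscores) could be applied programmatically. The sketch below is illustrative only: the helper to_snake_case and the toy frame demo_columns are hypothetical names that are not part of the original notebook, and the handful of headers is just a sample in the style of the Superstore file.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase a column label and replace runs of spaces/dashes with one underscore."""
    cleaned = re.sub(r"[\s\-]+", "_", name.strip())
    return cleaned.lower()


# Hypothetical empty frame carrying a few headers in the style of the Superstore file
demo_columns = pd.DataFrame(columns=["Row ID", "Sub-Category", "Product Name_no", "Sales"])

# rename() accepts a function and applies it to every column label
demo_columns = demo_columns.rename(columns=to_snake_case)
print(list(demo_columns.columns))
```

A function-based rename is shorter, but the explicit dictionary makes the final names easier to audit, so either choice is defensible.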

3) Cleaning Data

The data has to be cleaned in the ways listed below in order to improve the data set's quality and accuracy.

3.1 Find the missing values in columns

# isnull() flags missing values
# sum() then counts the missing values in every column
print("Missing values distribution: ")
print(dataset_store.isnull().sum())
print("")

We can see that there are no missing values in any column. Had there been any, they would need to be handled (removed or imputed), because missing values can lead to incorrect findings and, in turn, to wrong and biased analysis.

3.2 Find duplicates and remove in dataset

Identifying and removing duplicate data in Python is crucial for several reasons. Firstly, duplicate data can skew analysis and modeling results, leading to inaccurate insights and decisions. By removing duplicates, we ensure that our analysis is based on a clean and representative dataset. Secondly, duplicate data can inflate storage requirements and processing time, especially when dealing with large datasets. Removing duplicates helps optimize storage space and improves the efficiency of data processing tasks. Additionally, duplicate records can introduce inconsistencies and errors in databases, leading to data quality issues and affecting the reliability of downstream processes. Therefore, by identifying and removing duplicate data, we enhance the overall quality, integrity, and usability of the dataset, facilitating more reliable analysis and decision-making.

Furthermore, "Missing data can introduce bias and reduce the efficiency of statistical analyses. It is essential to handle missing values appropriately to ensure the validity and reliability of study results" (Little & Rubin, 2019).

# Identify fully duplicated rows with the .duplicated() method
duplicate_data = dataset_store.duplicated()
# sum() counts the rows flagged as duplicates
number_duplicates = duplicate_data.sum()
print(f"count of the duplicates: {number_duplicates}")
# Drop the duplicates and assign the result back, so that the later cleaning
# steps (which start from dataset_store) work on the de-duplicated data
dataset_store = dataset_store.drop_duplicates()
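
Before dropping rows it can help to look at the duplicated records themselves. The following is a minimal sketch on a made-up five-row frame (the frame orders and its values are invented for illustration, not taken from the Superstore data); it shows how keep=False flags every member of a duplicate group, whereas the default flags only the repeats.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical, rows 3 and 4 only share an order_id
orders = pd.DataFrame({
    "order_id": ["A1", "A1", "A2", "A3", "A3"],
    "product_id": ["P1", "P1", "P2", "P3", "P4"],
    "sales": [10.0, 10.0, 25.0, 5.0, 7.5],
})

# keep=False marks all copies of a duplicated row, which is useful for inspection
dupe_rows = orders[orders.duplicated(keep=False)]
print(dupe_rows)

# drop_duplicates() keeps the first copy of each fully identical row
deduped = orders.drop_duplicates().reset_index(drop=True)
print(f"rows before: {len(orders)}, rows after: {len(deduped)}")
```

Only rows that match on every column are treated as duplicates here; two lines of the same order with different products are kept.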

3.3 Removing unnecessary data

In Python, removing unnecessary data columns is essential for several reasons. Firstly, unnecessary columns contribute to increased storage space and processing time, particularly with large datasets. By eliminating these columns, we optimize memory usage and enhance computational efficiency. Secondly, removing unnecessary columns simplifies the dataset, making it easier to understand and work with during data analysis and modeling tasks. This streamlining process improves data clarity and facilitates better insights extraction. Moreover, unnecessary columns can introduce noise and complexity into analyses, potentially leading to misleading conclusions. By removing them, we focus on the most relevant features, enhancing the accuracy and interpretability of our results. Overall, removing unnecessary data columns in Python streamlines data processing, improves analysis efficiency, and ensures that insights are based on the most pertinent information.

For this purpose, the columns customer_no, segment_no, state_no, region_no, category_no, sub_category_no, and product_name_no are numeric encodings of other main columns and can be dropped for this analysis.

# Use .drop() to remove the redundant encoded columns
dataset = dataset_store.drop(
    ['customer_no', 'segment_no', 'state_no', 'region_no',
     'category_no', 'sub_category_no', 'product_name_no'],
    axis=1,
)
print('Shape:', dataset.shape)

3.4) Find records with negative profit

# Count the rows where profit is negative but the item was not returned
negative_profit_with_no_return = dataset_store[
    (dataset_store['profit'] < 0) & (dataset_store['returned'] == False)
].shape[0]
print(f"Rows with negative profit but no return: {negative_profit_with_no_return}")

Although it may seem strange to have negative profits even when the product is not returned, this can happen when expenses exceed revenue in a given transaction, for example because of discounts or a cost price that is greater than the sale price. Therefore, we do not classify these entries as errors in the dataset, because they could be valid.

3.5) Outlier Identification

Identifying outliers is important for a number of reasons. Data points that deviate greatly from the rest of the dataset are known as outliers, and they can distort machine learning and statistical models. Locating them helps protect the quality and integrity of our models and analysis. Errors in measurement, incorrect data entry, or genuine anomalies in the data are some of the causes of outliers. By identifying and managing outliers properly, we can limit their adverse effect on our data and conclusions. Furthermore, in many real-world situations outliers carry useful information that deserves more research. Once outliers have been identified, we can therefore make an informed decision on whether to delete, alter, or examine them further, depending on the particular context and goal.

# describe() shows summary statistics for the numerical columns
dataset_store.describe()

The Interquartile Range (IQR) method is commonly used to identify and remove outliers in datasets due to its robustness and simplicity. "The Interquartile Range (IQR) method is robust against extreme values and is widely used for outlier detection and removal due to its ability to capture the variability within the data distribution while being less sensitive to outliers compared to other methods." (Tukey, 1977). Outliers, which are data points that significantly deviate from the rest of the dataset, can distort statistical analyses and machine learning models, leading to biased results and decreased model performance. The IQR method provides a reliable measure of the spread of the data by quantifying the range between the first quartile (Q1) and the third quartile (Q3). By calculating the IQR and defining the outlier boundaries as a multiple of the IQR away from the quartiles, typically 1.5 times the IQR, it becomes possible to systematically identify potential outliers in the dataset. This method is less sensitive to extreme values and is particularly useful for datasets with skewed distributions, where traditional measures like mean and standard deviation may be influenced by outliers. Removing outliers based on the IQR helps improve the integrity of the dataset, ensuring that subsequent analyses and modeling efforts are more accurate and reliable. Additionally, the IQR method is intuitive and easy to implement, making it a practical choice for outlier detection and data preprocessing tasks.
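
To make the rule concrete, here is a small worked example on nine made-up numbers (the series values is invented for illustration and is not taken from the Superstore data). It assumes pandas' default linear interpolation for quantiles and the usual multiplier of 1.5.

```python
import pandas as pd

# Hypothetical sample with one obviously extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = values.quantile(0.25)  # first quartile: 3.0
q3 = values.quantile(0.75)  # third quartile: 7.0
iqr = q3 - q1               # interquartile range: 4.0

lower_bound = q1 - 1.5 * iqr  # 3.0 - 6.0 = -3.0
upper_bound = q3 + 1.5 * iqr  # 7.0 + 6.0 = 13.0

# Anything outside [lower_bound, upper_bound] is flagged as an outlier
flagged = values[(values < lower_bound) | (values > upper_bound)]
print(f"bounds: [{lower_bound}, {upper_bound}]")
print(f"flagged values: {flagged.tolist()}")
```

Only the value 100 falls outside the interval from -3 to 13, so it is the single point flagged in this sample.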

# Identify outliers by using the IQR method
def detect_outliers_iqr(dataframe, target_columns):
    """Find outliers in specified columns of a dataset using the Interquartile Range (IQR) method.

    dataframe (DataFrame): The DataFrame containing the data.
    target_columns (list): The names of the columns to check for outliers.
    Returns:
    outlier_df (DataFrame): A boolean DataFrame indicating outliers in the specified columns.
    """
    outlier_df = pd.DataFrame(index=dataframe.index)
    for col in target_columns:
        first_quartile = dataframe[col].quantile(0.25)
        third_quartile = dataframe[col].quantile(0.75)
        iqr = third_quartile - first_quartile
        lower_bound = first_quartile - 1.5 * iqr
        upper_bound = third_quartile + 1.5 * iqr
        outlier_df[col] = (dataframe[col] < lower_bound) | (dataframe[col] > upper_bound)
    return outlier_df

# Identifying outliers using the interquartile range method
target_cols = ['quantity', 'sales', 'discount', 'profit']
outliers_df = detect_outliers_iqr(dataset, target_cols)

# Print total outliers
for col in target_cols:
    print(f"Count of outliers in '{col}' column: {outliers_df[col].sum()}")

3.6) Remove outliers

# Remove every row that is flagged as an outlier in at least one of the checked columns
dataset_after_outliers = dataset[~outliers_df.any(axis=1)]
# Show the count of rows that have been taken out of the dataset
print(f"Count of outliers taken out: {len(dataset) - len(dataset_after_outliers)}")
# Show the new shape of the DataFrame after removing outliers
print(f"Shape of Dataset after taking out outliers: {dataset_after_outliers.shape}")

dataset_after_outliers.head()

4. Summary statistics of the cleaned data

For a comprehensive analysis, we can include various types of variables in the summary statistics. Here are some common types of variables to consider:

Numeric variables:

Continuous variables: These include measurements like age, income, and temperature.

Discrete variables: These are numeric variables that take countable values, such as counts of items or scores on a scale.

Categorical variables:

Nominal variables: These represent categories without any inherent order, such as gender, ethnicity, or occupation.

Ordinal variables: These represent categories with a natural order, such as education level.

4.1) Numerical values in dataset

numerical_column = ['quantity', 'sales', 'discount', 'profit']
numerical_summary_of_data = dataset_after_outliers[numerical_column].describe()
print(numerical_summary_of_data)

These summary statistics provide information about the variables sales, quantity, discount, and profit:

Count: Indicates the number of non-null values for each variable. For example, there are 7140 non-null values for each variable, indicating that there are no missing values in the dataset.

Mean: Represents the average value of each variable. For instance, the average sales amount is approximately $75.70, the average quantity sold is around 3.41 units, the average discount applied is approximately 9.67%, and the average profit earned is about $13.00.

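Standard Deviation (std): Measures the dispersion or spread of the values around the mean. A higher standard deviation indicates greater variability in the data. In this case, sales and quantity are reported as having higher standard deviations than discount and profit, suggesting more variability in sales amount and quantity sold. Because the four variables are measured in different units (dollars, item counts, and a fraction), this comparison should be read with care; spread is best compared across variables on a relative scale.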
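
One common way to compare spread across variables with different units is the coefficient of variation, the standard deviation divided by the mean. The sketch below uses a made-up three-row frame (summary_demo and its numbers are hypothetical, not results from the Superstore data) purely to show the calculation.

```python
import pandas as pd

# Hypothetical values in two different units (dollars and item counts)
summary_demo = pd.DataFrame({
    "sales": [20.0, 40.0, 60.0],
    "quantity": [2.0, 3.0, 4.0],
})

# Coefficient of variation: sample standard deviation relative to the mean
coefficient_of_variation = summary_demo.std() / summary_demo.mean()
print(coefficient_of_variation)
```

The ratio is only meaningful for variables whose mean is clearly positive, so it would not be a suitable measure for a column such as profit when it contains many negative values.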

Median (50%): Represents the middle value of the dataset when arranged in ascending order. Also known as the second quartile, it divides the data into two equal halves. In this case, the median sales amount is $37.75, the median quantity sold is 3 units, the median discount applied is 0%, and the median profit earned is $8.30.

Maximum (max): Represents the largest value observed for each variable. For instance, the maximum sales amount is $496.86, the maximum quantity sold is 9 units, the maximum discount applied is 50%, and the maximum profit earned is $70.72, indicating the highest values in the dataset.

These statistics provide valuable insights into the distribution and characteristics of the data, helping analysts understand the central tendency, variability, and range of each variable.

Recommendations: Based on this data, one suggestion is to examine the sales and profit distributions more closely to identify any product categories or subcategories that routinely produce higher or lower earnings. Examining the returned variable may also be helpful: it has a maximum value of 1 but a relatively low mean, which suggests that only a small share of items is returned. Some products or categories may be returned more frequently than others, which could affect overall profitability. A further suggestion is to investigate the connection between profitability and discounts in greater detail. Does giving discounts have a bigger effect on sales or profitability for specific products or categories? Are there discount rates that consistently result in higher or lower profits? Examining this relationship may reveal options for increasing profitability and optimizing discounts.
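
The discount question raised above could be explored with a simple group-by. The following is a sketch on invented rows (the frame discount_demo and all of its values are hypothetical); on the real data the same aggregation would be run on dataset_after_outliers, assuming the discount and profit columns named in this report.

```python
import pandas as pd

# Hypothetical transactions at three discount levels
discount_demo = pd.DataFrame({
    "discount": [0.0, 0.0, 0.2, 0.2, 0.5, 0.5],
    "sales": [100.0, 50.0, 80.0, 40.0, 30.0, 20.0],
    "profit": [30.0, 10.0, 8.0, 4.0, -12.0, -8.0],
})

# For each discount level: number of rows, mean profit, and share of loss-making rows
profit_by_discount = discount_demo.groupby("discount").agg(
    rows=("profit", "size"),
    mean_profit=("profit", "mean"),
    share_loss_making=("profit", lambda s: (s < 0).mean()),
)
print(profit_by_discount)
```

A table like this makes it easy to see whether a particular discount rate is associated with mostly loss-making transactions.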

4.2) Categorical values

# Selecting columns with object data type (i.e., string columns)
categorical_data = dataset_after_outliers.select_dtypes(include=['object'])
# Describing the category columns to summarize their characteristics
summary_of_stats_category_values = categorical_data.describe()
# Displaying the statistics
print("Summary statistics for category variables:")
print(summary_of_stats_category_values)

The key findings from the provided dataset include:

Order Information:

There are 4,184 unique order IDs recorded. Order dates range across 1,201 unique dates. Shipping dates span across 1,288 unique dates. The most common ship mode is "Standard Class," with 4,250 occurrences.

Customer Information:

There are 787 unique customer IDs. The customer name "Arthur Prichep" appears most frequently, occurring 28 times. The majority of customers belong to the "Consumer" segment, with 3,732 occurrences.

Geographic Information:

All orders are from the United States. New York City is the most common city, with 709 occurrences. California is the most common state, with 1,606 occurrences. The Western region has the highest frequency of orders, with 2,494 occurrences.

Product Information:

There are 1,675 unique product IDs. Products are categorized into three main categories: Furniture, Office Supplies, and Technology. Within these categories, there are 17 sub-categories. The most common product category is "Office Supplies," with 4,682 occurrences. The most common sub-category is "Paper," with 1,236 occurrences. The product named "Staple envelope" appears most frequently, with 47 occurrences.

These findings provide insights into the distribution and frequency of orders, customer segments, geographic locations, and product categories within the dataset.

Recommendation

• Focus on the West and California: Since these areas account for a significant share of orders, it would be beneficial to focus marketing and promotional efforts there. If the brand's customer and fan base grows in a particular area, sales and revenue could increase.
• Expand the range in Office Supplies and Paper: Office Supplies is the most popular category and Paper the most popular subcategory. Consider offering new, cutting-edge products or expanding the product alternatives in these areas to attract more clients and increase revenue.
• Target Consumer Segment: This group includes the great majority of customers. Develop marketing campaigns, promotions, and product lines specifically for this market in order to foster growth.
• Retain customers: Develop strategies such as loyalty programs, special offers, or tailored marketing campaigns to maintain the attention of present clients and encourage return business.

5) Plot the data for categorical and continuous variables

5.1) Profit by segment

# Group dataset by segment category and calculate total profit for each category
profit_by_segment = dataset_after_outliers.groupby('segment')['profit'].sum().reset_index()
# Visualizing the total profit by segment category using a bar chart
plt.figure(figsize=(10, 6))
plt.bar(profit_by_segment['segment'], profit_by_segment['profit'], color='pink')
plt.xlabel('Segment Category')
plt.ylabel('Total Profit in $')
plt.title('Total Profit Distribution by Segment Category')
plt.show()

The Total Profit by Segment Category graph illustrates the total profit of each customer segment in the dataset. The Consumer segment generates the highest total profit, followed by the Corporate segment and the Home Office segment. Because Consumer is also the segment with the most records, part of this lead reflects order volume rather than higher profit per order. The result still suggests that the Consumer segment is the company's largest source of profit, and that there may be opportunities for further growth or investment in strategies targeting the Corporate and Home Office segments. Understanding the profit distribution across segments can help the company allocate resources more effectively and tailor its marketing strategies to maximize profitability.
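
Total profit depends on both the number of transactions and the profit earned on each one, so it can be useful to report the row count and the mean next to the sum. The sketch below uses five invented rows (segment_demo and its values are hypothetical) to show how a segment can lead on total profit while trailing on average profit.

```python
import pandas as pd

# Hypothetical rows: Consumer has the most transactions but the smallest profit per row
segment_demo = pd.DataFrame({
    "segment": ["Consumer", "Consumer", "Consumer", "Corporate", "Home Office"],
    "profit": [10.0, 5.0, 3.0, 12.0, 9.0],
})

# Count, total, and mean profit per segment in a single table
segment_summary = segment_demo.groupby("segment")["profit"].agg(["count", "sum", "mean"])
print(segment_summary)
```

On the real data, the same aggregation on dataset_after_outliers would show whether the Consumer lead comes from volume, from profit per transaction, or from both.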

5.2) Orders by Region and Product Category

# Analyzing the distribution of orders across different regions and product categories
# Grouping the dataset by product category and region and counting the rows in each group
orders_by_region_category_product = dataset_after_outliers.groupby(['category', 'region']).size().unstack()
# Creating a stacked bar chart to visualize the distribution
orders_by_region_category_product.plot(kind='bar', stacked=True)
# Adding labels and title to the chart
plt.xlabel('Product Category')
plt.ylabel('Number of total Orders')
plt.title('Distribution of Orders across Product Categories and Regions')
# Anchor the legend's upper-left corner just outside the right edge of the axes
plt.legend(title='Regions', bbox_to_anchor=(1.05, 1), loc='upper left')
# Displaying the plot
plt.tight_layout()
plt.show()

Interpretation: The "Number of Orders by Region and Product Category" graph provides valuable insights into the distribution of orders across different regions and product categories within the dataset.

Observations from the graph include:

The West region has the highest number of orders, followed by the East, Central, and South regions. This suggests that the company might have a stronger presence or customer base in the West region compared to other regions.

Among the product categories, Office Supplies have the highest number of orders, followed by Furniture and Technology. This indicates that customers are more frequently purchasing office supplies compared to other product categories.

Understanding the regional and product category distribution of orders can help the company make informed decisions regarding inventory management, marketing strategies, and resource allocation. For example, the company may consider increasing its marketing efforts for products in regions with lower order volumes or optimizing inventory levels based on the popularity of different product categories in specific regions.

5.3) Sales by Product Category and Sub-Category

# Analyzing the total sales across different product categories and subcategories
# Grouping the dataset by product category and subcategory and computing the total sales
total_sales_by_category_subcategory = dataset_after_outliers.groupby(['category', 'sub_category'])['sales'].sum().unstack()
# Creating a stacked bar chart to visualize the total sales
total_sales_by_category_subcategory.plot(kind='bar', stacked=True, figsize=(10, 6))
# Adding labels and title to the chart
plt.xlabel('Product Type Category')
plt.ylabel('Total Sales in $')
plt.title('Total Sales by Product Category and Subcategory')
plt.legend(title='Subcategory', bbox_to_anchor=(1.1, 1), loc='upper left')
# Displaying the plot
plt.tight_layout()
plt.show()

The "Total Sales by Product Category and Subcategory" graph provides a comprehensive overview of the sales distribution across different product categories and subcategories within the dataset.

Key observations from the graph include:

Office Supplies category contributes significantly to total sales, with Binders, Paper, and Storage being the top-selling subcategories within this category. This suggests that office supplies are popular among customers and generate substantial revenue for the company.

Furniture category also makes a substantial contribution to total sales, with Furnishings and Chairs being the top-selling subcategories within this category. This indicates that customers are purchasing furniture items, contributing significantly to the company's revenue.

Technology category, while not as dominant in total sales as Office Supplies and Furniture, still shows strong performance, with Phones, Accessories, and Copiers being the top-selling subcategories. This suggests that technological products are also in demand among customers.

Understanding the sales distribution across different product categories and subcategories is crucial for the company to identify its best-selling products, capitalize on emerging trends, and allocate resources effectively. By analyzing this data, the company can make informed decisions regarding product offerings, marketing strategies, and inventory management to maximize sales and profitability.

5.4) Total Profit and Sales by Category and Sub-category

# Analyzing the total profit and sales across different product categories and subcategories
# Grouping the dataset by product category and subcategory and computing the total sales & profit
total_profit_sales_by_category_subcategory = dataset_after_outliers.groupby(['category', 'sub_category'])[['profit', 'sales']].sum()
# Creating a grouped horizontal bar chart of total profit and sales.
# The bars are drawn side by side: stacking profit on top of sales would show
# their sum, which has no meaningful interpretation.
total_profit_sales_by_category_subcategory.plot(kind='barh', stacked=False, figsize=(10, 6))
# Adding labels and title to the chart
plt.xlabel('Total Amount')
plt.ylabel('Product Category and Subcategory')
plt.title('Total Profit and Sales by Product Category and Subcategory')
plt.legend(title='Metric', bbox_to_anchor=(1.05, 1), loc='upper left')
# Displaying the plot
plt.tight_layout()
plt.show()

The "Total Profit and Sales by Product Category and Subcategory" graph provides valuable insights into both the revenue and profitability generated across different product categories and subcategories within the dataset.

Key observations from the graph include:

Office Supplies category, which demonstrated high sales in the previous graph, also exhibits significant profitability. Despite being the top contributor to sales, it is noteworthy that certain subcategories within Office Supplies, such as Binders and Paper, appear to have relatively lower profitability compared to their sales volume. This may indicate areas where cost optimization or pricing adjustments could enhance profitability.

Furniture category, while contributing substantially to sales, shows a more varied pattern in terms of profitability across its subcategories. For instance, while Furnishings and Chairs contribute significantly to both sales and profitability, Tables appear to have relatively lower profitability despite decent sales volume. This suggests a potential opportunity to analyze and address the factors impacting the profitability of specific furniture items.

Technology category demonstrates a strong correlation between sales and profitability across its subcategories. Products like Phones and Copiers not only generate high sales but also contribute significantly to overall profitability. This indicates that technological products, in general, are lucrative for the company, highlighting the importance of investing in and promoting these items effectively.

Analyzing the relationship between sales and profitability at the category and subcategory levels is essential for the company to prioritize its product offerings and optimize its operational strategies. By identifying areas where profitability lags behind sales volume, the company can implement targeted measures to improve cost efficiency, pricing strategies, and resource allocation, ultimately enhancing overall profitability.
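
One way to make "profitability lagging behind sales volume" measurable is the profit margin, total profit divided by total sales, for each subcategory. The sketch below runs on invented rows (margin_demo and its numbers are hypothetical and do not reproduce the Superstore results); the column names follow the ones used in this report.

```python
import pandas as pd

# Hypothetical transactions in three subcategories
margin_demo = pd.DataFrame({
    "category": ["Furniture", "Furniture", "Office Supplies", "Office Supplies", "Technology"],
    "sub_category": ["Tables", "Tables", "Paper", "Paper", "Phones"],
    "sales": [200.0, 100.0, 50.0, 50.0, 400.0],
    "profit": [-10.0, -5.0, 20.0, 25.0, 60.0],
})

# Total sales and profit per subcategory, then margin = profit / sales
totals = margin_demo.groupby(["category", "sub_category"])[["sales", "profit"]].sum()
totals["margin"] = totals["profit"] / totals["sales"]

# Sort ascending so the weakest margins appear first
ranked_by_margin = totals.sort_values("margin")
print(ranked_by_margin)
```

Sorting by margin puts the subcategories that convert the least of their sales into profit at the top, which is where pricing or cost reviews would start.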

5.5) Total Sales over time

# Ensure that the 'order_date' variable is in datetime format
dataset['order_date'] = pd.to_datetime(dataset['order_date'])
# Group the data by date and sum the sales for each date
total_sales_over_date = dataset.groupby('order_date')['sales'].sum()
# Plot a 7-day moving average trend line. The '7D' window covers seven calendar
# days even when some dates have no orders; an integer window of 7 would instead
# average the last seven dates that happen to appear in the data.
plt.figure(figsize=(10, 6))
plt.plot(total_sales_over_date.index, total_sales_over_date.rolling('7D').mean(), color='red', linestyle='--', label='7-Day Moving Average')
# Add labels and title
plt.xlabel('Time Period')
plt.ylabel('Total Sales ($)')
plt.title('Trend of Total Sales Over Time (7-Day Moving Average)')
# Add legend
plt.legend()
# Display the plot
plt.grid(True)
plt.tight_layout()
plt.show()

The "Total Sales Over Time" graph, presented as a line chart of the 7-day moving average of daily sales, offers valuable insights into the overall sales trend across the period covered by the dataset.

Key observations from the graph include:

Trend Identification: The graph depicts the fluctuation in total sales over time, providing a visual representation of the sales trend throughout the analyzed period. By plotting the total sales on the y-axis against the date on the x-axis, it becomes evident whether sales are increasing, decreasing, or remaining relatively stable over time.

Long-Term Performance: By analyzing the overall trajectory of sales over time, businesses can assess their long-term performance and identify any significant shifts or anomalies in sales patterns. This insight enables companies to make informed decisions about strategic planning, resource allocation, and business growth initiatives.

Forecasting and Planning: The graph serves as a foundation for sales forecasting and future planning. By extrapolating the observed trends, businesses can make informed predictions about future sales performance, allowing them to adjust their operations, marketing campaigns, and product offerings accordingly.

Overall, the "Total Sales Over Time" graph provides a comprehensive overview of sales performance, enabling businesses to identify patterns, anticipate future trends, and make data-driven decisions to drive growth and profitability.
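
For longer-term patterns, daily sales can also be aggregated to months before plotting or comparing. The following is a minimal sketch on five invented order lines (monthly_demo and its dates and amounts are hypothetical); it assumes an order_date column that has already been converted with pd.to_datetime, as in the cell above.

```python
import pandas as pd

# Hypothetical order lines spread over three months
monthly_demo = pd.DataFrame({
    "order_date": ["2017-01-03", "2017-01-20", "2017-02-05", "2017-03-15", "2017-03-28"],
    "sales": [100.0, 50.0, 70.0, 30.0, 90.0],
})
monthly_demo["order_date"] = pd.to_datetime(monthly_demo["order_date"])

# Resample to calendar months ("MS" labels each bin by the month start) and sum the sales
monthly_sales = monthly_demo.set_index("order_date")["sales"].resample("MS").sum()

# Month-over-month relative change; the first month has no predecessor and is NaN
monthly_change = monthly_sales.pct_change()
print(monthly_sales)
print(monthly_change)
```

Monthly totals smooth out day-to-day noise and make seasonal peaks easier to spot than a daily moving average does.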

Continuous (scatter plot)

5.6) Sales and Profit with Quantity of Items Sold

# Create a scatter plot of sales and profit, with the quantity of items sold indicated by marker size and color
plt.figure(figsize=(10, 6))
plt.scatter(dataset_after_outliers['sales'], dataset_after_outliers['profit'],
            s=dataset_after_outliers['quantity']*10, c=dataset_after_outliers['quantity'],
            cmap='magma', alpha=0.5)
# Add title and labels to the plot
plt.xlabel('Sales in $')
plt.ylabel('Profit in $')
plt.title('Relationship Between Sales and Profit, with Quantity of Items Sold Represented by Marker Size and Color')
# Add colorbar
color_bar = plt.colorbar()
color_bar.set_label('Quantity')
# Show the plot
plt.grid(True)
plt.tight_layout()
plt.show()

5.7) Sales and Discount with Quantity of items Sold

# Make a scatter plot of discount and sales, with color and size indicating the quantity of items sold
plt.figure(figsize=(10, 6))
plt.scatter(dataset_after_outliers['discount'], dataset_after_outliers['sales'],
            s=dataset_after_outliers['quantity']*10, c=dataset_after_outliers['quantity'],
            cmap='viridis', alpha=0.5)
# Add labels for the axes and a title to the plot
plt.xlabel('Discount')
plt.ylabel('Sales')
plt.title('Relationship Between Sales and Discount, with Quantity of Items Sold Represented by Marker Size and Color')
# Add colorbar for the range of values
color_bar = plt.colorbar()
color_bar.set_label('Quantity')
# Show the plot
plt.grid(True)
plt.tight_layout()
plt.show()

The Sales and Discount with Quantity of Items Sold Indicated by Marker Size and Color graph in the provided file offers insights into the relationships between sales, discount rates, and the quantity of items sold. Here are some observations and comments on this visualization:

Multivariate Representation: Similar to the previous graph, this plot presents multiple variables simultaneously, including sales, discount rates, and quantity of items sold. By encoding these variables into marker size and color, the graph provides a comprehensive view of their interrelationships.

Marker Size and Color Encoding: Marker size and color effectively represent the quantity of items sold, with larger markers indicating higher quantities and color intensity highlighting variations. This encoding scheme enhances the visualization by integrating additional dimensions of information, making it easier to discern patterns and trends.

Sales-Discount Relationship: The plot reveals the relationship between sales and discount rates. It allows for the identification of any correlations between higher discounts and increased sales volumes. Analyzing the distribution of data points across different discount levels provides insights into the effectiveness of discount strategies in driving sales.

Outlier Detection: Outliers, characterized by data points deviating significantly from the general trend, can be identified on the plot. These outliers may represent unique sales opportunities or anomalies that require further investigation. Understanding the factors contributing to these outliers is essential for refining pricing strategies and optimizing discount offerings.

Insights for Pricing Strategies: The visualization offers valuable insights for pricing and discounting decisions. By examining the relationships between sales, discount rates, and quantity of items sold, businesses can optimize their pricing strategies to maximize revenue while maintaining competitiveness in the market.

In summary, the Sales and Discount with Quantity of Items Sold Indicated by Marker Size and Color graph provides a clear and informative visualization of key sales metrics. It enables businesses to evaluate discount strategies, identify sales opportunities, and optimize pricing decisions based on empirical data.
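The relationship read off the scatter plot can also be summarized numerically. The following is a minimal sketch on synthetic data, not the report's dataset: the DataFrame `df` and its values are hypothetical stand-ins for `dataset_after_outliers`, and only the column names are taken from the plot code above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dataset_after_outliers (hypothetical values).
rng = np.random.default_rng(0)
n = 500
discount = rng.choice([0.0, 0.2, 0.4], size=n)
quantity = rng.integers(1, 10, size=n)
sales = 20.0 * quantity * (1.0 - 0.5 * discount) + rng.normal(0, 5, size=n)
df = pd.DataFrame({"discount": discount, "quantity": quantity, "sales": sales})

# Mean sales and number of orders at each discount level.
by_discount = df.groupby("discount")["sales"].agg(["mean", "count"])
print(by_discount)

# Pearson correlation between discount and sales (captures linear association only).
pearson_r = df["discount"].corr(df["sales"])
print(f"Pearson r (discount vs. sales): {pearson_r:.3f}")
```

A per-level table of this kind makes it easier to see whether higher discounts go together with higher or lower sales than judging marker density by eye.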

6) Modelling strategies to show factors which can contribute to profitability for the business

!pip install scikit-learn


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error, r2_score


Linear regression analysis is used to explore the relationship between various independent variables (features) and the dependent variable (profit). The reasons for choosing linear regression analysis are:

Interpretability: Linear regression provides coefficients for each independent variable, allowing us to interpret their impact on the dependent variable easily.

Simplicity: Linear regression is relatively simple and easy to understand compared to more complex models, making it a good starting point for analysis.

Assumption Checking: Linear regression allows us to check assumptions such as linearity, homoscedasticity, and normality of residuals, which are important for ensuring the validity of the model.

To further support the chosen strategy: "Linear regression analysis is a powerful technique for analyzing the relationship between two or more quantitative variables. It is widely used in various fields such as economics, finance, engineering, and social sciences to model and understand the relationships between variables. Regression analysis helps researchers to quantify the impact of independent variables on the dependent variable and make predictions based on the observed data." (Montgomery, Peck & Vining, 2012)
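The assumption checks mentioned above can be illustrated with a few lines of NumPy. This is a rough sketch on synthetic data, not the checks run in this report: the variables `x` and `y` are hypothetical, and the two summary numbers are informal indicators rather than formal tests (statsmodels also offers formal diagnostics, for example the Breusch-Pagan test).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400
x = rng.uniform(0, 10, size=n)
# Noise whose spread grows with x, so the constant-variance assumption is violated.
y = 3.0 + 2.0 * x + rng.normal(0, 0.5 + 0.5 * x, size=n)

# Ordinary least squares via the design matrix [1, x].
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

# With an intercept in the model, OLS residuals average to zero by construction.
print("mean residual:", residuals.mean())

# Rough homoscedasticity check: do absolute residuals grow with the fitted values?
spread_corr = np.corrcoef(fitted, np.abs(residuals))[0, 1]
print("corr(fitted, |residual|):", spread_corr)

# Rough normality check: sample skewness of the residuals (0 for a symmetric distribution).
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z ** 3)
print("residual skewness:", skewness)
```

A clearly positive correlation between fitted values and absolute residuals suggests that the error variance is not constant, which would weaken the standard errors and p-values reported by the model.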

6.1) Linear regression

!pip install statsmodels==0.14.1


import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Step 1: Encode categorical variables.
# Work on a copy so that dataset_after_outliers itself is left unchanged.
# LabelEncoder assigns arbitrary integer codes, which the model treats as ordered.
dataset_encoded = dataset_after_outliers.copy()
encoder = LabelEncoder()
categorical_columns = dataset_encoded.select_dtypes(include='object').columns
for col in categorical_columns:
    dataset_encoded[col] = encoder.fit_transform(dataset_encoded[col])

# Step 2: Select features and target variable
features = ['sales', 'quantity', 'discount', 'segment', 'category', 'sub_category']
X = dataset_encoded[features]
y = dataset_encoded['profit']

# Step 3: Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Fit the multiple linear regression model
X_train = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train).fit()

# Step 5: Predict on the test data
X_test = sm.add_constant(X_test)
y_pred = model.predict(X_test)

# Step 6: Evaluate model performance (MSE and R-squared on the test set;
# F-statistic and p-value from the fit on the training set)
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)

# Print evaluation metrics
print("Mean Squared Error:", mse)
print("R-squared:", r_squared)
print("F-statistic:", model.fvalue)
print("P-value:", model.f_pvalue)

# Print coefficient magnitudes (absolute values; the signs are in model.params)
print("\nCoefficient Magnitudes:")
coefficients = model.params.drop('const')
coefficients_abs = coefficients.abs().sort_values(ascending=False)
print(coefficients_abs)

# Feature importance scores (model.feature_importances_) apply only to models
# such as Random Forest or Gradient Boosting, not to OLS


The output provides important insights into the performance and impact of different variables in the multiple linear regression model:

Mean Squared Error (MSE): This measures the average squared difference between the predicted and actual values on the test set. Here the MSE is approximately 200.08, expressed in squared profit units.

R-squared: R-squared is a measure of how well the independent variables explain the variability of the dependent variable. In this model, approximately 34.75% of the variability in test-set profit is explained by the independent variables.

F-statistic: This statistic, taken from the fit on the training data, tests the overall significance of the regression model. A higher F-statistic and a lower associated p-value suggest that at least one independent variable is significantly related to the dependent variable. Here, the F-statistic is 422.27 and the associated p-value is printed as 0.0, meaning it is too small to distinguish from zero at the printed precision, so the overall model is statistically significant.

Coefficient Magnitudes: The code prints the absolute values of the coefficients, so this output shows the size of each estimated effect but not its direction. By absolute value, 'discount' has the largest coefficient, followed by 'category', 'quantity', 'segment', 'sub_category', and 'sales'. This ordering should be read with caution. The features are on different scales (discount is a fraction, sales is in dollars), so raw magnitudes are not directly comparable, and 'segment', 'category', and 'sub_category' are label-encoded, so their coefficients depend on an arbitrary integer ordering of the categories.
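Raw coefficient magnitudes depend on the units of each feature, so a ranking based on them can change once the features are put on a common scale. The sketch below illustrates this on synthetic data; the arrays `sales`, `discount`, and `profit` are hypothetical and the helper `ols_slopes` is introduced only for this illustration. It is not the model fitted in this report.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
sales = rng.uniform(0, 500, size=n)       # large numeric range
discount = rng.uniform(0, 0.8, size=n)    # small numeric range
profit = 0.2 * sales - 30.0 * discount + rng.normal(0, 5, size=n)


def ols_slopes(X, y):
    """Return OLS slope estimates (an intercept is fitted but dropped)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]


X = np.column_stack([sales, discount])
raw = ols_slopes(X, profit)

# Standardize each feature to mean 0 and standard deviation 1, then refit.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
standardized = ols_slopes(X_std, profit)

for name, r, s in zip(["sales", "discount"], raw, standardized):
    print(f"{name:9s} raw={r:8.3f}  standardized={s:8.3f}")
```

In this synthetic example the raw coefficient on discount is far larger in absolute value, yet after standardization sales has the larger coefficient, because sales varies over a much wider range.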

# Scatter plot of actual vs. predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot(y_test, y_test, color='red', linestyle='--')  # Plotting the diagonal line y = x
plt.xlabel('Actual Profit')
plt.ylabel('Predicted Profit')
plt.title('Actual vs. Predicted Profit (Linear Regression)')
plt.grid(True)
plt.show()


The scatter plot of actual versus predicted profit showed a significant divergence from the red dashed diagonal (the line y = x, on which perfectly predicted points would fall), indicating that while the model captured some patterns, it did not accurately predict profit for all instances.

Based on these findings, it was concluded that the linear regression model may not accurately represent the underlying relationships between the attributes and profitability. Therefore, alternative models, namely polynomial regression and decision tree regression, were explored to improve performance and better understand the relationship between the variables and profitability.
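One possible contributor to a weak linear fit is the integer coding of categorical features, which imposes an order on categories that have none. A common alternative, shown here only as a sketch and not used in this report, is one-hot encoding with pandas. The rows and category labels below are hypothetical; only the column name 'category' mirrors the report's feature.

```python
import pandas as pd

# Hypothetical rows; the category labels are illustrative.
df = pd.DataFrame({
    "category": ["Furniture", "Office Supplies", "Technology", "Furniture"],
    "sales": [120.0, 35.5, 410.0, 89.9],
})

# One indicator column per category level; drop_first avoids perfect
# collinearity with the intercept in a linear model.
encoded = pd.get_dummies(df, columns=["category"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each remaining indicator then gets its own coefficient, interpreted relative to the dropped baseline level, instead of a single slope across arbitrary integer codes.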

6.2) Polynomial regression

"In many settings, the straight-line fit provided by linear regression is insufficient for describing the relationship between the predictors and the response. Polynomial regression extends the linear model by including terms that are powers of the predictors, allowing for more flexible and nonlinear relationships to be captured. This flexibility can be particularly useful when the relationship between the predictors and the response is curvilinear or exhibits complex patterns. However, caution should be exercised when fitting polynomial models, as higher-order terms can lead to overfitting, especially with limited data." (Hastie, Tibshirani & Friedman, 2009).

from sklearn.preprocessing import LabelEncoder, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

# Encode categorical variables on a copy of the cleaned dataset
encoder = LabelEncoder()
dataset_encoded = dataset_after_outliers.copy()
categorical_dataset = dataset_encoded.select_dtypes(include='object').columns
dataset_encoded[categorical_dataset] = dataset_encoded[categorical_dataset].apply(encoder.fit_transform)

# Select the features used in the regression (categorical ones are label-encoded)
numerical_dataset = ['sales', 'quantity', 'discount', 'ship_mode', 'segment', 'category', 'sub_category']
X = dataset_encoded[numerical_dataset]
y = dataset_encoded['profit']

# Create polynomial features (degree 2: linear, squared, and pairwise interaction terms)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Add a constant term to the features
X_poly = sm.add_constant(X_poly)

# Divide the dataset into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.3, random_state=42)

# Fit the polynomial regression model
model = sm.OLS(y_train, X_train).fit()

# Predict on the test data
y_pred = model.predict(X_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Get the F-statistic and its p-value
f_statistic = model.fvalue
p_value = model.f_pvalue

# Get the coefficients and their corresponding variable names.
# PolynomialFeatures puts the linear terms first, so zipping with X.columns
# pairs each original variable with its linear-term coefficient; the squared
# and interaction coefficients that follow are not printed.
coefficients = model.params[1:]  # Exclude the intercept
variable_names = X.columns

print("Mean Squared Error:", mse)
print("R-squared:", r2)
print("F-statistic:", f_statistic)
print("P-value:", p_value)

# Print linear-term coefficients along with variable names
print("\nCoefficient Magnitudes:")
for variable, coefficient in zip(variable_names, coefficients):
    print(f"{variable:15}: {coefficient:.6f}")


The output indicates the performance and coefficients of the polynomial regression model:

Mean Squared Error (MSE): The MSE measures the average squared difference between the actual profit values and the predicted profit values. In this case, the MSE is approximately 148.06, which means, on average, the squared difference between actual and predicted profits is around 148.06.

R-squared: The R-squared value (0.5171) indicates the proportion of the variance in the profit variable that is predictable from the independent variables (sales, quantity, discount, etc.). A higher R-squared value suggests that the model explains a greater proportion of the variance in the profit variable.

F-statistic: The F-statistic (154.14) tests the overall significance of the model, that is, whether the regressors jointly explain more variation in profit than an intercept-only model.

P-value: The p-value (printed as 0.0) is the probability of observing an F-statistic at least this large if the null hypothesis (the model has no explanatory power) were true. A value this small indicates that the model is statistically significant.

Coefficient Magnitudes: Only the coefficients on the seven linear terms are printed; the squared and interaction terms are part of the model but are not shown. Because of those extra terms, each printed coefficient describes the slope of profit with respect to that variable only where the squared and interaction terms contribute nothing (at zero values of the predictors), not the variable's overall effect. Some observations:

Sales: The linear-term coefficient is approximately 0.287.
Quantity: The linear-term coefficient is approximately 0.685.
Discount: The linear-term coefficient is approximately -19.563, the only large negative value among the printed terms.
Ship Mode: The linear-term coefficient is slightly negative, at approximately -0.070.
Segment, Category, Subcategory: These label-encoded variables have positive linear-term coefficients (approximately 0.863, 22.907, and 4.762), which depend on the arbitrary integer codes assigned to the categories.
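In a degree-2 model, the effect of a predictor at a given point combines its linear, squared, and interaction coefficients. The sketch below fits a quadratic surface to synthetic data and evaluates that combined slope; the variables `x`, `z`, `y` and the helper `marginal_effect_of_x` are hypothetical and serve only to illustrate the idea, not to reproduce the report's model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0, 10, size=n)   # stand-in for one predictor
z = rng.uniform(0, 1, size=n)    # stand-in for a second predictor
y = 1.0 + 2.0 * x - 0.1 * x**2 - 3.0 * x * z + rng.normal(0, 0.5, size=n)

# Degree-2 design matrix: 1, x, z, x^2, x*z, z^2
design = np.column_stack([np.ones(n), x, z, x**2, x * z, z**2])
b0, bx, bz, bxx, bxz, bzz = np.linalg.lstsq(design, y, rcond=None)[0]


def marginal_effect_of_x(x_val, z_val):
    """Slope of the fitted surface with respect to x at the point (x_val, z_val)."""
    return bx + 2.0 * bxx * x_val + bxz * z_val


print("coefficient on x alone:", round(bx, 3))
print("effect of x at x=2, z=0.0:", round(marginal_effect_of_x(2.0, 0.0), 3))
print("effect of x at x=8, z=0.5:", round(marginal_effect_of_x(8.0, 0.5), 3))
```

Here the coefficient on x alone is positive, yet the fitted effect of x turns negative at larger x and z, which is why a linear-term coefficient from a polynomial model should not be read as the variable's overall impact.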

import matplotlib.pyplot as plt

# Scatter plot of actual vs. predicted profit values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue', alpha=0.5)  # Plot actual vs. predicted profit
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', linestyle='--')  # Plot diagonal line
plt.xlabel('Actual Profit')
plt.ylabel('Predicted Profit')
plt.title('Actual vs. Predicted Profit (Polynomial Regression)')
plt.grid(True)  # Add grid lines
plt.show()


The polynomial regression chart compares predicted and actual profit on the test set. Each blue point pairs an actual profit value (horizontal axis) with the value predicted by the polynomial regression model (vertical axis), while the red dashed line is the diagonal y = x on which perfectly predicted points would fall.

This chart provides a visual comparison between the expected and observed profit figures. The points generally follow the diagonal, indicating that the polynomial regression model captures the underlying patterns in the data reasonably well. However, some points still deviate from the line, showing that the model does not predict profit accurately in all cases.

Overall, the polynomial regression model shows improved performance compared to the linear regression model, as indicated by a lower mean squared error and a higher R-squared value. This implies that the polynomial regression model better explains the variation in profitability and provides more accurate predictions, thus offering valuable insights into the factors contributing to the company's profitability.

6.3) Comparing the results of both regressions

Linear Regression Output:

Mean Squared Error (MSE): 200.08
R-squared: 0.3475
F-statistic: 422.27
P-value: 0.0

Polynomial Regression Output:

Mean Squared Error (MSE): 148.06
R-squared: 0.5171
F-statistic: 154.14
P-value: 0.0

Coefficient Magnitudes (Linear vs. Polynomial):

The linear values are absolute values, as printed by the code in 6.1; the polynomial values are the signed coefficients on the linear terms.

Sales: Linear 0.0798, Polynomial 0.2873
Quantity: Linear 1.5306, Polynomial 0.6848
Discount: Linear 58.9533, Polynomial -19.5634
Ship Mode: Linear N/A (not included in the linear model), Polynomial -0.0702
Segment: Linear 0.1824, Polynomial 0.8629
Category: Linear 3.9219, Polynomial 22.9069
Subcategory: Linear 0.1534, Polynomial 4.7617

Comparison:

Model Performance: The polynomial regression model outperforms the linear regression model in terms of MSE and R-squared. The polynomial model has a lower MSE (148.06 vs. 200.08) and a higher R-squared (0.5171 vs. 0.3475), indicating better predictive accuracy and explanatory power.

Model Significance: Both models have a significant F-statistic (p-value reported as 0.0). The F-statistic for the linear model (422.27) is higher than that of the polynomial model (154.14), but the two are not directly comparable: the polynomial model has 35 regressors against 6 for the linear model, and the F-statistic divides the explained variation by the number of regressors, so it tends to fall as terms are added.

Impact of Variables: The coefficient magnitudes differ substantially between the two models. The linear regression output lists absolute values only, so the direction of each effect cannot be compared from these figures; in the polynomial output the linear terms for sales and quantity are positive and the term for discount is negative. The polynomial model also includes ship_mode and the squared and interaction terms, so its linear-term coefficients are not like-for-like counterparts of the linear model's coefficients.

Overall, the polynomial regression model provides better performance and captures more complex relationships between the variables compared to the linear regression model.
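The gap between the two F-statistics is partly mechanical. For an OLS model with an intercept, the overall F-statistic can be written as (R² / k) / ((1 - R²) / (n - k - 1)), where k is the number of regressors and n the number of observations. The sketch below applies this formula with a hypothetical sample size and a hypothetical in-sample R-squared held fixed; these are not the report's figures, since the R-squared values reported above are computed on the test set.

```python
def f_statistic(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Overall F-statistic of an OLS model with an intercept."""
    explained = r_squared / n_predictors
    unexplained = (1.0 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained


n_obs = 5000   # hypothetical training-set size
r2 = 0.40      # hypothetical in-sample R-squared, held fixed for both models

# Six features in the linear model of section 6.1.
f_linear = f_statistic(r2, n_obs, n_predictors=6)
# Degree 2 on seven features: 7 linear + 7 squared + 21 pairwise terms = 35.
f_poly = f_statistic(r2, n_obs, n_predictors=35)

print(f"F with 6 predictors:  {f_linear:.1f}")
print(f"F with 35 predictors: {f_poly:.1f}")
```

With the same explained share of variance, the model with more regressors gets a much smaller F-statistic, so a lower F for the polynomial model does not by itself indicate a worse fit.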

6.4) Decision Tree Regression

Decision tree analysis is particularly well-suited for the dataset provided due to its interpretability, ability to handle mixed data types, and robustness to outliers. As noted by Hastie et al. (2009), decision trees provide a transparent and intuitive representation of decision rules, making them valuable for understanding complex relationships in data. Additionally, decision trees can handle both numerical and categorical variables without the need for extensive preprocessing, which is advantageous for datasets with mixed data types like the one under consideration. Moreover, decision trees are robust to outliers and do not assume a specific data distribution, as highlighted by Breiman et al. (1984). This resilience to outliers is beneficial for real-world datasets where anomalies are common, ensuring that the model's performance is not unduly influenced by extreme values. Furthermore, decision trees perform automatic feature selection, selecting the most informative features at each split, which aids in identifying the key factors contributing to profitability (Hastie et al., 2009). Overall, the transparency, versatility, and robustness of decision tree analysis make it an appropriate choice for exploring the complex relationships within the dataset and deriving actionable insights for decision-making.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder

# Work on a copy of the cleaned dataset
dataset = dataset_after_outliers.copy()

# Encode categorical values
encoder = LabelEncoder()
categorical_dataset = dataset.select_dtypes(include='object').columns
for col in categorical_dataset:
    dataset[col] = encoder.fit_transform(dataset[col])

features = ['sales', 'quantity', 'discount', 'segment', 'category', 'sub_category']
X = dataset[features]
y = dataset['profit']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the decision tree regression model (no depth limit)
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

# Fit an OLS model on the same features with statsmodels.
# Note: the F-statistic and p-value below describe this linear model,
# not the decision tree.
X_train_with_const = sm.add_constant(X_train)
model_stats = sm.OLS(y_train, X_train_with_const).fit()
f_statistic = model_stats.fvalue
p_value = model_stats.f_pvalue
print("F-statistic:", f_statistic)
print("P-value:", p_value)

# Refit a shallower tree (max_depth=4) so that the plot stays readable.
# Note: the plot and the feature importances below come from this depth-4
# tree, not from the unrestricted tree evaluated above.
model = DecisionTreeRegressor(random_state=42, max_depth=4)
model.fit(X_train, y_train)

# Visualize the decision tree
plt.figure(figsize=(30, 30))
plot_tree(model, feature_names=features, filled=True, fontsize=10)
plt.show()

# Get feature importances
importances = model.feature_importances_

# Create a DataFrame to display the importances
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})

# Sort the DataFrame by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print the feature importances
print(feature_importance_df)

# Plot the feature importances
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importances')
plt.show()


The output of the decision tree regression includes two key evaluation metrics: the mean squared error (MSE) and the R-squared value, both computed on the test set. The F-statistic and p-value printed alongside them come from an ordinary least squares fit on the same features and therefore do not describe the decision tree.

The MSE of 126.08 is expressed in squared profit units; its square root (the RMSE) of about 11.2 means that predictions typically deviate from actual profit by roughly 11.2 units. The R-squared value of 0.59 indicates that around 59% of the variability in profit can be explained by the features included in the model.
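Because MSE is in squared units, taking its square root (RMSE) makes the three models easier to compare in the units of profit. The short calculation below uses only the test-set MSE values reported in sections 6.1, 6.2, and 6.4; the dictionary names are introduced here for illustration.

```python
import math

# Test-set MSE values reported in sections 6.1, 6.2 and 6.4.
mse_by_model = {
    "linear": 200.08,
    "polynomial": 148.06,
    "decision tree": 126.08,
}

# RMSE is the square root of MSE and is expressed in the units of profit.
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

for name, rmse in rmse_by_model.items():
    print(f"{name:14s} RMSE = {rmse:.1f}")
```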

The graph is a visual representation of a decision tree refit with max_depth=4 so that it remains readable. Each internal node shows the feature and threshold used for a split, and the branches lead to the subsets of records on either side of that threshold. With filled=True, the shading of a node reflects its predicted profit value, not the importance of a feature. The feature importance scores indicate the relative contribution of each feature to the splits of the tree, and they are computed from this depth-4 tree rather than from the unrestricted tree whose MSE and R-squared are reported above. In this output, the 'sales' and 'discount' features have the highest importance, followed by 'sub_category' and 'category'. Features with importance scores of 0, such as 'quantity' and 'segment', were not used in any split of this tree, which does not by itself show that they are unrelated to profit.
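Impurity-based importances are computed on the training data and can be cross-checked on held-out data. One widely used cross-check, which is not part of this report's analysis, is permutation importance from scikit-learn. The sketch below runs it on synthetic data; the feature matrix, the target, and the names in `names` are hypothetical.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 1500
# Hypothetical features: two informative columns and one pure-noise column.
X = np.column_stack([
    rng.uniform(0, 500, size=n),   # "sales"
    rng.uniform(0, 0.8, size=n),   # "discount"
    rng.integers(1, 10, size=n),   # "noise"
])
y = 0.2 * X[:, 0] - 60.0 * X[:, 1] + rng.normal(0, 5, size=n)
names = ["sales", "discount", "noise"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
tree = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_train, y_train)

# Impurity-based importances: derived from the splits learned on the training data.
impurity = dict(zip(names, tree.feature_importances_))

# Permutation importance: drop in test-set R-squared when one column is shuffled.
result = permutation_importance(tree, X_test, y_test, n_repeats=10, random_state=42)
permutation = dict(zip(names, result.importances_mean))

for name in names:
    print(f"{name:9s} impurity={impurity[name]:.3f}  permutation={permutation[name]:.3f}")
```

If both measures rank the features similarly, the ranking is less likely to be an artifact of how the tree happened to split the training data.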

Overall, these metrics and the decision tree graph provide insights into the model's performance and decision-making process, allowing stakeholders to understand how different features contribute to profitability and assess the model's predictive capabilities.

7) Conclusion and Recommendation

Based on the analysis conducted in this report, several key insights have been uncovered regarding the company's operations and performance. These insights can inform strategic decision-making and highlight areas for improvement.

Sales and Profitability Analysis: The analysis of sales data revealed valuable insights into the company's revenue streams, with particular attention to segment profitability. Understanding which segments drive the highest profits can guide resource allocation and marketing strategies.

Order Processing and Shipping Efficiency: Temporal trends analysis highlighted the importance of efficient order processing and shipping for maintaining customer satisfaction and retention. Identifying peak periods and streamlining operations can enhance efficiency and profitability.

Product Category and Regional Distribution: Analyzing orders by region and product category provided insights into customer preferences and market demand. This information can guide inventory management and marketing efforts to maximize profitability.

Discount Strategies and Sales Performance: The analysis of sales, discounts, and quantity sold revealed the impact of discount strategies on sales performance. Optimizing discount offerings based on empirical data can enhance revenue generation.

Recommendations:

Segment-Specific Strategies:

Tailor strategies for different customer segments based on their profitability and growth potential. Identify high-performing segments and focus on maximizing profitability through targeted marketing and personalized offerings. For segments with lower profitability, explore opportunities for growth and implement strategies to increase their contribution to overall profitability.

Operational Efficiency:

Streamline order processing and shipping operations, particularly during peak periods, to improve efficiency and reduce costs. Invest in automation and technology solutions to optimize supply chain management and reduce lead times. Implement measures to enhance customer satisfaction, such as offering faster delivery options and improving order tracking systems.

Market Expansion:

Explore opportunities to expand market presence in regions with lower order volumes by investing in targeted marketing campaigns and distribution channels. Conduct market research to identify untapped segments or geographical areas with growth potential. Customize marketing strategies to resonate with local preferences and consumer behaviors in new markets.

Pricing and Discount Optimization:

Continuously evaluate discount strategies based on their impact on sales performance and profitability. Use data analytics to determine optimal pricing strategies for different product categories and customer segments. Implement dynamic pricing models that adjust prices in real-time based on demand fluctuations and market conditions.

Data-Driven Decision Making:

Emphasize the importance of data-driven decision-making across all business functions. Invest in data analytics tools and capabilities to gather insights from sales analysis, temporal trends, and customer segmentation. Use predictive analytics to forecast demand, anticipate customer preferences, and optimize inventory management.

Continuous Improvement:

Foster a culture of continuous improvement within the organization by encouraging feedback loops and adaptation based on performance metrics and market dynamics. Implement regular performance reviews and KPI tracking to monitor progress towards profitability goals. Encourage cross-functional collaboration to identify areas for improvement and implement best practices across the organization.

By implementing these recommendations, the company can enhance its operational efficiency, maximize profitability, and maintain competitiveness in the market.

References

Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis with Missing Data. John Wiley & Sons.

Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.

Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to Linear Regression Analysis (5th ed.). Wiley.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.

Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and Regression Trees. CRC Press.
