C11BD - CW2 - H00447550

1. Introduction

The incorporation of data has become an integral part of retail businesses around the globe which enhance supply chain management, improve marketing strategies and, hence, the prosperity and profitability of business in a competitive business environment (Gandomi & Haider, 2015). The study covers a number of key aspects and provides in-depth analysis that a retail organization can use to improve its operations and thereby optimize profits. The data set analyzed consists 9993 records across 29 features including consumer data, sales transactions and product specifics, to conduct an exploratory analysis, preprocess and cleanse data, summarize statistics, visualize insights, and apply suitable modeling techniques to pinpoint factors crucially affecting the company's profitability.

The report focuses on data preparation and exploratory analysis in the initial sections, focusing on the challenges related to data quality and the strategies to address them. The later sections details the selected modelling approach and it's implementation. Lastly, the outcomes and insights have been discussed from applying the modelling strategies and implications of these results have been discussed.

2. Importing Data Set

import pandas as pd # Import pandas library for the analysis of data sets import numpy as np # Import numpy Library for scientific computation import matplotlib.pyplot as plt # Import matplotlib library for data visualizations import seaborn as sns # Import seaborn library for data visualizations

Run to view results

# Using read_csv function to read/load the dataset CSV file df= pd.read_csv("dataset_Superstore.csv")

Run to view results

# Using .head() function for displaying the first few rows of the dataframe print ("First few rows of the dataframe:") df.head()

Run to view results

# Using .tail() function for displaying the last few rows of the dataframe print ("Last few rows of the dataframe:") df.tail()

Run to view results

# Using .info() function that prints detailed information about the dataframe print("Detailed Information of Dataframe", df.info())

Run to view results

# Using .shape() function for obtaining the shape of a given array print("Shape:", df.shape)

Run to view results

The outputs above clearly shows the completion of the data. Here in (9994, 29), 9994 shows the quantity of rows in the data frame which means there are 9994 observations. 29 represents the number of columns telling us that there are 29 different variables or features in the dataset.

3. Data Cleaning and Filtering

Data cleaning is a fundamental step in the data processing phase of any data analysis and machine learning projects that comes with a lot of benefits. The data cleaning steps significantly improve the data quality, that leads to more accurate analysis and robust model performances (Jesmeen, et al., 2018). This section will involve the following steps, checking for missing values, converting the data types and bringing consistency into data, identifying outliers and removing them with IQR method.

3.1 - Missing Values

In big data analytics it is very important to check for missing values in the data frames due to data integrity and quality, impact on analysis, informed imputations and understanding data patterns (Emmanuel, et al., 2021).

# Check for the missing values in the dataframe # Using .isnull() method to check for missing values in the dataset missing_values = df.isnull().sum() # # Using .sum() method to calculate the total number of missing values in each column separately print("Missing values per column:") print(missing_values)

Run to view results

No missing value has been found in the respective data frame. The next step is to convert the order date and ship date into date time format for better and correct data analysis.

3.2 - Datetime Format

# Using.to_datetime() function to convert order date and ship date to datetime format df['Order Date'] = pd.to_datetime(df['Order Date'], dayfirst=True) df['Ship Date'] = pd.to_datetime(df['Ship Date'], dayfirst=True)

Run to view results

3.3 - Converting Boolean values (True/False) to Integers (1/0)

In order to match compatibility with mathematical operations and modelling requirements, 'returned' columns has boolean values which will be converted to integers.

# Using .astype() method to convert boolean (1 for True and 0 for False) format to integer df['Returned'] = df['Returned'].astype(int)

Run to view results

3.4 - Duplicate Data Identification

# Using .duplicated() method to identify the duplicated rows duplicates = df.duplicated()

Run to view results

# Using .sum() method to count the number of duplicates present count_duplicates = duplicates.sum() print(f"Count of Duplicates: {count_duplicates}")

Run to view results

The result shows there are no duplicate rows in the dataset.

3.5 - Outliers Identification

Locating outliers is fundamental in data analysis so the data is not skewed in any way and there is no exceptions. Outliers tend to distort the statistics and over time can reduce the performance of models based on linear and space distance algorithms. At times, they indicate the data entry error, but it can also reveal a way to some important findings and patterns that were unknown. Feature engineering and data normalization difficultly deal with issues like outliers, which in turn can cause problems like a degradation of features and scaling distortions which finally lead to misleading analysis and the wrong decision making.

df.describe()

Run to view results

Analyzing the data above, it can be seen that the values in sales, quantity, discount and profit shows variations which can be due to the presence of outliers. For in depth analysis , the outliers will first be identified and then removed.

3.6 - Identification and Removal of Outliers

The IQR (Interquartile Range) method is particularly effective for detecting outliers in customer data due to it's skewed distributions based on the nature of variations in data (Chatterjee, et al., 2021). The significance of the method lies in the fact that by calculating 75th percentile (Q3) and the 25th percentile (Q1) , the method focuses on the middle 50% of data points, marking the outside values as outliers. Therefore, the approach is less influenced by extreme values focusing on central portions. The quantity, sales and profit data shows variations as there might be instances where minority of consumers may have usually large transactions, enabling analyst and companies to identify consumer transactions indicative of important purchasing behaviors and not data errors. Hence, in this particular context IQR method is being used to handle outliers.

# Using IQR method to identify outliers # Using detect_outliers_iqr(data, columns) def detect_outliers_iqr(data, columns): outliers = pd.DataFrame() for column in columns: Q1 = data[column].quantile(0.25) Q3 = data[column].quantile(0.75) IQR = Q3 - Q1 lower_bound = Q1 - 1.5 * IQR upper_bound = Q3 + 1.5 * IQR outliers_in_column = (data[column] < lower_bound) | (data[column] > upper_bound) outliers = pd.concat([outliers, outliers_in_column], axis=1) outliers.columns = columns return outliers

Run to view results

# Identifying outliers columns_to_check_for_outliers = ['Quantity', 'Sales', 'Discount', 'Profit'] outliers = detect_outliers_iqr(df, columns_to_check_for_outliers)

Run to view results

# Print the number of outliers for each specified column for column in columns_to_check_for_outliers: print(f"Outliers Detected in '{column}': {outliers[column].sum()}")

Run to view results

The 'Sales and Profit' exhibits a broad spectrum of values and significant outliers represents transactions with notably higher profits and sales than most. These outliers aren't always error but they actually at times signify key business operations such as transactions involving higher volumes and highly profitable sales with lesser volumes. Identifying these are vital as they may point to pivotal customers, lucrative sales tactics, successful products leading to better operational strategies. On the other hand, 'Quantity' also comes with outliers but to not much as the other two variables, thereby having transaction quantities with very high amounts that do not have significant effect on deviations. The 'Discount' distribution concentrates mainly on smaller the happier discounts, and at the same time outliers sometimes show higher than usual what is more discounts. However, these very unusual examples are the more valuable ones, showing the company that it may be time to put these sales-boosters into action- targeted promotions, clearance events or enhanced discounting, just to name a few, revealing how the business is grasping its marketing endeavors.

3.6a - Removal of Outliers

# Removing outliers detected above df_cleaned = df[~outliers.any(axis=1)] # Using print to see the numbers of outlier removed print(f"Outliers Removed: {len(df) - len(df_cleaned)}") # Print the shape of the new DataFrame without outliers print(f"Shape of the DataFrame after Removal of Outlers: {df_cleaned.shape}")

Run to view results

df_cleaned.head()

Run to view results

# Printing the shape of the DataFrame without outliers print("Shape of the DataFrame without outliers:", df_cleaned.shape)

Run to view results

4. Calculating Summary Statistics

4.1 - Quantitative Variables:

The variables that represent the quantitative data in a data set are numerical variables and the summary statistics for such variables include central tendency, measure of dispersion and others like count, min and max helping analysts to understand variability, distribution and central location of the data.

# Using .describe() function for summary statistics quantitative_variables = ['Sales', 'Quantity', 'Profit', 'Discount', 'Returned'] quantitative_summary_statistics = df_cleaned[quantitative_variables].describe() print (quantitative_summary_statistics)

Run to view results

Analysis: The output shows variation spread in sales figures with average sales as $75.70 with a standard deviation of $92.2. The Quantity output shows , on average each transaction involves the sale of 3.4 units, with a standard deviation of 1.88 units. The figures show a mean of $13 per deal and a standard deviation of $ 17.19. It implies that a sale’s profit is usually in the region of $13, but the deviation can be large, with a spread of $17.19, shows variability showing some products are generating higher profits. With a standard deviation of $0.01 and an average of $0.09 per transaction, discount seems to be fairly consistent. The mean of returned is 8% which means most of the sales are not being returned. The analysis lays down the need of further in-depth analysis on co-relation between the variables affecting the profits and studying the categories in detail as to which categories are giving the maximum profits for the company to focus on.

4.2 - Analyzing the Categories - Categorical Variables

Analyzing categorical data will give us a detailed insights to the most profitable category and sub category. For this data we will be focusing on the values like count, top category, unique count and frequency of the top category. Since, these variables represent categories , metrices like mean and standard deviation are not meaningful.

# Using include=['object'] and .select_dtypes() for selection of specific columns with object data qualitative_variables = df_cleaned.select_dtypes(include=['object']) # Using describle() for summary statistics qualitative_variables_summary_statistics = qualitative_variables.describe() print (qualitative_variables_summary_statistics)

Run to view results

The output above is an extremely useful snapshot of the company's operations unfolding the top aspect in every variable that directly or indirectly contributes to the company's profits. The unique customer id shows that the company serves 787 customers in US across 508 cities and 48 states that makes a huge customer base with tremendous opportunities. It offers 1675 products falling under 3 categories and 17 sub categories. The top figure shows us West California being the top region for the company and staple envelope being the most in demand product and office supplies being the top category. The company can make marketing, sales and trades offers in order to retain the top customer and increase the repeat sales in the region.

4.3 - Correlation Matrix

# Correlation Matrix plt.figure(figsize=(8, 8)) sns.heatmap(df_cleaned[['Sales', 'Quantity', 'Discount', 'Profit']].corr(), annot=True, cmap='coolwarm') plt.title('Correlation Matrix') plt.show()

Run to view results

The coefficient matrix above shows the correlation of respective variables with profit. Sales showing a positive correlation which is closer to 1 indicating the profit increases with increase in sales whereas the correlation value of -0.31 with discount shows that if discount increases, profit is likely to decrease. 0.23 correlation value of Quantity and Profit shows there's a weak relationship as fewer quantities sold can lead to higher profits if the products have higher profit margins. Such relationships can help the company with formulating strategies for profit enhancement. Whereas coefficient matrix helped us identify the variables creating impact on profits.

5. Graphs

5.1: Categorical Chart (Bar Chart)

A categorical bar chart is a visualization tool that is used to show the distribution of a categorical variable, or to compare several categories with respect to a numerical variable

# The quantity of orders by category and region # Using .nunique() to count the unique values in the dataframe orders_quantity_by_region_and_category = df_cleaned.groupby(['Region', 'Category'])['Order ID'].nunique() orders_quantity_by_region_and_category = orders_quantity_by_region_and_category.unstack() # Creating bar chart orders_quantity_by_region_and_category.plot(kind='bar', stacked=True) plt.xlabel('Region') plt.ylabel('No.of Orders') plt.title('Quantity of Orders by Category and Region') # Using plt.show() function for displaying the bar chart plt.show()

Run to view results

The bar graph above indicates that the products in the 'Office Supplies' category is the top category across all regions. Whereas, orders of technology and furniture varies across the regions. Now lets see which segment is contributing most to the profit of the company.

# Aggregate profit by category segment_profit = df_cleaned.groupby('Segment')['Profit'].sum().reset_index() # Sorting the categories by profit for better visualization segment_profit = segment_profit.sort_values('Profit', ascending=False) # Plotting the bar chart plt.figure(figsize=(6, 6)) plt.bar(segment_profit['Segment'], segment_profit['Profit'], color='purple') plt.xlabel('Segment') # X-axis label plt.ylabel('Profit') # Y-axis label plt.title('Profit by Segment') # Chart title plt.xticks(rotation=45) # Rotate category names for better readability plt.show()

Run to view results

The bar chart above shows that "Consumer" segment is the most profitable segment for the company. Hence, the company should invest more in this segment.

5.2 - Scatter Plot

The scatter plot between sales and profits will help us examine the relationship between these two variables. This will allow us to study the trends , observe patterns and potential correlations.

# Continuous plot: Sales vs. Profit scatter plot plt.figure(figsize=(8, 6)) sns.scatterplot(x='Sales', y='Profit', data=df_cleaned, color='orange') plt.title('Sales vs. Profit') plt.xlabel('Sales ($)') plt.ylabel('Profit ($)') plt.tight_layout() plt.show()

Run to view results

The trend of the data tends to follow a positive correlation between sales and profit. As sales increase, profit also tends to increase(by default). Spread of the Data: The data points are not lined up perfectly in a straight line, hence the deviation indicates that even with the same sales levels there are variations in profit. This implies that besides a sales increase, other reasons may affect profit. Outliers: The data points could be outside of the main cluster and fall on a far side from it. This might be a clue for further investigations to be carried out. For instance, a high sales figure with a low profit could indicate a product category with low margins or other issues. Hence, verifying the sales figure with a low margin may reveal inappropriate product categorization or other problems.

6. Modelling Strategy

Modelling strategy refers to the overall methodology and approach for the development, implementation and validation of a statistical or a machine learning model (Harrell, et al., 2012). It includes multiple steps like selection of model type, feature selection and engineering , model training , evaluation and refinement process. This leads to a structured and efficient analytical process ensuring that the developed model is reliable and accurate for making predictions.

For the same purpose, OLS (Ordinary Least Square) regression will be used to demonstrate the factors that most contribute to the profitability of this company. OLS regression is a fundamental statistical method widely used to find the estimations of relationships between a dependent variable and one or more independent variables , It's importance is recognized in various fields especially in retail sector (Liang & Wilhelmsson, 2011). We will be using OLS regression for the analysis because of the nature of the data and the variables involved. To successfully implement the model we will be following the steps as preparing data, creating feature matrix and target matrix, adding a constant to the feature matrix, fitting the OLS regression model , reviewing the model summary and interpreting the results by focusing on coefficients, p values and r squared value.

pip install scikit-learn

Run to view results

import numpy as np import pandas as pd # Separating datasets into training and testing subsets to evaluate the performance of the model from sklearn.model_selection import train_test_split # Converting categorical date into model-understandable numerical format and introducing complexity to linear model by adding polynomial features from sklearn.preprocessing import LabelEncoder, PolynomialFeatures # Implementing linear regression for quantitaive response from sklearn.linear_model import LinearRegression # Evaluating the performance of the regression model from sklearn.metrics import mean_squared_error, r2_score

Run to view results

pip install statsmodels

Run to view results

import statsmodels.api as sm from scipy import stats from sklearn.model_selection import train_test_split, cross_val_score from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error from statsmodels.stats.outliers_influence import variance_inflation_factor

Run to view results

# Selecting the numerical columns for regression quantitative_columns = ['Sales', 'Quantity', 'Discount'] # Selecting features and targets # Selecting features for the regression model based on the columns specified and selecting 'profit' as target variable X = df_cleaned[quantitative_columns] y = df_cleaned['Profit'] X = sm.add_constant(X) # Include an intercept term model = sm.OLS(y, X).fit() print(model.summary())

Run to view results

Model Fit and Significance: The R-squared of 0.320 indicates that approximately 32% of the variability in "Profit" can be explained by the models independent variables (Sales, Quantity and Discount). Although this shows a significant predictive power but a significant portion still remains unexplained. The value of adjusted R - squared which is nearly the same as R-squared suggests the model isn't complex. The value of F-statistic as 1119 with a Prob (F-statistic) as 0.00 indicates that the model is statistically significant indicating that the model provides a better fit than an intercept-only model, overall predicting that one of the predictors has a linear relationship with "Profit".

Coefficients: The value of constant coefficient of 7.54 suggests that, if all other variables held at 0, the base-line profit will be 7.53 units. The associated p-value with constant is less than 0.05, indicating its statistical significance. The p-value of sales below 0.05 affirms it's significance with 0.0793 sales coefficient indicating that holding other factors constant, a one unit increase in sales is associated with increase of approximately 0.0793 units in "Profit". . Quantity coefficient of 0.5136 shows that the increase in this variable by one unit, while everything else constant, brings about an increase in 'Profit' at the level of 1.5136 units. It is statistically significant. The value of -58.9176 Discount coefficient implies that an increase in "Discount" by one unit, whereas everything else stays constant, will cause a decline of "Profit" by 58.92 units. This strong negative correlation indicates the fact that the forfeit of margin on discounts can affect the profitability.

Implications for the Retail Superstore: The revenue and the volume could have a positive influence on the Profit, so the identification of processes and strategies to increase these could be effective. The fact that Discount had a negative effect on Profit justifies that the discount strategies should be evaluated very carefully so that the discount does not destroy the profitability. The diagnostics indicate that the assumption of normally distributed residuals could be the one that create problems and hence may affect our model reliability in the predictions and statistical tests. In addition, such discrepancies could be addressed by continuing research and possibly more sophisticated modeling techniques which would make the model have a more powerful explanatory power and accuracy.

Limitations of OSL Regression: OLS regression, however, is limited to mirror relationships and no-dependence as well as normal distributions of errors. The errors may not be constant, or homoscedasticity may not be valid all the time. For instance, issues of multicollinearity (a phenomenon where some variables are highly correlated) and sensitivity to outliers (also called heteroscedasticity) can also reduce the model’s accuracy (Greene, 2003), and interpretation of results needs to be done with caution because of the limitation of this method in directly inferring causation. However, the latter stress the fact that in case of complex data analysis there is a requirement to bear in mind thoughtful assumption testing and also selection of the additional analytical methods (Stock & Watson, 2015).

For increasing the predictability of model, we will use another model i.e. polynomial regression which will take more variables into consideration.

# Encoding categorical variables encoder = LabelEncoder() qualitative_variables = df_cleaned.select_dtypes(include='object').columns for col in qualitative_variables: df_cleaned[col] = encoder.fit_transform(df_cleaned[col]) # Select relevant numerical columns for the regression model numerical_columns = ['Sales', 'Quantity', 'Discount', 'Ship_mode', 'Segment', 'Category', 'Sub_category'] df_cleaned.loc[:, col] = encoder.fit_transform(df_cleaned[col]) y = df_cleaned['Profit'] # Create polynomial features poly = PolynomialFeatures(degree=2, include_bias=False) X_poly = poly.fit_transform(X) # Split the dataset into training and testing subsets # 30% of the dataset for testing purposes and 70% for training X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.3, random_state=42) # Fit the polynomial regression model model = LinearRegression() model.fit(X_train, y_train) # Predict on the test data y_pred = model.predict(X_test) # Evaluate the model performance mse = mean_squared_error(y_test, y_pred) r2 = r2_score(y_test, y_pred) print("Mean Squared Error:", mse) print("R-squared:", r2)

Run to view results

Mean Squared Error (MSE) of 181.0006: This metric measures on average the square of the difference between the observed true results and the simulation. It is often desirable to obtain lower MSE value, since it is an indicator of the predictions that are closer to the actual values. Thus, it is suggested, in this context, the model's predictions deviate from the observed 'Profit' values, on average, by a square root of nearly 13.45 units (the square root of 181.0006 is close to 13.45, and 'Profit' is the forecasting metric), taking into account.

R-squared of 0.4097 (approximately 41%): This statistic estimates percentage of variance explained in dependent variable ('Profit') by independent variable ('Sales', 'Quantity', 'Discount'). The R-square of about 41% infers that around 41 percent of the changes in 'Profit' could be associated with sales, quantity, and discount variations. This demonstrates the model has given a glimpse of the relationship we are trying to capture, but it also indicates that there is a noticeable portion of the variance that the model hasn't been able to explain.

Polynomial regression is likely a better option, since it can be used to model the non-linear relationships between the variables and hence is suitable to add to the linear regression (Ostertagová, 2012). It uses polynomial terms to build up a model with curves, such that the flexibility of the model is enhanced, and it can capture nonlinear relationships between variables (Bradley & Srivastava, 2012). Although it has been developed by the most sophisticated means, such it still remains the most simple in terms of coefficient interpretation and analysis since it is like the linear model. This kind of process outshines in element engineering, optimizing fit and prediction precision for non-linear datasets, and enables the observation of variable interactions and the discovery of the turning points in the data. Its holistic approach gives a deeper insight into the hidden data patterns which in turn makes it is a powerful tool for exploration of the data patterns.

7. Recommendations

Based on the analysis done above at multiple fronts, few recommendations are as followings in order for the organization to improve their profitability.

Segment Focused Strategies: The segment analysis showed us the top segment as "Consumer Segment". It is highly recommended to make segment specific strategies focusing on maximizing the profitability in this segment across all regions but at the same time identifying opportunities in other segments. Technology can be a booming segment for the company.

Regional Expansion: The company should focus on the low performing categories and subcategories across regions. Marketing and trade offers can be a useful tool in this respect.

Focus on the discounts: It is highly recommended to revise the discount policies and focus more on it's impact on the profits. The discount policies and pricing strategies should be evaluated every now and then. Focus on seasonality discounts rather than continuous ones.

Focus on data driven decisions should be encouraged. Strategize offers for the top performing customers.

8. References

Bradley, R. & Srivastava, S. .., 2012. Correlation in Polynomial Regression. The American Statistician, 33(1), pp. 11-14. Chatterjee, I. et al., 2021. Statistics-Based Outlier Detection and Correction Method for Amazon Customer Reviews. Entropy, 23(12). Emmanuel, T., Maupong, T., Semong, T. & Tabona, O., 2021. A survey on missing data in machine learning. Journal of Big Data, 8(140). Gandomi, A. & Haider, M., 2015. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), pp. 137-144. Harrell, F. et al., 2012. Regression modelling strategies for improved prognostic prediction. Statistics in Medicine , 3(2), pp. 143-152. Jesmeen, M., Hossen, S., Rahman, A. & Arif, E., 2018. A Survey on Cleaning Dirty Data Using Machine Learning Paradigm for Big Data Analytics. Indonesian Journal of Electrical Engineering and Computer Science, 10(3), pp. 1234 - 1243. Liang, J. & Wilhelmsson, M., 2011. The value of retail rents with regression models: a case study of Shanghai. Journal of Property Investment & Finance, 29(6). Ostertagová, E., 2012. Modelling using Polynomial Regression. Procedia Engineering, Volume 48, pp. 500-506.

.css-15w88e5{color:var(--chakra-colors-fg-neutral-primary);font-weight:inherit;letter-spacing:-0.09px;}1. Introduction

2. Importing Data Set

3. Data Cleaning and Filtering

3.1 - Missing Values

3.2 - Datetime Format

3.3 - Converting Boolean values (True/False) to Integers (1/0)

3.4 - Duplicate Data Identification

3.5 - Outliers Identification

3.6 - Identification and Removal of Outliers

3.6a - Removal of Outliers

4. Calculating Summary Statistics

4.1 - Quantitative Variables:

4.2 - Analyzing the Categories - Categorical Variables

4.3 - Correlation Matrix

5. Graphs

5.1: Categorical Chart (Bar Chart)

5.2 - Scatter Plot

6. Modelling Strategy

7. Recommendations

8. References

1. Introduction