Coursework 2

Name: Varkey Thomas Pottenkulam

HW ID: H00449858

Introduction

This analysis uses a real dataset to examine the complex relationship between various factors and their impact on the company's financial performance. By employing a rigorous data-cleaning process, visualisations, and modelling techniques, this report provides a holistic approach to understanding the drivers of profitability (Provost and Fawcett, 2013)

The analysis process begins with a thorough data cleaning and pre-processing phase, aimed at ensuring the accuracy and reliability of the dataset. The Interquartile Range (IQR) method is used to identify and eliminate any outliers present in the data, thus preventing any skewed results from affecting the analysis (Iglewicz and Hoaglin, 1993). This approach not only improves the quality of the data but also enables more precise and trustworthy insights to be derived from it. Furthermore, the Random Forest Regression model is employed, renowned for its ability to handle complex, non-linear relationships between variables, making it an ideal choice for predicting profitability (James et al., 2013).

Importing the Data

import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns from sklearn.model_selection import train_test_split from sklearn.metrics import mean_squared_error, r2_score # Loading dataset df = pd.read_csv('Superstore dataset.csv') df.describe()

Run to view results

Cleaning the Data and Removing Outliers

From the dataset, it is important to focus on removing outliers from numerical columns that directly impact the analysis, particularly those related to financial metrics and transaction volumes. The Interquartile Range (IQR) method was selected due to its strong resistance to outliers and its straightforward implementation. This method calculates the range between the first (Q1) and third (Q3) quartiles of the data, defining outliers as any values that lie more than 1.5 times the IQR below Q1 or above Q3. It defines outliers based on the quartile distribution of the data, which makes it less sensitive to extreme values than methods based on mean and standard deviation. This method is particularly effective for data that may not follow a normal distribution, a common characteristic of financial and sales data. Specifically, outliers were removed from the ‘Sales’ and ‘Profit’ columns, which are critical to understanding the factors influencing the company's profitability.

The dataset also underwent preliminary cleaning steps, including checks for null entries and duplicates, to ensure data integrity. Additionally, entries marked as returned were removed from the dataset. This step is crucial because returned orders do not contribute to net profitability; they can distort the analysis aimed at understanding factors that drive successful sales. By excluding these transactions, the dataset now more accurately reflects sales that positively impact the company's profits, providing a cleaner basis for identifying strategies to increase profitability.

# Check for missing values df.info() # Remove duplicates df.drop_duplicates(inplace=True) # Using IQR method to filter outliers def remove_outliers(df, column_names): cleaned_df = df.copy() for column in column_names: Q1 = cleaned_df[column].quantile(0.25) Q3 = cleaned_df[column].quantile(0.75) IQR = Q3 - Q1 lower_bound = Q1 - 1.5 * IQR upper_bound = Q3 + 1.5 * IQR # Filtering out the outliers before_rows = cleaned_df.shape[0] cleaned_df = cleaned_df[(cleaned_df[column] >= lower_bound) & (cleaned_df[column] <= upper_bound)] after_rows = cleaned_df.shape[0] print(f"Removed {before_rows - after_rows} outliers from {column}.") return cleaned_df # Columns to check for outliers columns_to_check = ['Sales', 'Profit'] # Removing outliers cleaned_df = remove_outliers(df, columns_to_check) # Directly remove returned entries before_rows = cleaned_df.shape[0] cleaned_df = cleaned_df[cleaned_df['Returned'] != True] after_rows = cleaned_df.shape[0] print(f"Removed {before_rows - after_rows} returned entries.") # Summary of the cleaned data cleaned_df.describe()

Run to view results

After the cleaning process, a significant portion of the dataset was retained, indicating a balanced approach to outlier removal that preserved the bulk of the data for analysis. The summary statistics of the cleaned dataset reveal insightful trends and distributions across the remaining data. For instance, average sales and profits can now be interpreted with greater confidence, reflecting a more accurate depiction of the company's typical transactions. Sales and profit ranges, along with measures like mean and median, reveal the company's financial health and operational efficiency.

Data Visualisation

Visualising the data provides compelling insights into the underlying trends and patterns that numbers alone cannot convey. Through the charts and graphs, we gain insights into the sales performance, product category preferences, and regional market strengths within the dataset. They bridge data and decision-making, showing success and growth opportunities in an accessible way.

The bar chart below shows us the distribution of products across different categories, providing insights into which categories are most prevalent in the dataset.

import seaborn as sns import matplotlib.pyplot as plt # Bar chart for product categories plt.figure(figsize=(10, 6)) sns.countplot(x='Category', data=cleaned_df) plt.title('Distribution of Product Categories') plt.xlabel('Category') plt.ylabel('Frequency') plt.xticks(rotation=45) plt.show()

Run to view results

This plot can help us understand which product categories are most common. The bar chart displays the frequency of products sold across three categories: Office Supplies, Furniture, and Technology. Office Supplies are the most frequently sold products, with sales figures significantly higher than the other two categories. Furniture and Technology have similar sales frequencies, both considerably lower than Office Supplies. This could suggest that customers purchase Office Supplies more frequently. This distribution indicates potential market focus areas and might influence inventory management, marketing strategies, and sales forecasting.

The following chart will aggregate sales by 'Region_no', showing which regions contribute most to sales.

# Visualising Sales per 'Region_no' plt.figure(figsize=(10, 6)) sales_per_region = cleaned_df.groupby('Region_no')['Sales'].sum().reset_index() sns.barplot(x='Region_no', y='Sales', data=sales_per_region) plt.title('Sales per Region') plt.xlabel('Region Number') plt.ylabel('Total Sales') plt.show()

Run to view results

This chart provides a clear view of the sales performance across different regions. The bar chart indicates total sales distribution across four distinct regions. Region 2 outperforms the others, suggesting a strong market presence or effective sales strategies in that area. Region 1 has the lowest sales, which might imply challenges or untapped market potential. The disparity between the regions suggests that targeted strategies could be necessary to bolster sales in the underperforming regions. Moreover, the chart could prompt a review of Region 2's practices to identify successful tactics that might be replicated in Regions 1, 3, and 4 to enhance overall sales.

The scatter plot below explores the relationship between sales and profit, which is crucial for understanding profitability dynamics.

# Scatter plot for Sales vs. Profit plt.figure(figsize=(10, 6)) sns.scatterplot(x='Sales', y='Profit', data=cleaned_df) plt.title('Sales vs. Profit') plt.xlabel('Sales') plt.ylabel('Profit') plt.show()

Run to view results

This scatter plot helps identify patterns or correlations between sales and profit. The scatter plot of Sales vs. Profit suggests a positive relationship between sales and profit up to a point, after which profit increases at a slower rate.This could indicate diminishing returns on higher sales volumes. There is a noticeable trend of transactions with negative profits across various sales levels, highlighting the occurrence of losses or low-margin sales. Clusters of data points near the origin reveal that most sales transactions are of lower value, with a mix of profitable and unprofitable sales.

Modelling Strategy

The random forest regression method was selected to advance the exploration of data through modelling (Tan et al., 2005). This learning technique is particularly adept at predicting outcomes like profitability by constructing numerous decision trees and averaging their predictions. Such an approach is particularly advantageous in our context, where the interactions between variables like sales, quantity, and discount are complex and potentially non-linear.

The choice of Random Forest is informed by its ease of handling a variety of data types. It offers deep insights through its feature importance rankings, which is extremely valuable for identifying the factors that have the greatest impact on profit (Kuhn and Johnson, 2013). Additionally, it has the advantage of reducing variance and avoiding overfitting that simpler models may experience.

import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.ensemble import RandomForestRegressor from sklearn.model_selection import train_test_split from sklearn.metrics import mean_squared_error, r2_score # Selecting features and target variable X = cleaned_df[['Sales', 'Quantity', 'Discount', 'Category_no', 'Sub-Category_no']] y = cleaned_df['Profit'] # Split the data into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=48) # Create and fit the Random Forest Regression model rf_model = RandomForestRegressor(n_estimators=95, random_state=55) rf_model.fit(X_train, y_train) # Make predictions on the test set y_pred = rf_model.predict(X_test) # Evaluating the model rmse = np.sqrt(mean_squared_error(y_test, y_pred)) r2 = r2_score(y_test, y_pred) print(f"Root Mean Squared Error (RMSE): {rmse}") print(f"R-squared: {r2}") # Interpreting the results print("Feature Importances:") for feature, importance in zip(X.columns, rf_model.feature_importances_): print(f"{feature}: {importance}") # Residual Analysis Plot residuals = y_test - y_pred plt.figure(figsize=(10, 6)) plt.scatter(y_pred, residuals, alpha=0.5) plt.axhline(y=0, color='r', linestyle='--') plt.xlabel('Predicted Values') plt.ylabel('Residuals') plt.title('Residual Plot') plt.show()

Run to view results

The Random Forest model has an R-squared of 0.8487, which means that it can explain approximately 84.87% of the profit variability. This suggests a relatively strong fit to the data. The Root Mean Squared Error (RMSE) of 5.51 indicates, on average, how much the predictions deviate from the actual profit values in the same units of the target variable.

From the Random Forest model, we can see that ‘Sales’ has the highest importance score, indicating it is the most significant predictor of profit. This aligns with financial theory which suggests that sales volume is a primary driver of profit margins. ‘Discount’ also shows substantial importance, though less than sales, implying that the level of discounting has a sizable impact on profitability. Sub-Category_no’ and ‘Quantity’ have lower importance scores compared to ‘Sales’ and ‘Discount’, yet they are still meaningful contributors to profit prediction. This suggests that different sub-categories have varying profitability and that the quantity sold also plays a role, albeit smaller than sales and discounts.

The residual plot shows a relatively even distribution of residuals above and below the zero line. However, the spread of residuals seems to increase as the predicted values increase. This might mean that for higher sales values, the model becomes less precise. By focusing solely on numerical predictors, we have streamlined our analysis but at the expense of potentially overlooking rich insights that categorical variables might offer.

For the company, the clear indicator is to focus on sales enhancement strategies while monitoring the impact of discounts. Given the significant impact of ‘Sales’, efforts such as market expansion, sales promotions, and customer retention could be beneficial. A nuanced discounting strategy that does not undercut profitability could also be developed. Furthermore, the insights from the ‘Sub-Category_no’ highlight the importance of product mix optimisation. The company should analyse the profitability of each sub-category in more detail to prioritise high-margin products.

Conclusion

When presenting these findings to the client, it's important to maintain transparency regarding these analytical boundaries. While the model sheds considerable light on the elements that drive profitability, its interpretation should be framed within these acknowledged constraints. To ensure the model remains reliable for business strategy, ongoing refinement and updates with new data are necessary. This allows the company to pivot and adapt with confidence backed by data.

In conclusion, the Random Forest model has demonstrated its utility in identifying key profitability drivers. With a strong R-squared value, it proves to be a reliable tool for the company's forecasting and strategic planning (Kuhn and Johnson, 2013). Moving forward, the company is well-equipped to leverage these insights to enhance its sales strategies, optimise its discounting policies, and prioritise high-margin product offerings. By continuously iterating and refining its analytical approach, the company can position itself for sustained growth and profitability.

References

Iglewicz, B. and Hoaglin, D.C. (1993). How to detect and handle outliers. Milwaukee, WI: ASQC Quality Press.

James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An introduction to statistical learning: with applications in R. New York, NY: Springer.

Kuhn, M. and Johnson, K. (2013). Applied predictive modeling. New York, NY: Springer.

Provost, F. and Fawcett, T. (2013). Data science for business: What you need to know about data mining and data-analytic thinking. Sebastopol, CA: O'Reilly Media.

Tan, P.N., Steinbach, M. and Kumar, V. (2005). Introduction to data mining. Boston, MA: Pearson Addison Wesley.

.css-15w88e5{color:var(--chakra-colors-fg-neutral-primary);font-weight:inherit;letter-spacing:-0.09px;}Coursework 2

Introduction

Importing the Data

Cleaning the Data and Removing Outliers

Data Visualisation

Modelling Strategy

Conclusion

References

Coursework 2