Big Data Analytics Coursework 2

Radhika Patil - 13-03-2024

Introduction

This consultancy report aims to drive profitability for the given Superstore. The analysis examines the superstore's historical sales records, combining analytical techniques with domain understanding to turn complex data patterns into actionable insights and to identify the factors that substantially influence profitability. The data has been cleansed and examined to highlight the elements that drive profit and to identify underperforming areas where focused improvements can facilitate growth.

#importing necessary packages
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import PolynomialFeatures
from datetime import datetime
from sklearn.tree import DecisionTreeRegressor

#importing data "dataset_Superstore.csv" and saving it to object superstore
superstore = pd.read_csv("dataset_Superstore.csv")

Data Preparation

#renaming the columns, replacing ' '(space) with '_'(underscore) for easy column accessibility
superstore = superstore.rename(columns={
    'Row ID': 'Row_ID',
    'Order ID': 'Order_ID',
    'Order Date': 'Order_Date',
    'Ship Date': 'Ship_Date',
    'Ship Mode': 'Ship_Mode',
    'Customer ID': 'Customer_ID',
    'Customer Name': 'Customer_Name',
    'Postal Code': 'Postal_Code',
    'Product ID': 'Product_ID',
    'Sub-Category': 'Sub_Category',
    'Sub-Category_no': 'Sub_Category_no',
    'Product Name': 'Product_Name',
    'Product Name_no': 'Product_Name_no',
})

#data type conversion of date fields
superstore['Order_Date'] = pd.to_datetime(superstore['Order_Date'], dayfirst=True)
superstore['Ship_Date'] = pd.to_datetime(superstore['Ship_Date'], dayfirst=True)

Data Cleaning

superstore.info()  #Checking Data Types
superstore.isnull().sum()  #Checking for null values

There are no missing values in the data. The data types seem to be accurate.

#Checking unique values for data entry errors
superstore['Country'].unique()
superstore['State'].unique()
superstore['City'].unique()
superstore['Sub_Category'].unique()

Column 'Country' can be ignored as it has only one unique value. There do not appear to be any data entry errors.

Outliers

We move on to outlier detection and removal as this step is crucial for maintaining data integrity and model accuracy, making the analyses more reliable.

#Scatter plot for identifying outliers in Quantity
fig, ax = plt.subplots()
ax.scatter(superstore['Quantity'], superstore['Row_ID'])
plt.title('Scatter plot Quantity vs Rows (Sales Instances)')
plt.xlabel('Quantity')
plt.ylabel('Row_ID')
plt.show()

There are clear outliers where quantity is 10000.

#Handling Quantity outliers
#count of outlier rows
print(len(superstore[superstore['Quantity'] > 2000]))
#removing Quantity outliers based on observations from above scatter plot
#(<= keeps the filter an exact complement of the count above)
superstore = superstore[superstore['Quantity'] <= 2000]
#scatter plot after removing Quantity outliers
fig, ax = plt.subplots()
ax.scatter(superstore['Quantity'], superstore['Row_ID'])
plt.title('Scatter plot Quantity vs Rows (Sales Instances)')
plt.xlabel('Quantity')
plt.ylabel('Row_ID')
plt.show()

The plot's density appears to be uniform across different quantities, but there are some quantities with notably more transactions, as indicated by denser clusters of points (e.g., quantities 2, 3, 5, and 7).

#Identifying outliers in profit
fig, ax = plt.subplots()
ax.scatter(superstore['Profit'], superstore['Row_ID'])
plt.title('Scatter plot Profit vs Rows (Sales Instances)')
plt.xlabel('Profit Units')
plt.ylabel('Row_ID')
plt.show()

The presence of outliers is evident from the graph above.

#Binning profit data to study its distribution for handling outliers
data = superstore['Profit']
df = pd.DataFrame(data)
# Binning the 'profit' data into categories
bins = [-500, -400, -300, -200, -100, 0, 100, 200, 300, 400, 500]
labels = [-500, -400, -300, -200, -100, 0, 100, 200, 300, 400]
df['profit_category'] = pd.cut(df['Profit'], bins=bins, labels=labels)
# Creating a histogram of the 'profit' data
plt.hist(df['Profit'], bins=bins, edgecolor='black')
plt.title('Profit Distribution')
plt.xlabel('Profit Units')
plt.ylabel('Frequency')
plt.xticks(bins)
plt.show()

The data appears to be roughly symmetric, close to a normal distribution but with a very high peak around zero. Therefore, the mean is used as the centre for detecting outliers.

#Handling Profit outliers
#Mean value for Profit (computed once and reused below)
profit_mean = superstore['Profit'].mean()
print(profit_mean)
#count of outlier rows
print(len(superstore[(superstore['Profit'] < profit_mean - 2000) |
                     (superstore['Profit'] > profit_mean + 2000)]))
#Removing outliers from dataset
superstore = superstore[(superstore['Profit'] > profit_mean - 2000) &
                        (superstore['Profit'] < profit_mean + 2000)]
#Profit vs Row_ID Scatter Plot
fig, ax = plt.subplots()
ax.scatter(superstore['Profit'], superstore['Row_ID'])
plt.title('Scatter plot Profit vs Rows (Sales Instances)')
plt.xlabel('Profit Units')
plt.ylabel('Row_ID')
plt.show()

There's a dense concentration of points around the profit range slightly above and below zero units. This suggests that most transactions result in a small profit or loss.

#Identifying outliers in Discount
fig, ax = plt.subplots()
ax.scatter(superstore['Discount'], superstore['Row_ID'])
plt.title('Scatter plot Discount vs Rows (Sales Instances)')
plt.xlabel('Discount')
plt.ylabel('Row_ID')
plt.show()

There seems to be noisy data with discounts exceeding 100%.

#Handling noisy data in Discounts
#Count of noisy data rows (discount above 100%)
print(len(superstore[superstore['Discount'] > 1]))
#Cleaning superstore data for discount noise
#(<= 1 removes only the rows counted above; a discount of exactly 100% is kept)
superstore = superstore[superstore['Discount'] <= 1]
#Discount vs Row_ID Scatter Plot
fig, ax = plt.subplots()
ax.scatter(superstore['Discount'], superstore['Row_ID'])
plt.title('Scatter plot Discount vs Rows (Sales Instances)')
plt.xlabel('Discount')
plt.ylabel('Row_ID')
plt.show()

The scatter of points at higher discount levels, such as 0.5 (50%) and above, suggests that these substantial discounts are not rare occurrences. Such heavy discounting could potentially have a large impact on profitability. There is also a significant number of sales with no discount applied, as indicated by the dense vertical band at the 0.0 discount level.

#Looking for outliers in Sales data
#Sales vs Row_ID Scatter Plot
fig, ax = plt.subplots()
ax.scatter(superstore['Sales'], superstore['Row_ID'])
plt.title('Scatter plot Sales vs Rows (Sales Instances)')
plt.xlabel('Sales Units')
plt.ylabel('Row_ID')
plt.show()

Outliers seem to be present after 5000 units of sales.

#Binning sales data to study its distribution for handling outliers
data = superstore['Sales']
df = pd.DataFrame(data)
# Binning the 'Sales' data into categories
bins = [0, 100, 200, 300, 400, 500]
labels = [0, 100, 200, 300, 400]
df['Sales_category'] = pd.cut(df['Sales'], bins=bins, labels=labels)
# Creating a histogram of the 'Sales' data
plt.hist(df['Sales'], bins=bins, edgecolor='black')
plt.title('Sales Distribution')
plt.xlabel('Sales Units')
plt.ylabel('Frequency')
plt.xticks(bins)
plt.show()

As the distribution is skewed, the median is used as the reference point for outlier handling.

#Handling outliers in Sales
#Median (computed once and reused below)
sales_median = superstore['Sales'].median()
print(sales_median)
#count of sales outlier
print(len(superstore[superstore['Sales'] > sales_median + 5000]))
#removing outliers
superstore = superstore[superstore['Sales'] < sales_median + 5000]
#Sales vs Row_ID Scatter Plot
fig, ax = plt.subplots()
ax.scatter(superstore['Sales'], superstore['Row_ID'])
plt.title('Scatter plot Sales vs Rows (Sales Instances)')
plt.xlabel('Sales Units')
plt.ylabel('Row_ID')
plt.show()

The concentration of points at the lower end of the sales range might suggest focusing on strategies to either increase the frequency of higher-value sales or enhance the profitability of the more common, lower-value transactions.
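The cut-offs used in this section (2000 for Quantity, mean ± 2000 for Profit, median + 5000 for Sales) were chosen by eye from the scatter plots. As a possible cross-check, a quartile-based rule (Tukey's fences) derives the cut-offs from the data itself. The sketch below is not the method used in this report; the sales figures are hypothetical and the multiplier `k=1.5` is the conventional default, assumed here for illustration.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical, right-skewed sales amounts with one extreme value
sales = [12.0, 15.5, 20.0, 22.4, 30.0, 41.9, 55.0, 80.0, 120.0, 9000.0]

lower, upper = iqr_bounds(sales)
kept = [s for s in sales if lower <= s <= upper]
flagged = [s for s in sales if s < lower or s > upper]

print(f"fences: ({lower:.2f}, {upper:.2f})")
print("flagged as outliers:", flagged)
```

Because quartiles are barely affected by a few extreme values, the fences do not shift when the outliers themselves are very large, which is the main reason to prefer them over mean-based limits on skewed data.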

Understanding Data

Summary Statistics

#Summary statistics for Continuous Features
superstore[['Sales', 'Quantity', 'Discount', 'Profit']].describe()

Sales: A relatively high standard deviation of 424.50 compared to the mean of 209.43 suggests significant variability in sales amounts, with some transactions being much higher or lower than the average. The 25th, 50th (median), and 75th percentiles suggest a right-skewed distribution of sales, with most transactions being on the lower end of the scale.

Quantity: On average, each transaction includes about 3.78 items with a small standard deviation of 2.22 relative to the mean which suggests most transactions include a small number of items.

Discount: Discounts range from 0% (no discount) to 80% (a significant discount), which could indicate clearance sales or special promotions.

Profit: The large standard deviation of 132.10 compared to the mean profit of 25.50 suggests high variability in profitability, from losses to significant gains. The 50th percentile (median) is lower than the mean, which again indicates a right-skewed distribution with a few high-profit transactions pulling the average above the median.
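The reasoning above, that a mean well above the median points to a right-skewed distribution, can be illustrated on a tiny example. This is a minimal sketch using hypothetical profit values, not figures from the superstore data; Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, is included as one simple way to put a number on the gap.

```python
import statistics

# Hypothetical per-transaction profits: many small values and one large one
profits = [2.0, 3.5, 4.0, 5.0, 6.5, 8.0, 9.0, 250.0]

mean_profit = statistics.mean(profits)
median_profit = statistics.median(profits)
std_profit = statistics.stdev(profits)

# Pearson's second skewness coefficient: positive when the mean exceeds the median
skew_indicator = 3 * (mean_profit - median_profit) / std_profit

print(f"mean={mean_profit}, median={median_profit}")
print(f"skewness indicator={skew_indicator:.2f}")
```

A single large transaction pulls the mean far above the median, which is the same pattern described for Sales and Profit in the summary statistics.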

#Summary statistics for Categorical Features
superstore[['Ship_Mode', 'Segment', 'City', 'State', 'Postal_Code',
            'Region', 'Product_ID', 'Category', 'Sub_Category',
            'Product_Name']].describe(include='object').transpose()

The summary suggests a focus on ‘Standard Class’ shipping and the ‘Consumer’ segment to maximise profits, as they are the most frequent. Leveraging ‘Office Supplies’ in high-volume locations like ‘California’ and ‘New York City’ could be key. Prioritising popular product offerings such as ‘Binders’ and ‘Staple envelopes’ may improve profitability.

Exploratory Data Analysis

Evaluating Product Performance

#Top Profitable Product Subcategories
a = superstore[['Sub_Category', 'Sales', 'Quantity', 'Profit']].groupby('Sub_Category').sum().reset_index()
a.sort_values('Profit', ascending=False).head(5)

Most sub-categories are profitable, with two showing exceptionally high profits. Phones and Accessories stand out with the highest profit; they appear to be key drivers of profitability and should be a focus for sales and marketing efforts.

#Loss Making Product Subcategories
a = superstore[['Sub_Category', 'Sales', 'Quantity', 'Profit']].groupby('Sub_Category').sum().reset_index()
a[a['Profit'] < 0].sort_values('Profit', ascending=True)

Conversely, some sub-categories are significantly underperforming and incur large losses. These products require immediate attention to identify issues related to cost, pricing, or demand.

Strategic actions could include promoting high-profit sub-categories, re-evaluating the pricing strategy, and possibly discontinuing or revamping the loss-making sub-category to optimise overall profit.

Highlighting Loss making locations

#Highlighting Loss making locations
#Selecting the 'Profit' column before summing, rather than passing it to sum()
print('Major loss making states')
a = superstore.groupby('State')['Profit'].sum().reset_index()
print(a[a['Profit'] < 0].sort_values('Profit', ascending=True).head(3))
print('Major loss making cities')
a = superstore.groupby('City')['Profit'].sum().reset_index()
print(a[a['Profit'] < 0].sort_values('Profit', ascending=True).head(3))
print('Major loss making Post Code')
a = superstore.groupby('Postal_Code')['Profit'].sum().reset_index()
print(a[a['Profit'] < 0].sort_values('Profit', ascending=True).head(3))

It is important to recognise loss making areas so that campaigns or promotional activities can be carried out there to expand business and sustain profits.

Studying Categorical variables against profit

#'Segment', 'Region', and 'Category'
# Aggregate profit for each 'Segment', 'Region', and 'Category'
profit_by_segment = superstore.groupby('Segment')['Profit'].sum().reset_index()
profit_by_region = superstore.groupby('Region')['Profit'].sum().reset_index()
profit_by_category = superstore.groupby('Category')['Profit'].sum().reset_index()
# Set up the matplotlib figure and axes
fig, axs = plt.subplots(1, 3, figsize=(18, 6))
# Bar graph for Profit vs Segment
axs[0].bar(profit_by_segment['Segment'], profit_by_segment['Profit'], color='skyblue')
axs[0].set_title('Profit by Segment')
axs[0].set_xlabel('Segment')
axs[0].set_ylabel('Profit Units')
axs[0].tick_params(axis='x', rotation=45)
# Bar graph for Profit vs Region
axs[1].bar(profit_by_region['Region'], profit_by_region['Profit'], color='lightgreen')
axs[1].set_title('Profit by Region')
axs[1].set_xlabel('Region')
axs[1].tick_params(axis='x', rotation=45)
# Bar graph for Profit vs Category
axs[2].bar(profit_by_category['Category'], profit_by_category['Profit'], color='salmon')
axs[2].set_title('Profit by Category')
axs[2].set_xlabel('Category')
axs[2].tick_params(axis='x', rotation=45)
# Adjust layout so labels and titles do not overlap
plt.tight_layout()
# Show the plot
plt.show()

The data reveals the Consumer segment, West region, and Technology category as the most profitable areas, suggesting targeted investment and expansion there.

The Corporate segment and East region show potential for growth, warranting further analysis and tailored strategies.

The Home Office segment, Central and South regions, and Furniture and Office Supplies categories lag behind, necessitating a review of operations, cost structures, and market strategies.

Region wise Category analysis for Profits

#Region wise Category analysis for Profits
furniture_profit = superstore[superstore['Category'] == 'Furniture'].groupby('Region')['Profit'].sum()
office_supplies_profit = superstore[superstore['Category'] == 'Office Supplies'].groupby('Region')['Profit'].sum()
technology_profit = superstore[superstore['Category'] == 'Technology'].groupby('Region')['Profit'].sum()
# Set up the matplotlib figure and axes
fig, axs = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
# First bar graph for Furniture
axs[0].bar(furniture_profit.index, furniture_profit.values, color='skyblue')
axs[0].set_title('Furniture Profit by Region')
axs[0].set_xlabel('Region')
axs[0].set_ylabel('Profit')
axs[0].tick_params(axis='x', rotation=45)
# Second bar graph for Office Supplies
axs[1].bar(office_supplies_profit.index, office_supplies_profit.values, color='lightgreen')
axs[1].set_title('Office Supplies Profit by Region')
axs[1].set_xlabel('Region')
axs[1].tick_params(axis='x', rotation=45)
# Third bar graph for Technology
axs[2].bar(technology_profit.index, technology_profit.values, color='salmon')
axs[2].set_title('Technology Profit by Region')
axs[2].set_xlabel('Region')
axs[2].tick_params(axis='x', rotation=45)
# Adjust layout so labels and titles do not overlap
plt.tight_layout()
# Show the plot
plt.show()

The West region consistently shows strength across all categories, hinting at a successful regional strategy.

The disparity in profits, especially in office supplies, suggests opportunities for cross-regional learning and strategy adaptation. The furniture category may need a comprehensive review of cost, sales strategy, or customer preference to enhance profitability.

Technology's balanced profit suggests stable demand but also hints at the potential for targeted growth strategies in specific regions.

The Furniture category in the Central region appears to be a problem area: it is loss-making and warrants review.

Region wise Segment analysis for Profits

#Region wise Segment analysis for Profits
consumer_profit = superstore[superstore['Segment'] == 'Consumer'].groupby('Region')['Profit'].sum()
corporate_profit = superstore[superstore['Segment'] == 'Corporate'].groupby('Region')['Profit'].sum()
home_office_profit = superstore[superstore['Segment'] == 'Home Office'].groupby('Region')['Profit'].sum()
# Set up the matplotlib figure and axes
fig, axs = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
# First bar graph for Consumer
axs[0].bar(consumer_profit.index, consumer_profit.values, color='skyblue')
axs[0].set_title('Consumer Profit by Region')
axs[0].set_xlabel('Region')
axs[0].set_ylabel('Profit')
axs[0].tick_params(axis='x', rotation=45)
# Second bar graph for Corporate
axs[1].bar(corporate_profit.index, corporate_profit.values, color='lightgreen')
axs[1].set_title('Corporate Profit by Region')
axs[1].set_xlabel('Region')
axs[1].tick_params(axis='x', rotation=45)
# Third bar graph for Home Office
axs[2].bar(home_office_profit.index, home_office_profit.values, color='salmon')
axs[2].set_title('Home Office Profit by Region')
axs[2].set_xlabel('Region')
axs[2].tick_params(axis='x', rotation=45)
# Adjust layout so labels and titles do not overlap
plt.tight_layout()
# Show the plot
plt.show()

The West region is a strong performer across all segments, indicating effective regional strategies.

Consumer and Corporate segments show potential for growth in the Central and South regions.

The more balanced distribution in the Home Office segment suggests different market dynamics or competitive advantages that may be unique to that segment. These insights could guide regional strategy optimisations and resource allocations.

Derived Features

Derived features, engineered from existing data, are crucial for enhancing machine learning model performance. They can unveil hidden insights by capturing additional information not represented by the raw features, thereby improving model accuracy and aiding in the discovery of more complex patterns in data.

#Creating Derived Features
#Days to Ship
superstore['Shipping_Time'] = (superstore['Ship_Date'] - superstore['Order_Date']).dt.days
#Profit Margin: A feature representing the profit margin on each sale could be useful.
#It can be calculated as the profit divided by the sales for each transaction.
superstore['Profit_Margin'] = superstore['Profit'] / superstore['Sales']
#Profit per Customer
superstore['Profit per Customer'] = superstore.groupby('Customer_ID')['Profit'].transform('sum')
#Sales per Order:
superstore['Sales_per_Order'] = superstore.groupby('Order_ID')['Sales'].transform('sum')
#Discount Rate
superstore['Discount Rate'] = superstore['Discount'] / superstore['Sales']
#Order Size:
superstore['Order_Size'] = superstore.groupby('Order_ID')['Quantity'].transform('sum')
#Product Popularity
superstore['Product_Popularity'] = superstore['Product_ID'].map(superstore['Product_ID'].value_counts())
#Average Discount per Category:
superstore['Avg_Discount_per_Category'] = superstore.groupby('Category')['Discount'].transform('mean')

Correlation

Raw Continuous Variable Correlation with Profit

# Set up the matplotlib figure and axes, specifying the number of rows and columns
fig, axs = plt.subplots(1, 3, figsize=(15, 5))  # 1 row, 3 columns
# First scatter plot
axs[0].scatter(superstore['Sales'], superstore['Profit'])
axs[0].set_title('Sales vs Profit')
axs[0].set_xlabel('Sales')
axs[0].set_ylabel('Profit')
# Second scatter plot
axs[1].scatter(superstore['Discount'], superstore['Profit'])
axs[1].set_title('Discount vs Profit')
axs[1].set_xlabel('Discount')
axs[1].set_ylabel('Profit')
# Third scatter plot
axs[2].scatter(superstore['Quantity'], superstore['Profit'])
axs[2].set_title('Quantity vs Profit')
axs[2].set_xlabel('Quantity')
axs[2].set_ylabel('Profit')
# Adjust layout so labels do not overlap
plt.tight_layout()
# Show the plot
plt.show()

Sales vs Profit: Higher sales seem to correlate with increased profit, but there are instances of high sales with low or negative profit, suggesting that higher sales do not always guarantee higher profits.

Discount vs Profit: There is no clear trend suggesting that higher discounts lead to higher profits. In fact, larger discounts seem to occasionally result in significant losses.

Quantity vs Profit: Similar to discounts, there isn’t a straightforward correlation between quantity sold and profit. While higher quantity sales often have positive profit, some high-quantity sales result in losses.

Derived Continuous Variable Correlation with Profit

sns.heatmap(superstore[['Profit', 'Shipping_Time', 'Product_Popularity',
                        'Profit_Margin', 'Sales_per_Order', 'Order_Size']].corr())
plt.show()

->There is a strong positive correlation between profit and profit margin, which is expected since higher margins typically lead to higher profits.
->Product popularity has a moderately negative correlation with profit and profit margin, suggesting that more popular products might not always be the most profitable, possibly due to competitive pricing.
->Sales per order have a noticeable positive correlation with order size, indicating that larger orders tend to generate higher sales.
->Shipping time does not show a strong correlation with profit, suggesting that the efficiency or speed of shipping may not significantly impact profitability.

This information can guide strategies to enhance profit margins and reconsider the product mix to boost profitability, potentially focusing on less popular but more profitable items and targeting larger order sizes.
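When reading the heatmap, it helps to remember that the values from `.corr()` are Pearson coefficients by default, which capture only linear association. The sketch below computes the coefficient by hand on hypothetical numbers (not the superstore data) and shows that a perfectly determined but non-linear relationship can still score zero, so a weak correlation with profit does not by itself rule out an effect.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical values chosen to make the point clear
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2 * v + 1 for v in x]   # exact linear relationship
quadratic = [v ** 2 for v in x]   # exact but non-linear relationship

r_linear = pearson_r(x, linear)
r_quadratic = pearson_r(x, quadratic)
print(f"linear: {r_linear:.3f}, quadratic: {r_quadratic:.3f}")
```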

Categorical Variable Correlation with Profit

sns.boxplot(x='Category', y='Profit', data=superstore)
plt.title('Profit Distribution by Category')
plt.show()

The spread of the profit data within each category, particularly the inter-quartile range, suggests there is significant variability in profitability across categories. This could imply that the category a product belongs to might influence the variability of its profitability.

Technology is potentially the most profitable category but with a high variability, suggesting some sales are highly profitable while others may not be. Furniture has the lowest median profitability and also a wide range of profit outcomes, indicating inconsistency in profits. Office Supplies, while not achieving the high-profit values of Technology, indicates steady and consistent profits. To improve overall profitability, strategies could focus on increasing sales of high-margin technology products while improving the profit consistency in Furniture and maintaining the steady gains in Office Supplies.

Modelling

Customer Analysis using RFM

Customers are clustered to understand how different groups influence profit, and by how much.

# Setting the current date to the day after the latest date in dataset for recency calculations
current_date = superstore['Order_Date'].max() + pd.Timedelta(days=1)
# Calculating Recency, Frequency, Monetary values for each customer
rfm = superstore.groupby('Customer_ID').agg({
    'Order_Date': lambda x: (current_date - x.max()).days,
    'Order_ID': 'nunique',
    'Sales': 'sum',
    'Profit': 'sum'
})
# Rename the columns
rfm.rename(columns={'Order_Date': 'Recency', 'Order_ID': 'Frequency', 'Sales': 'Monetary'}, inplace=True)
# Reset index to put Customer_ID back as a column
rfm.reset_index(inplace=True)
# Display the first few rows of the RFM table
print(rfm.head())

#Preparing dataframe
features = ['Recency', 'Frequency', 'Monetary']
df = rfm[features].copy()  # copy, so that adding 'Cluster' does not write to a slice of rfm
# Scaling the data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[features])
# Determining the optimal number of clusters using the Elbow Method
inertia = []
k_values = range(1, 10)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_features)
    inertia.append(kmeans.inertia_)
# Plot the elbow graph
plt.plot(k_values, inertia, '-o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()
# Choosing the k that seems best from the elbow graph
optimal_k = 3
# Run K-means clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
df['Cluster'] = kmeans.fit_predict(scaled_features)
rfm['Cluster'] = kmeans.labels_  # reuse the fitted labels instead of fitting a second time
# Checking each cluster size
print(df['Cluster'].value_counts())
#Aggregating features within each cluster to see the average values
cluster_profiles = df.groupby('Cluster')[features].mean()
print(cluster_profiles)

from sklearn.metrics import calinski_harabasz_score
ch_index = calinski_harabasz_score(scaled_features, kmeans.labels_)
print(f'Calinski-Harabasz Index: {ch_index}')

The Calinski-Harabasz Index, also known as the Variance Ratio Criterion, is the ratio of between-cluster dispersion to within-cluster dispersion, each scaled by its degrees of freedom. Higher values generally indicate a model with better-defined clusters. The index has no fixed scale, so the score of 523.91 is most informative when compared with the scores for other values of k on the same data; read alongside the elbow plot, it suggests that the clustering model has done a reasonably good job of creating well-defined and separated clusters for the dataset.
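For reference, the standard definition of the index can be written out as follows. The symbols are introduced here for illustration and do not appear elsewhere in this report: $n$ is the number of customers, $k$ the number of clusters, $C_j$ the set of points in cluster $j$ with size $n_j$ and centroid $c_j$, and $c$ the centroid of all points.

```latex
\mathrm{CH}(k) = \frac{B_k / (k - 1)}{W_k / (n - k)},
\qquad
B_k = \sum_{j=1}^{k} n_j \,\lVert c_j - c \rVert^2,
\qquad
W_k = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2
```

Tight clusters make $W_k$ small and well-separated centroids make $B_k$ large, so both push the index up. Because the value also depends on $n$ and on how the features are scaled, it is better suited to ranking candidate values of $k$ than to judging a single clustering in isolation.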

Studying customer relation with profit.

sns.boxplot(x='Cluster', y='Profit', data=rfm)
plt.title('Profit Distribution by Cluster')
plt.show()

Cluster 0's low profit implies customers with infrequent or small purchases, suggesting a need for strategies to boost their spending.

Cluster 1 shows variable profits, with certain customers making occasional high-value purchases, indicating an opportunity for targeted promotions to increase their shopping frequency.

Cluster 2, with the highest median profit, consists of the superstore's most valuable customers, who are likely frequent buyers or spend significant amounts per transaction; these customers are prime candidates for loyalty and retention programs. Outliers in Clusters 0 and 1, where profits sometimes dip into losses or surge, require further analysis to minimise losses and capitalise on high-profit sales.

Tailoring strategies to each segment's behaviour and value can elevate the superstore’s profitability, by enhancing customer engagement and optimising sales tactics.

Modelling

To understand the key drivers behind the superstore's profitability, several models were tried, including decision trees and linear regression. Most gave a high Mean Squared Error (MSE) and a low R-squared (R^2). The low scores for plain linear regression in particular prompted an investigation of nonlinear dynamics within the data. Applying a Polynomial Features Transformation then improved performance, with the best results at degree 2, which hints that profitability depends on squared and interaction terms rather than on linear effects alone.

#Regression Model
#Considering relevant features for model
a = superstore[['Segment', 'State', 'Region', 'Ship_Mode', 'Sub_Category_no', 'Category',
                'Discount', 'Profit', 'Sales', 'Order_Size', 'Sales_per_Order',
                'Discount Rate', 'Avg_Discount_per_Category', 'Shipping_Time',
                'Product_Popularity']]
#Encoding categorical variable
superstore_encoded = pd.get_dummies(a, columns=['Segment', 'State', 'Region', 'Ship_Mode',
                                                'Sub_Category_no', 'Category'])
# Drop the target variable to isolate features
x = superstore_encoded.drop('Profit', axis=1)
# The target variable
y = superstore_encoded['Profit']
#Polynomial Features Transformation
degree = 2  # Degree of the polynomial
poly_features = PolynomialFeatures(degree=degree)
x_poly = poly_features.fit_transform(x)
#Linear Regression Model
model = LinearRegression()
model.fit(x_poly, y)
y_pred = model.predict(x_poly)
#Evaluate the Model (on the same rows used for fitting, i.e. in-sample)
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f'MSE: {mse}')
print(f'R^2: {r2}')

The metrics indicate that a linear regression model with polynomial features of degree 2 fits the data well, with an MSE of 1405.54 and an R^2 of 0.919. The relatively low MSE suggests that predictions are close to actual values, and the R^2 value means that approximately 91.9% of the variance in profit is explained by the model. Both metrics are computed on the same data used to fit the model, so they describe in-sample fit and may overstate accuracy on unseen transactions. With that caveat, the result implies that incorporating polynomial features has significantly improved the model's ability to capture the relationship between the variables, making it a useful starting point for forecasting and decision-making.
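One way to check how far the fit carries over to unseen data is to hold out part of the rows before fitting and score the model on them. The sketch below shows the procedure with the standard library only, on synthetic data and a one-variable least-squares line; the data, the 80/20 split and the helper names `fit_line` and `r2` are assumptions for illustration, not part of the analysis above. In the notebook, the already imported `train_test_split` would play the role of the manual shuffle.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

rng = random.Random(42)
# Synthetic data: a noisy linear relationship (hypothetical, for illustration)
xs = [rng.uniform(0, 100) for _ in range(200)]
ys = [5 + 0.3 * x + rng.gauss(0, 4) for x in xs]

# Shuffle the row indices and keep 20% aside as a test set
idx = list(range(len(xs)))
rng.shuffle(idx)
cut = int(0.8 * len(idx))
train, test = idx[:cut], idx[cut:]

# Fit on the training rows only
a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])

# Score on both subsets; the test score estimates out-of-sample fit
r2_train = r2([ys[i] for i in train], [a + b * xs[i] for i in train])
r2_test = r2([ys[i] for i in test], [a + b * xs[i] for i in test])
print(f"train R^2: {r2_train:.3f}, test R^2: {r2_test:.3f}")
```

A large gap between the two scores would indicate overfitting, which is a realistic risk when a degree-2 expansion of many one-hot encoded columns produces a very large number of terms.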

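# Coefficients and Intercept
coefficients = model.coef_
intercept = model.intercept_
# Names of the polynomial terms, so each coefficient is labelled by its term
# rather than by its position in the coefficient array
feature_names = poly_features.get_feature_names_out(x.columns)
# Printing the results (positive coefficients only)
print(f'Intercept: {intercept}')
print('Coefficients:')
for name, coef in zip(feature_names, coefficients):
    if coef > 0:
        print(f'Coefficient of {name} is: {coef}')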

Model coefficients in linear regression indicate the strength and direction of the relationship between each predictor variable and the target variable: the magnitude gives the strength and the sign gives the direction. A coefficient shows how much the target variable is expected to change when the predictor variable changes by one unit, holding other variables constant. With polynomial features of degree 2, each coefficient belongs to a single term (a feature, a squared feature, or a product of two features), so the effect of one original feature on profit is spread across several coefficients. Because the features are not on a common scale, magnitudes are not directly comparable across terms.
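To make the one-unit interpretation concrete for a degree-2 model, consider a simplified case with only two predictors $x_1$ and $x_2$. The notation is introduced here for illustration; the fitted model has many more terms of the same form.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2
        + \beta_{11} x_1^2 + \beta_{12} x_1 x_2 + \beta_{22} x_2^2
\qquad\Longrightarrow\qquad
\frac{\partial \hat{y}}{\partial x_1} = \beta_1 + 2\beta_{11} x_1 + \beta_{12} x_2
```

The expected change in profit for a small increase in $x_1$ therefore depends on the current values of $x_1$ and $x_2$, not on $\beta_1$ alone. For example, the effect of a discount could differ between low-value and high-value sales if the corresponding interaction coefficient is non-zero.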

Conclusion: By evaluating the relationships between product categories, discounts, customer segments, and purchasing behaviours, this report has highlighted specific areas that can be leveraged to improve the store's financial performance. Overall, leveraging strengths in high-profit areas while improving underperforming ones through strategic adjustments could drive comprehensive business growth and enhance profitability.

