Coursework 2: C11BD

Chirag Lamba

H00454315

Big data analytics (C11BD)

Introduction

In this assignment, we are given a dataset, Superstore.csv, which is a sample of data provided by a company that aims to improve its profits through data analytics. We explore the data with big data analytics techniques to obtain insights that can be applied to increase profitability. Each entry in the dataset is a distinct transaction within a particular time period, which lets us investigate patterns, trends, and focus areas.

Dataset Analysis

This notebook analyzes the Superstore.csv data to understand factors impacting profit.

First, let's start by importing the necessary libraries and loading the dataset.

# Import libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset into a DataFrame
data = pd.read_csv("Superstore.csv")

Explanation

We import pandas (pd) for data manipulation and analysis.

We import numpy (np) for numerical computations.

We import matplotlib.pyplot (plt) and seaborn (sns) for data visualization.

Data Cleaning

Next we take a look at the first few rows of the dataset to understand its structure and identify any potential data entry errors.

# Display the first few rows of the dataset to understand its structure
data.head()

Having understood the structure of the dataset, we proceed with data cleaning. This includes detecting and correcting missing values, outliers, and other inconsistencies in the data.

First, we check for missing values and duplicate rows and handle them.

# Count missing values per column
missing_values = data.isnull().sum()
print(missing_values)

# List the columns with the most missing values first
most_missing = missing_values.sort_values(ascending=False)
print("Columns with most missing values:")
print(most_missing.head())

# Count duplicate rows, then drop them (reassign so the result is kept)
print(data.duplicated().sum())
data = data.drop_duplicates()

# Drop rows with missing values
data = data.dropna()
print(data.shape)

# Remove rows with negative profit
data = data[data['Profit'] >= 0]

Explanation

We use the above block to count the missing values in each column. The second part sorts these counts so that the columns with the most missing values (if any) appear first.

As the output above shows, none of the cells in our data contain missing values. With no missing values to handle, we can proceed with the data analysis without needing to account for missing data. The same block also removes transactions with negative profit, so the rest of the analysis covers non-negative profits only.
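The checks described above can be illustrated on a small, made-up table. This is only a sketch: the `toy` frame and its values are hypothetical and are not taken from Superstore.csv. It shows how the per-column missing counts and the duplicate count are obtained, and that `drop_duplicates` and `dropna` return new frames whose result has to be assigned.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from Superstore.csv
toy = pd.DataFrame({
    "Sales": [10.0, 20.0, np.nan, 20.0],
    "Profit": [1.0, 2.0, 3.0, 2.0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Number of rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())

# Both methods return a new DataFrame, so the result must be assigned
cleaned = toy.drop_duplicates().dropna()

print(missing_per_column)
print("Duplicate rows:", n_duplicates)
print("Shape after cleaning:", cleaned.shape)
```

In this toy table one Sales value is missing and one row is repeated, so two of the four rows remain after cleaning.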

Next, we'll identify and handle any data entry errors. This might include typos, inconsistencies, or incorrect data types. Understanding data types is crucial for working with data effectively.

data_types = data.dtypes
print(data_types)

Identifying Outliers

Outliers are data points that significantly differ from other observations in the dataset. We can identify outliers by visualizing the data using histograms, box plots, or scatter plots.

# Explore outliers in the 'Profit' column
# Calculate quartiles and IQR
Q1 = data['Profit'].quantile(0.25)
Q3 = data['Profit'].quantile(0.75)
IQR = Q3 - Q1

# Remove outliers using the IQR method
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
data = data[(data['Profit'] >= lower_bound) & (data['Profit'] <= upper_bound)]

# Visualize profit distribution after removing outliers
plt.figure(figsize=(8, 6))
sns.histplot(data['Profit'], kde=True)
plt.title('Distribution of Profit after Removing Outliers (IQR Method)')
plt.xlabel('Profit')
plt.ylabel('Count')
plt.show()

Explanation

The above histogram plot shows the distribution of profit after removing outliers using the IQR method. From the visualization we derive:

Profits are concentrated in a low-to-moderate range, with higher profits occurring less frequently.

The distribution is right-skewed, with the tail extending toward higher profit values. In other words, low-profit transactions make up a larger share of the data than high-profit ones.
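To make the IQR rule behind this histogram concrete, here is a minimal sketch on a small made-up series. The helper name `iqr_bounds` and the values in `toy_profit` are hypothetical; the bounds follow the same Q1 - 1.5*IQR and Q3 + 1.5*IQR rule used in the notebook.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) limits of the IQR rule for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical profits with one obvious outlier
toy_profit = pd.Series([1, 2, 3, 4, 5, 100])

lower, upper = iqr_bounds(toy_profit)

# Keep only the values that fall inside the bounds
kept = toy_profit[(toy_profit >= lower) & (toy_profit <= upper)]

print("Bounds:", lower, upper)
print("Kept values:", kept.tolist())
```

Here the quartiles are 2.25 and 4.75, so the bounds are -1.5 and 8.5 and only the value 100 is dropped.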

# Visualize outliers using box plots for sales, quantity and discount
plt.figure(figsize=(12, 6))
sns.boxplot(data=data[['Sales', 'Quantity', 'Discount']])
plt.title('Sales Vs Quantity and Discount')
plt.show()

Interpretation:

Rows with negative profit have already been removed at this point. The box plot shows the distribution of three variables: sales, quantity, and discount.

Sales: The distribution of sales appears right-skewed, given that the upper whisker is longer and extends toward higher sales values. This suggests that most sales are concentrated at relatively low amounts, with a smaller number of much larger transactions.

Quantity: Visually, the quantity data looks more or less symmetrical, with the median lying near the center of the box. The whiskers are of similar length at the high and low ends, which indicates a comparable spread on both sides of the median.

Discount: The discount data also appears somewhat right-skewed, with a longer whisker toward higher discounts. This means that some transactions are likely much more heavily discounted than the rest.

For further analysis, we calculate the average profit for each product category and visualize the results.

# Calculate average profit by category
avg_profit_by_category = data.groupby('Category')['Profit'].mean()
print(avg_profit_by_category)

# Get category names and average profits as lists
categories = avg_profit_by_category.index.tolist()
average_profits = avg_profit_by_category.tolist()

# Create bar chart
plt.bar(categories, average_profits)

# Customize chart appearance
plt.xlabel("Product Category")
plt.ylabel("Average Profit")
plt.title("Average Profit by Product Category")
plt.xticks(rotation=45, ha="right")  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust layout for clarity
plt.show()

Explanation:

Technology appears to be the most profitable product category, with the highest average profit at $29.2, compared with Furniture ($23.7) and Office Supplies ($15.2).

After cleaning the data, it's important to validate the changes made and ensure that the dataset is now ready for further analysis.

# Summary statistics after cleaning
summary_stats = data.describe()
print(summary_stats)

Explanation:

Total number of rows after cleaning is 7174.

Customers: The summary also includes customer information. The firm has an average of 351 customers spread across the product categories.

Sales: This column likely refers to the sales amount per customer order. The average sale amount is $104.60, with a standard deviation of $184.88. The standard deviation is much larger than the mean, which indicates high variability: some sales are far above the average and many are far below it.

The minimum sale amount is very low ($0.99), while the maximum is very high ($4164.05). This further highlights the large variation in sales amounts.

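Discount: The average discount is 8.54% and the maximum discount offered is 12%. The standard deviation is 0.10. If this is on the same 0-1 scale as the discount column, it corresponds to 10 percentage points, which is slightly larger than the mean, so discount levels vary noticeably between transactions rather than being consistent.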

Profit: Most importantly, the average profit is $19.05. The standard deviation is $21, slightly higher than the mean, which indicates that profit differs considerably from one sale to another. The maximum profit is $92.50 per transaction, which is consistent with the outlier removal applied earlier.

Quantity: The average quantity sold per transaction is 6. This seems low and might point to an error in the data.
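The spread comparisons in this summary can be expressed with the coefficient of variation (standard deviation divided by mean). The sketch below uses only the Sales and Profit figures reported above; the dictionary name `reported` is an illustrative choice, and the numbers are copied from the text rather than recomputed from the data.

```python
# Means and standard deviations as reported in the summary above
reported = {
    "Sales": {"mean": 104.6, "std": 184.88},
    "Profit": {"mean": 19.05, "std": 21.0},
}

# Coefficient of variation: how large the spread is relative to the mean
cv = {name: stats["std"] / stats["mean"] for name, stats in reported.items()}

for name, value in cv.items():
    print(f"{name}: coefficient of variation = {value:.2f}")
```

A value above 1 means the standard deviation exceeds the mean. Sales comes out at roughly 1.77 and Profit at roughly 1.10, so sales amounts are the more dispersed of the two.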

We have successfully imported and cleaned the dataset. We can now proceed to exploring and modelling the data.

Exploratory Data Analysis

# Categorical plotting (bar chart) - total profit by segment
segment_profit = data.groupby('Segment')['Profit'].sum()
segment_profit.plot(kind='bar', xlabel='Segment', ylabel='Total Profit', title='Profit by Segment')
plt.show()

# Categorical plotting (bar chart) - sales by segment
# Note: sns.barplot shows the mean of 'Sales' per segment by default, not the total
plt.figure(figsize=(8, 6))
sns.barplot(x='Segment', y='Sales', data=data)
plt.title('Sales by Segment')
plt.show()

Explanation and Results

From the results, the Consumer segment has the highest total profit and the highest sales, followed by the Corporate and Home Office segments. This is intuitive, as most companies focus primarily on consumers and allocate their resources accordingly.
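When comparing segments it helps to keep totals and averages apart, because seaborn's `barplot` shows the mean of the y variable per category by default while the profit chart uses a sum. The following sketch uses a hypothetical `toy` table (values invented for illustration) to show that the segment with the largest total is not necessarily the one with the largest average.

```python
import pandas as pd

# Hypothetical transactions, not taken from Superstore.csv
toy = pd.DataFrame({
    "Segment": ["Consumer", "Consumer", "Consumer", "Corporate", "Home Office"],
    "Sales": [10.0, 20.0, 30.0, 50.0, 40.0],
})

# Total, average, and number of transactions per segment
by_segment = toy.groupby("Segment")["Sales"].agg(["sum", "mean", "count"])

print(by_segment)
print("Highest total:", by_segment["sum"].idxmax())
print("Highest average:", by_segment["mean"].idxmax())
```

In this toy table Consumer has the highest total because it has the most transactions, while Corporate has the highest average per transaction.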

# Continuous plotting (scatter plot) - Profit vs Sales
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Sales', y='Profit', data=data)
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.title('Profit vs Sales')
plt.show()

A positive correlation is observed, with a general upward trend: as sales grow, profit tends to grow too. However, the data points are widely scattered, showing that profit does not always vary in proportion to sales. Some transactions show high sales but low profit, and others show low sales but high profit.
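The strength of such a relationship can be quantified with the Pearson correlation coefficient. The sketch below runs on synthetic data generated with an assumed linear trend plus noise; the slope, noise level, and sample size are arbitrary choices for illustration and say nothing about the actual Superstore values.

```python
import numpy as np
import pandas as pd

# Synthetic data: profit rises with sales, plus random noise (assumed values)
rng = np.random.default_rng(0)
sales = rng.uniform(1, 500, size=200)
profit = 0.1 * sales + rng.normal(0, 10, size=200)
toy = pd.DataFrame({"Sales": sales, "Profit": profit})

# Pearson correlation between the two columns (ranges from -1 to 1)
r = toy["Sales"].corr(toy["Profit"])

print(f"Pearson r = {r:.2f}")
```

On the real dataset the same call would be `data['Sales'].corr(data['Profit'])`. A value close to 1 indicates a tight upward trend, and a value closer to 0 indicates the kind of wide scatter described above.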

Data Modelling

Linear Regression Analysis

We will use linear regression to analyze the factors that affect profitability. Linear regression is a suitable approach in this case because our target variable (profit) is continuous.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Define the features (X) and the target (y)
X = data[['Sales', 'Quantity', 'Discount']]
y = data['Profit']

# Split into training and test sets, then fit the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the test set: R-squared and RMSE
y_pred = model.predict(X_test)
print('R-squared:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Intercept:', model.intercept_)
print('Coefficients:')
for feature, coef in zip(X.columns, model.coef_):
    print(f'{feature}: {coef:.2f}')

# Scatter plot of Sales vs Profit
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Sales', y='Profit', data=data, alpha=0.5)
# Overlay a Sales-only regression line (not the multi-feature model fitted above)
sns.regplot(x='Sales', y='Profit', data=data, scatter=False, ci=None, color='r')
plt.title('Regression Analysis - Sales vs Profit')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.show()

Coefficients:

Sales: The coefficient of 0.07 for 'Sales' implies that the predicted 'Profit' grows by 0.07 units for every additional unit of 'Sales', with the other features held constant. This positive value indicates that greater sales are associated with greater profit.

Quantity: This coefficient is 0.00 to two decimal places, which means quantity has practically no linear effect on predicted profit once sales and discount are held constant. A near-zero coefficient is not in itself a test of statistical significance. The result can be explained by observing that more units sold do not necessarily mean higher profits.

Discount: The coefficient of -35.52 shows that when 'Discount' goes up by 1 unit, 'Profit' goes down by 35.52 units, with the other features held constant. Assuming discount is recorded as a fraction between 0 and 1, a one-unit change is a move from 0% to 100%, so a 10 percentage point increase corresponds to a drop of about 3.55 in predicted profit. This is in line with the idea that higher discounts bring about lower profits.

Intercept: The intercept is 15.2, meaning that the predicted value of 'Profit' is 15.2 when all the feature values ('Sales', 'Quantity' and 'Discount') are zero.

Visualization: The plot shows a positive correlation between 'Sales' and 'Profit', which is consistent with the positive coefficient for 'Sales' in the linear regression model.
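As a worked example, the fitted equation can be applied by hand using the rounded values reported in this section. This is a sketch only: the function name `predict_profit` and the example inputs are hypothetical, discount is assumed to be a fraction between 0 and 1, and the rounded coefficients will not reproduce the model's predictions exactly.

```python
# Rounded values as reported in the text above
INTERCEPT = 15.2
COEFFICIENTS = {"Sales": 0.07, "Quantity": 0.0, "Discount": -35.52}


def predict_profit(sales: float, quantity: float, discount: float) -> float:
    """Apply the linear equation: intercept plus coefficient times feature."""
    return (
        INTERCEPT
        + COEFFICIENTS["Sales"] * sales
        + COEFFICIENTS["Quantity"] * quantity
        + COEFFICIENTS["Discount"] * discount
    )


# Hypothetical order: $100 of sales, 3 units, 10% discount
example = predict_profit(sales=100.0, quantity=3, discount=0.10)

# With every feature at zero, the prediction equals the intercept
baseline = predict_profit(sales=0.0, quantity=0, discount=0.0)

print(f"Predicted profit for the example order: {example:.2f}")
print(f"Predicted profit with all features at zero: {baseline:.2f}")
```

For the example order the terms are 15.2 + 7.0 + 0 - 3.552, which gives about 18.65.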

Interpretation and Conclusion

Firstly, the positive coefficient for 'Sales' implies that higher sales are associated with higher profits, as previously anticipated.

Next, the coefficient for 'Quantity' is approximately zero, which suggests that selling more units of a product does not by itself raise profit once sales value and discount are taken into account.

The negative coefficient for 'Discount' implies that a higher discount rate decreases profits, so the company should pay close attention to this area.

From these outcomes, we infer that the company should focus on growing sales while keeping discounts to a minimum in order to maximize profits.

The R-squared value means that the model explains approximately 30% of the variation in the target variable 'Profit'. The remaining 70% or so is not captured by sales, quantity, and discount, so the fit is modest and other factors clearly influence profit.

The RMSE of 17.974 indicates the typical size of the model's error when predicting 'Profit'. A lower RMSE is better, because it shows that the model's predictions are close to the actual values. Here the RMSE is large relative to the average profit of $19.05, so individual predictions are imprecise.
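To show what these two metrics measure, the sketch below computes them from their definitions on a tiny made-up example. The helper names `r_squared` and `rmse` and the four value pairs are hypothetical; in the notebook the same quantities come from scikit-learn's `r2_score` and `mean_squared_error`.

```python
import math


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Square root of the mean squared prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted profits
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

r2_value = r_squared(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)

print(f"R-squared: {r2_value:.3f}")
print(f"RMSE: {rmse_value:.3f}")
```

R-squared is unitless and describes the share of variation explained, whereas RMSE is in the same units as the target, which is why it can be compared directly with typical profit values.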

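In brief, the regression analysis indicates that the model fits the data only moderately and that discount has the largest coefficient among the variables under review. Because the features are measured on different scales, the coefficients are not directly comparable in size.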

k-means Clustering

from sklearn.cluster import KMeans

k = 3
X = data[['Sales', 'Profit']]

# Fit the model to the data
kmeans = KMeans(n_clusters=k, random_state=0).fit(X)
data['Cluster'] = kmeans.labels_

# Create a scatter plot using seaborn
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Sales', y='Profit', hue='Cluster', data=data, palette='viridis')
plt.title('K-Means Clustering of Sales vs Profit')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.show()

Conclusion and Findings

We use k-means clustering to group transactions with similar sales and profit, which supports decision making. It appears that cluster 1 has high sales and high profit, while cluster 2 has high sales and low profit. Cluster 0 has low sales and low profit. The company should therefore direct resources toward the low-profit and low-sales areas while sustaining the profitable ones.
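A per-cluster summary table is one way to check descriptions like these against numbers rather than reading them off the scatter plot. The sketch below assumes cluster labels are already stored in a 'Cluster' column, as in the notebook. The `toy` values and their labels are invented for illustration and do not come from the fitted model.

```python
import pandas as pd

# Hypothetical transactions with pre-assigned cluster labels
toy = pd.DataFrame({
    "Sales": [20.0, 30.0, 400.0, 420.0, 380.0, 410.0],
    "Profit": [2.0, 4.0, 80.0, 90.0, 5.0, 7.0],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

# Average sales and profit in each cluster
profile = toy.groupby("Cluster")[["Sales", "Profit"]].mean()

# Number of transactions in each cluster
profile["Count"] = toy.groupby("Cluster").size()

print(profile)
```

On the real data the same two lines can be run on `data` after the 'Cluster' column has been assigned. Note that k-means label numbers are arbitrary, so the profile table is what identifies which cluster is the high-sales, high-profit one.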
