Coursework 2: C11BD

Submitted by- Muskan Valecha

Student id- H00456189

Course name- Big Data Analytics (C11BD)

Introduction

In this assignment, we work with the provided dataset "superstore.csv", which belongs to a company aiming to increase its profitability using the tools and techniques of big data analytics. By identifying potential strengths and areas that need improvement, we can discover trends and patterns in the data. We use modelling techniques to learn about the relationships between variables, and visualization charts to aid understanding. The findings and conclusions from the visualizations are explained to give the insights required to determine the factors affecting profitability. We also identify areas for improvement and features that add no value to the analysis.

Analysis

Step 1: We will import the data and necessary libraries. This is the initial stage of any data analysis task where we load the dataset into our Python environment.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Loading the dataset
df = pd.read_csv("Superstore.csv")

# Display the first few rows of the dataframe
df.head()

1. We import pandas (pd) for data analysis and manipulation.

2. We import numpy (np) for scientific computation.

3. We import matplotlib.pyplot (plt) for creating visualizations.

4. We import seaborn (sns) for data visualizations.

Data Cleaning

Data cleaning consists of spotting mistakes and outliers in the data set and fixing them. This process makes the data more precise and fit for further analysis and modelling, which will be used to derive meaningful insights into the company's profitability (Raschka, Sebastian).

Step 2: The next step is to look for any missing values and duplicate rows in the dataset.

# Check for missing values in each column
missing_values = df.isnull().sum()
print(missing_values)

# Remove rows with missing values
cleaned_data = df.dropna()

# Identify and remove duplicate rows
duplicate_rows = df.duplicated().sum()
print("\nNumber of duplicate rows:", duplicate_rows)
cleaned_data = cleaned_data.drop_duplicates()

We find no missing values and no duplicate rows in our dataset, so every record is complete and unique and can be used to derive insights.

Step 3: Next we identify and remove the outliers from the data and visualize the result. We use the z-score method to detect and remove outliers in the 'Sales' column and then visualize the remaining data with a boxplot.
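The two checks can be tried on a small frame where the answer is known in advance. The sketch below uses a made-up four-row table (not the Superstore data) containing one missing value and one exact duplicate; the name `toy` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (not the Superstore data) with one missing value
# and one row that exactly repeats an earlier row.
toy = pd.DataFrame({
    "Sales": [10.0, 25.5, np.nan, 25.5],
    "Quantity": [1, 3, 2, 3],
})

missing_per_column = toy.isnull().sum()         # missing values per column
duplicate_count = int(toy.duplicated().sum())   # rows identical to an earlier row

# Drop incomplete rows first, then exact duplicates.
toy_clean = toy.dropna().drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print("Duplicate rows:", duplicate_count)
print(toy_clean)
```

Note that `duplicated()` only counts rows; it is `drop_duplicates()` that actually removes them.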

# Detect outliers in the 'Sales' column using Z-scores
import numpy as np
import seaborn as sns
from scipy import stats

z_scores = np.abs(stats.zscore(cleaned_data['Sales']))
outliers = (z_scores > 3)

# Remove outliers based on Z-score threshold (e.g., 3)
cleaned_data = cleaned_data[~outliers]

# Visualize the remaining Sales values using a boxplot
plt.figure(figsize=(9, 6))
sns.boxplot(x=cleaned_data['Sales'])
plt.title('Boxplot of Sales Data After Z-score Filtering')
plt.show()

Explanation

In this step, we identify and remove outliers from the 'Sales' data based on Z-scores exceeding 3. The boxplot confirms that some outliers remain even after this filtering. This is because a boxplot marks every point outside its whiskers as an outlier, and by default the whiskers extend 1.5 times the interquartile range beyond the quartiles, which is a different rule from the Z-score threshold. In this boxplot the outliers fall to the right-hand side, towards positive deviation in sales.

As we can see from the above, the data is right-skewed: most sales are concentrated at the low end, so there is a much wider range of sales above the median than below it, and the remaining outliers lie above the upper whisker.
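To see why a boxplot can still show outliers after Z-score filtering, the following sketch applies both rules to synthetic right-skewed values. The log-normal sample and its parameters are assumptions chosen for illustration, not the Superstore data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "sales" values (log-normal); not the Superstore data.
sales = rng.lognormal(mean=4.0, sigma=1.0, size=5000)

# Z-score rule: keep points within 3 standard deviations of the mean.
z = np.abs((sales - sales.mean()) / sales.std())
kept = sales[z <= 3]

# Boxplot rule: points more than 1.5 * IQR beyond the quartiles are drawn
# as outliers.
q1, q3 = np.percentile(kept, [25, 75])
iqr = q3 - q1
upper_limit = q3 + 1.5 * iqr
lower_limit = q1 - 1.5 * iqr

above = int((kept > upper_limit).sum())
below = int((kept < lower_limit).sum())

print("Removed by Z-score rule:", sales.size - kept.size)
print("Still beyond the upper whisker:", above)
print("Below the lower whisker:", below)
```

For skewed data the mean and standard deviation are themselves inflated by the long tail, so the Z-score rule is more lenient than the interquartile-range rule on the high side.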

Step 4: Having corrected the errors and removed the outliers, we now calculate summary statistics for the essential features.

# Print summary statistics
summary_stats = cleaned_data[['Sales', 'Quantity', 'Discount', 'Profit']].describe()
print(summary_stats)

Findings from the summary

Here we have calculated summary statistics for the essential features: sales, quantity, discount, and profit. Overall, the data shows high variation, with a standard deviation that is large relative to the mean for all four metrics.

The average sale is $180.55 with a standard deviation of $300. The firm has recorded a total of 9867 sales, of which the minimum sale amount was $0.44 and the maximum was $2079.4.

Quantity is the number of units sold in each transaction. The average is 6.8 units per transaction with a standard deviation of 174.2, and the quantity ranges from a minimum of 1 unit to a maximum of 10000 units. A standard deviation this far above the mean, together with the 10000-unit maximum, suggests that a few extreme or erroneous quantity entries remain in the data.

Most importantly, the average profit per sale is $19.40 with a standard deviation of $111. The negative profit values show that some sales were made at a loss, possibly because of the higher discounts offered on them or for other reasons. The maximum profit is about $900.

Discount is the discount rate applied to each transaction. The average discount is 15.8% with a standard deviation of 20%. The highest discount recorded is 120%, which cannot be correct and points to an error in the data.
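Implausible values such as a discount above 100% or a very large quantity can be flagged with simple range checks. The sketch below uses made-up rows, and the limit `MAX_PLAUSIBLE_QUANTITY` is a hypothetical business rule assumed for illustration; it is not taken from the dataset or the coursework brief.

```python
import pandas as pd

# Made-up rows for illustration; only the column names match the dataset.
toy = pd.DataFrame({
    "Quantity": [2, 5, 10000, 3],
    "Discount": [0.0, 0.2, 0.1, 1.2],
})

# Assumed rules (hypothetical): a discount is a fraction between 0 and 1,
# and one transaction has between 1 and MAX_PLAUSIBLE_QUANTITY units.
MAX_PLAUSIBLE_QUANTITY = 100

bad_discount = ~toy["Discount"].between(0, 1)
bad_quantity = ~toy["Quantity"].between(1, MAX_PLAUSIBLE_QUANTITY)

suspect_rows = toy[bad_discount | bad_quantity]    # rows to inspect or correct
valid_rows = toy[~(bad_discount | bad_quantity)]   # rows passing both checks

print(suspect_rows)
```

Flagging the rows first, rather than deleting them immediately, makes it possible to check whether they are data-entry errors before deciding how to treat them.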

Plotting The Data

Step 5: Plotting the Data

Now that the data has been cleaned, we can visualize it by plotting the essential features and examining the relationships between them.

import matplotlib.pyplot as plt
import seaborn as sns

# Categorical plot (Segment vs. Profit)
# estimator=sum plots the total profit; the seaborn default is the mean per segment
plt.figure(figsize=(8, 6))
sns.barplot(x='Segment', y='Profit', data=df, estimator=sum)
plt.xlabel('Segment')
plt.ylabel('Total Profit')
plt.title('Total Profit by Segment')

# Continuous plot (Sales vs. Profit)
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Sales', y='Profit', data=df)
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.title('Profit vs. Sales Scatter Plot')
plt.show()

Interpreting Visualization

Fig 1 - This graph shows that the Consumer segment is the most profitable of all the business segments. This kind of information can help a company's management decide how to allocate the available resources. For instance, the company could invest more in the Consumer segment because it is the most profitable.

Fig 2 - The scatter plot shows a positive relationship between sales and profit: as sales grow, profit generally goes up as well. However, some points do not follow this trend. For example, there is a data point in the bottom right corner that represents high sales but low profit. This can happen for reasons such as a high cost of goods sold or higher discounts.
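The two readings above can also be checked numerically. The sketch below, on made-up rows rather than the Superstore data, compares total and average profit per segment and computes the Pearson correlation between sales and profit; the values are assumptions chosen so that the two rankings differ.

```python
import pandas as pd

# Made-up rows for illustration; only the column names match the dataset.
toy = pd.DataFrame({
    "Segment": ["Consumer", "Consumer", "Consumer", "Corporate", "Home Office"],
    "Sales":   [100.0, 200.0, 50.0, 400.0, 300.0],
    "Profit":  [10.0, 20.0, 15.0, 40.0, 25.0],
})

# Total, average, and number of transactions per segment.
per_segment = toy.groupby("Segment")["Profit"].agg(["sum", "mean", "count"])

# Pearson correlation between sales and profit (between -1 and 1).
sales_profit_corr = toy["Sales"].corr(toy["Profit"])

print(per_segment)
print("Correlation between Sales and Profit:", round(sales_profit_corr, 3))
```

In this toy table the Consumer segment has the highest total profit only because it has the most transactions, while Corporate has the highest average profit, so it is worth stating which of the two a bar chart displays.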

Data Modelling

Step 6: Modeling Strategy Selection and Implementation

k-means Clustering

Using k-means clustering, we look for the customer segments where the company is most profitable and the areas that need improvement.

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Standardize the features so that each contributes equally to the distance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[['Sales', 'Discount', 'Quantity']])

# Choose the number of clusters (k=3); random_state makes the run repeatable
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X_scaled)

# Assign cluster labels to each data point
df['cluster'] = kmeans.labels_

# Print cluster centroids (in standardized units)
print(kmeans.cluster_centers_)

# Analyze average profit per cluster
print(df.groupby('cluster')['Profit'].mean())

# Visualizing results through scatter plot
plt.scatter(df['Sales'], df['Profit'], c=df['cluster'])
plt.xlabel("Sales")
plt.ylabel("Profit")
plt.title("Sales vs. Profit by Cluster (K-Means)")
plt.show()

Findings:

Cluster 3 (in blue) appears to have the highest sales and the highest profit. Cluster 1 (in green) appears to have lower sales and lower profits than Cluster 3 but is still profitable. Cluster 2 (in red) appears to have low sales and high losses. Cluster 4 (in yellow) appears to cover a range of sales figures, all at a loss. Hence, the customer segments in Cluster 3 are the most profitable, and the customer segments in Cluster 4 need immediate attention.
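The number of clusters is a modelling choice. One common way to support it is to compare silhouette scores for several values of k. The sketch below does this on synthetic data with three well-separated groups; the data, the range of k, and the variable names are assumptions for illustration and not part of the analysis above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for (Sales, Discount, Quantity): three separated groups.
centers = np.array([[100, 0.0, 2], [800, 0.25, 6], [1500, 0.5, 10]], dtype=float)
spread = np.array([20, 0.02, 0.5])
X = np.vstack([c + rng.normal(size=(100, 3)) * spread for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Best k by silhouette:", best_k)
```

A higher silhouette score means the points sit closer to their own cluster than to the neighbouring one. On real data the peak is usually less clear than in this synthetic case, so the score is a guide rather than a rule.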

Linear Regression Analysis

Linear regression gives us insight by defining the relationship between variables, which can then be used to forecast future trends (Lee, Wei-Meng). We will examine the coefficients of the model, which indicate the strength and direction of these relationships.

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

X = cleaned_data[['Sales', 'Quantity', 'Discount']]
y = cleaned_data['Profit']

model = LinearRegression()
model.fit(X, y)

# Coefficients, intercept and predictions
coefficients = model.coef_
print("Coefficients:", coefficients)
intercept = model.intercept_
print("Intercept:", intercept)
predictions = model.predict(X)
score = model.score(X, y)
print("Model Score:", score)

# Visualizing the results through scatter plot of actual vs predicted profit
plt.scatter(y, predictions)
plt.xlabel('Actual Profit')
plt.ylabel('Predicted Profit')
plt.title('Actual vs. Predicted Profit')
plt.show()

# Residual plot: residuals against predicted profit
residuals = y - predictions
plt.scatter(predictions, residuals)
plt.axhline(y=0, color='r', linestyle='-')
plt.xlabel('Predicted Profit')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

Results And Conclusions

We used a combination of linear regression and k-means clustering because together they help the company identify how sales, quantity, and discounts impact profitability. Through k-means, we find the customer clusters that need immediate attention, and from the regression we conclude the following:

Model Score (R-squared value): 0.192 (around 19%). This means the model explains only about 19% of the variation in profit, which indicates a weak fit between the actual and predicted profits. Ideally, the R-squared value should be closer to 1, which would mean the model is predicting profits effectively.
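The score reported by `model.score` is the coefficient of determination, which equals 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on synthetic data (an assumption for illustration, not the Superstore data) and compares it with the value returned by scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic features and a noisy target; not the Superstore data.
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=3.0, size=200)

model = LinearRegression().fit(X, y)
predictions = model.predict(X)

ss_res = np.sum((y - predictions) ** 2)   # variation the model leaves unexplained
ss_tot = np.sum((y - y.mean()) ** 2)      # total variation around the mean
r_squared = 1.0 - ss_res / ss_tot

print("R-squared by hand:   ", r_squared)
print("R-squared from score:", model.score(X, y))
```

Read this way, an R-squared of 0.192 says that roughly four fifths of the variation in profit is not accounted for by the three features.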

Coefficients: The positive sales coefficient suggests that higher sales lead to higher predicted profits, as expected. The coefficient of quantity is positive but very small, close to zero, which suggests that an increase in quantity has a minimal positive effect on predicted profit. Finally, the discount coefficient is negative, indicating that a higher discount reduces the predicted profit, which is intuitive.
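Raw coefficients depend on the units of each feature, so their sizes are not directly comparable: sales is measured in dollars, quantity in units, and discount as a fraction. One way to compare them, sketched below on synthetic data whose generating coefficients are assumptions for illustration, is to refit the model on standardized features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500
# Synthetic features on very different scales; not the Superstore data.
sales = rng.uniform(1, 2000, n)
quantity = rng.integers(1, 15, n).astype(float)
discount = rng.uniform(0, 0.8, n)
profit = 0.1 * sales + 1.0 * quantity - 100.0 * discount + rng.normal(scale=5, size=n)

X = np.column_stack([sales, quantity, discount])

# Coefficients per original unit (per dollar, per item, per unit of discount).
raw_coef = LinearRegression().fit(X, profit).coef_

# Coefficients per standard deviation of each feature.
X_std = StandardScaler().fit_transform(X)
std_coef = LinearRegression().fit(X_std, profit).coef_

for name, r, s in zip(["Sales", "Quantity", "Discount"], raw_coef, std_coef):
    print(f"{name:9s} raw={r:9.3f}  standardized={s:9.3f}")
```

In this synthetic example the sales coefficient is the smallest in raw units but the largest after standardization, which shows why a coefficient close to zero does not by itself mean that a feature is unimportant.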

Intercept: The intercept is 31.3, which is the predicted profit when all features are zero.

The scatter graph shows that the actual profits differ from the predicted profits throughout the range. This suggests that, besides the variables used in the model, other factors probably also affect profit.

The residual plot shows the difference between the actual profit and the predicted profit for each data point. As depicted, the random spread of the residuals indicates that the model's errors are consistent across the range of predicted profits.

To summarize, the analysis yields insights into the company's sales data, profitability factors, customer segments, and improvement areas. This is done through data cleaning, visualization, clustering, and regression techniques. Our recommendations focus on leveraging insights from customer segments and addressing the areas that require immediate attention to drive profitability improvements.

References

Raschka, Sebastian. Python Machine Learning. Google Books, Packt Publishing Ltd, 23 Sept. 2015, books.google.co.uk/books?hl=en&lr=&id=GOVOCwAAQBAJ&oi=fnd&pg=PP1&dq=machine+learning+dataset+python&ots=Ne8vO9TSVK&sig=mRvBu23qqmjE1mhJakg7bqklsik#v=onepage&q=machine%20learning%20dataset%20python&f=false. Accessed 17 Mar. 2024.

Lee, Wei-Meng. Python Machine Learning. Google Books, John Wiley & Sons, 4 Apr. 2019, books.google.co.uk/books?hl=en&lr=&id=9FOQDwAAQBAJ&oi=fnd&pg=PP2&dq=machine+learning+dataset+python&ots=p-llArTSxC&sig=ru8FSh8_LfUQGLHbL7EjZdognQc#v=onepage&q=machine%20learning%20dataset%20python&f=false. Accessed 17 Mar. 2024.
