Coursework 2: C11BD
Chirag Lamba
H00454315
Big Data Analytics (C11BD)
Introduction
In this assignment we are given a dataset, Superstore.csv, a sample of data provided by a company that wants to improve its profits through data analytics. We apply big data analytics techniques to extract insights that can be used to increase profitability. Each entry in the dataset is a distinct transaction within a particular time period, which lets us investigate patterns, trends, and focus areas.
Dataset Analysis
This notebook analyzes the Superstore.csv data to understand factors impacting profit.
First, let's start by importing the necessary libraries and loading the dataset.
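A minimal sketch of the loading step, assuming pandas is used. The helper name `load_superstore`, the column names, and the two sample rows are illustrative assumptions; the in-memory text only stands in for Superstore.csv so that the snippet runs on its own.

```python
import io

import pandas as pd


def load_superstore(source):
    """Read the data from a file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# In the notebook this would be: df = load_superstore("Superstore.csv")
# Made-up rows with assumed column names stand in for the real file here.
sample_csv = io.StringIO(
    "Segment,Category,Sales,Quantity,Discount,Profit\n"
    "Consumer,Furniture,261.96,2,0.0,41.91\n"
    "Corporate,Technology,957.58,5,0.45,-383.03\n"
)
df = load_superstore(sample_csv)

print(df.shape)   # number of rows and columns
print(df.head())  # first few rows, to inspect the structure
```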
Explanation
Data Cleaning
Next, we look at the first few rows of the dataset to understand its structure and identify any potential data entry errors.
Having understood the structure of the dataset, we proceed with data cleaning. This includes detecting and correcting missing values, outliers, and other inconsistencies in the data.
First, we check for missing values and handle any that we find.
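The missing-value check could look like the sketch below. The small DataFrame is made up for illustration, and the handling strategy (median for a numeric column, dropping rows with no category) is one assumed option rather than necessarily the one used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample with two gaps; column names are assumptions.
df = pd.DataFrame({
    "Category": ["Furniture", "Technology", None, "Furniture"],
    "Sales": [261.96, np.nan, 14.62, 22.37],
    "Profit": [41.91, 219.58, 6.87, 2.52],
})

# Count missing values per column.
missing_counts = df.isnull().sum()
print(missing_counts)

cleaned = df.copy()
# Numeric gap: fill with the column median, which is robust to extreme values.
cleaned["Sales"] = cleaned["Sales"].fillna(cleaned["Sales"].median())
# Categorical gap: drop the row, since the category cannot be inferred.
cleaned = cleaned.dropna(subset=["Category"])
print(cleaned)
```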
Explanation
Next, we'll identify and handle any data entry errors. This might include typos, inconsistencies, or incorrect data types. Understanding data types is crucial for working with data effectively.
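As an illustration of checking and correcting data types, the sketch below uses a made-up frame in which numbers and dates were read as text. The column names (`Order Date` in particular) and the date format are assumptions about the file, not details taken from it.

```python
import pandas as pd

# Made-up sample where every column was read in as text.
df = pd.DataFrame({
    "Order Date": ["11/8/2016", "6/12/2016", "not a date"],
    "Sales": ["261.96", "731.94", "abc"],
    "Segment": [" consumer", "Corporate ", "Home Office"],
})
print(df.dtypes)  # inspect the types pandas inferred

# Convert to proper types; values that cannot be parsed become NaT/NaN
# so they can be reviewed instead of silently kept as text.
df["Order Date"] = pd.to_datetime(df["Order Date"], format="%m/%d/%Y", errors="coerce")
df["Sales"] = pd.to_numeric(df["Sales"], errors="coerce")

# Normalise inconsistent spacing and capitalisation in a text column.
df["Segment"] = df["Segment"].str.strip().str.title()
print(df)
```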
Identifying Outliers
Outliers are data points that significantly differ from other observations in the dataset. We can identify outliers by visualizing the data using histograms, box plots, or scatter plots.
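Besides visual inspection, outliers can be flagged numerically. The sketch below applies the common 1.5 × IQR rule to a made-up profit series; the values and the 1.5 multiplier are illustrative assumptions.

```python
import pandas as pd

# Made-up profit values with two obvious extremes.
profit = pd.Series(
    [12.0, 15.0, 14.0, 10.0, 13.0, 11.0, 16.0, 500.0, -400.0], name="Profit"
)

q1 = profit.quantile(0.25)
q3 = profit.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Points more than 1.5 * IQR beyond the quartiles count as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (profit < lower) | (profit > upper)

filtered = profit[~is_outlier]
print(f"bounds: [{lower}, {upper}]")
print(filtered)
```

A histogram or box plot of `filtered` can then be compared with one of the raw series to see the effect of the filter.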
Explanation
The histogram above shows the distribution of profit after removing outliers using the IQR method. From the visualization we derive:
Interpretation:
Here we have removed profits that have a negative value. The box plot shows the distribution of three variables: sales, quantity, and discount.
For further analysis, we calculate the average profit for each product category and visualize the results.
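The per-category average can be computed with a group-by, as in this sketch. The rows are made up and the column names `Category` and `Profit` are assumptions.

```python
import pandas as pd

# Made-up transactions; column names are assumptions.
df = pd.DataFrame({
    "Category": ["Furniture", "Furniture", "Technology", "Technology", "Office Supplies"],
    "Profit": [10.0, 30.0, 100.0, 50.0, 5.0],
})

# Mean profit per category, highest first.
avg_profit = df.groupby("Category")["Profit"].mean().sort_values(ascending=False)
print(avg_profit)

# With matplotlib installed, avg_profit.plot(kind="bar") would draw the bar chart.
```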
Explanation:
After cleaning the data, it is important to validate the changes and confirm that the dataset is ready for further analysis.
Explanation:
We have successfully imported and cleaned the dataset, so we can now proceed to modelling the data for further analysis.
Exploratory Data Analysis
Explanation and Results
The results show that the Consumer segment has the highest total profit and the highest sales volume, followed by the Corporate and Home Office segments. This is intuitive, as most companies focus primarily on consumers and allocate their resources accordingly.
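A segment comparison of this kind can be produced by summing sales and profit per segment. The sketch below uses made-up figures that merely mirror the ordering described above; they are not the notebook's results.

```python
import pandas as pd

# Made-up transactions; values do not come from Superstore.csv.
df = pd.DataFrame({
    "Segment": ["Consumer", "Consumer", "Corporate", "Home Office"],
    "Sales": [100.0, 200.0, 150.0, 80.0],
    "Profit": [20.0, 30.0, 25.0, 10.0],
})

# Total sales and profit per segment, ranked by profit.
segment_summary = (
    df.groupby("Segment")[["Sales", "Profit"]]
    .sum()
    .sort_values("Profit", ascending=False)
)
print(segment_summary)
```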
A positive correlation is observed, with a general upward trend: as sales grow, profit tends to grow too. However, the data points are also scattered, showing that profit does not always vary in proportion to sales. Some data points show high sales but low profit, and others show low sales but high profit.
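The relationship can also be quantified rather than only read off the scatter plot. This sketch computes the Pearson correlation and flags high-sales, low-margin rows; the data are made up, and the `Margin` column and 5% threshold are hypothetical choices for illustration.

```python
import pandas as pd

# Made-up transactions; column names are assumptions.
df = pd.DataFrame({
    "Sales": [100.0, 200.0, 300.0, 400.0, 500.0],
    "Profit": [10.0, 30.0, 25.0, 70.0, 5.0],
})

# Pearson correlation between sales and profit.
r = df["Sales"].corr(df["Profit"])
print(f"correlation: {r:.2f}")

# Profit margin helps locate the points that break the overall trend.
df["Margin"] = df["Profit"] / df["Sales"]
high_sales = df["Sales"] > df["Sales"].median()
low_margin = df["Margin"] < 0.05  # hypothetical threshold
high_sales_low_margin = df[high_sales & low_margin]
print(high_sales_low_margin)
```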
Data Modelling
Linear Regression Analysis
We use linear regression to analyze the factors that affect profitability. Linear regression is a suitable approach here because our target variable (profit) is continuous and the fitted coefficients are easy to interpret.
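A minimal sketch of such a regression, assuming scikit-learn and the predictors sales, quantity, and discount. The data are synthetic, generated from a known rule so the recovered coefficients can be checked; they say nothing about the actual Superstore results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic predictors; column names are assumptions.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Sales": rng.uniform(10, 1000, n),
    "Quantity": rng.integers(1, 10, n),
    "Discount": rng.uniform(0, 0.8, n),
})
# Synthetic target: profit rises with sales and falls with discount, plus noise.
y = 0.2 * X["Sales"] - 150 * X["Discount"] + rng.normal(0, 5, n)

model = LinearRegression().fit(X, y)

# One coefficient per predictor: the expected change in profit for a
# one-unit change in that predictor, holding the others fixed.
coefficients = pd.Series(model.coef_, index=X.columns)
r_squared = model.score(X, y)  # share of variance explained on this data
print(coefficients)
print(f"R^2: {r_squared:.3f}")
```

On real data, holding out a test set (for example with `train_test_split`) gives a more honest estimate of fit than scoring on the training rows.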
Coefficients:
Interpretation and Conclusion
k-means Clustering
Conclusion and Findings
We use k-means clustering to group transactions with similar sales and profit, which supports better decision making. Cluster 1 appears to have high sales and high profit, while cluster 2 has high sales and low profit. Cluster 0 appears to have low sales and low profit. The company should therefore direct resources towards the low-profit and low-sales areas while sustaining the profitable ones.
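A clustering of this kind could be sketched as follows, assuming scikit-learn, three clusters, and sales and profit as the features. The data are synthetic groups chosen to resemble the three profiles described, and the cluster numbers k-means assigns are arbitrary, so they need not match the numbering in the text.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Three synthetic groups given as (mean sales, mean profit).
rng = np.random.default_rng(0)
groups = [(100, 10), (900, 200), (900, -50)]
df = pd.concat(
    [
        pd.DataFrame({
            "Sales": rng.normal(sales, 20, 50),
            "Profit": rng.normal(profit, 10, 50),
        })
        for sales, profit in groups
    ],
    ignore_index=True,
)

# Standardise first so neither feature dominates the distance calculation.
scaled = StandardScaler().fit_transform(df[["Sales", "Profit"]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["Cluster"] = kmeans.fit_predict(scaled)

# Average sales and profit per cluster, used to describe each group.
cluster_profile = df.groupby("Cluster")[["Sales", "Profit"]].mean()
print(cluster_profile)
```

The number of clusters is an input to k-means. An elbow plot of `kmeans.inertia_` over several values of `n_clusters` is one common way to choose it.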