# Coursework 2: C11BD

Submitted by: Muskan Valecha

Student ID: H00456189

Course name: Big Data Analytics (C11BD)

## Introduction

In this assignment, we work with the provided dataset "superstore.csv", which belongs to a company aiming to increase its profitability using the tools and techniques of big data analytics. By identifying potential strengths and areas that need improvement, we can discover trends and patterns in the data. We use modelling techniques to learn about the relationships between variables, and visualization charts to make those relationships easier to understand. The findings and conclusions drawn from the visualizations are explained, giving the insights required to determine the factors affecting profitability. We also identify areas for improvement and parts of the data that add no value to the analysis.

## Analysis

Step 1: We will import the data and necessary libraries. This is the initial stage of any data analysis task where we load the dataset into our Python environment.

1. We import pandas (pd) for data analysis and manipulation.

2. We import numpy (np) for scientific computation.

3. We import matplotlib.pyplot (plt) for creating visualizations.

4. We import seaborn (sns) for data visualizations.
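
The imports listed above, together with the loading step, could look roughly like the sketch below. The helper name `load_superstore` is hypothetical, and the sketch assumes the file is a plain comma-separated CSV with a header row; the actual file may need extra options such as a specific encoding.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def load_superstore(path="superstore.csv"):
    """Read the coursework dataset into a pandas DataFrame.

    Assumption: a comma-separated file with a header row.
    """
    return pd.read_csv(path)
```

Calling `df = load_superstore()` in the working directory that contains the file would then give the DataFrame used in the later steps; `df.head()` and `df.info()` are quick ways to confirm that it loaded as expected.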

### Data Cleaning

Data cleaning consists of spotting mistakes and outliers in the data set and fixing them. This process makes the data more accurate and ensures that it is fit for further analysis and model building, which will be used to derive meaningful insights into the profitability of the company (Raschka, Sebastian).

Step 2: The next step is to look for any missing values and duplicate rows in the dataset.
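
A check of this kind can be sketched with pandas as follows. The function name `data_quality_report` and the demo values are made up for illustration; only `isna` and `duplicated` are the actual pandas calls.

```python
import pandas as pd


def data_quality_report(df):
    """Return (missing values per column, number of fully duplicated rows)."""
    missing_per_column = df.isna().sum()
    duplicate_rows = int(df.duplicated().sum())
    return missing_per_column, duplicate_rows


# Tiny made-up frame, used only to show the calls (not the real dataset).
demo = pd.DataFrame(
    {
        "Sales": [100.0, 250.0, 250.0, 80.0],
        "Profit": [10.0, 30.0, 30.0, None],
    }
)
missing, duplicates = data_quality_report(demo)
```

On the real data, all-zero counts in `missing` and a `duplicates` value of 0 would correspond to the clean result reported for this step.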

We do not find any missing values or duplicate rows in our dataset, so every record is complete and distinct for the features used to derive insights.

Step 3: Next, we identify and remove the outliers from the data and visualize them graphically. We use the Z-score method to detect and remove outliers and then visualize the result using a boxplot.
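
A minimal sketch of this step is shown below. It assumes the filter is applied to the absolute Z-score of the 'Sales' column with a cutoff of 3; the function names and the demo data are hypothetical, and the exact filtering used in the coursework may differ.

```python
import pandas as pd
import matplotlib.pyplot as plt


def remove_outliers_zscore(df, column="Sales", threshold=3.0):
    """Keep rows whose absolute Z-score in `column` is at most `threshold`."""
    values = df[column]
    # Population standard deviation (ddof=0), the default of scipy.stats.zscore.
    z_scores = (values - values.mean()) / values.std(ddof=0)
    return df[z_scores.abs() <= threshold]


def plot_boxplot(df, column="Sales"):
    """Draw a boxplot of one column and return the figure."""
    fig, ax = plt.subplots()
    ax.boxplot(df[column].to_numpy())
    ax.set_ylabel(column)
    ax.set_title(f"{column} after outlier removal")
    return fig


# Made-up data: thirty ordinary sales plus one extreme value.
demo = pd.DataFrame({"Sales": [100.0] * 30 + [10_000.0]})
cleaned = remove_outliers_zscore(demo)
```

Here the single extreme value is dropped and the thirty ordinary rows are kept. Calling `plot_boxplot(cleaned)` would then draw the boxplot of the filtered column.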

### Explanation

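In this step, we identify and remove outliers from the 'Sales' data, defined as values with Z-scores exceeding 3. The boxplot confirms that some outliers remain even after this initial filtering. This is not unexpected: a boxplot conventionally draws its whiskers at 1.5 times the interquartile range, which is usually a stricter cutoff than a Z-score of 3 for skewed data. Outliers are the data points that fall outside the whiskers; in this boxplot, they fall essentially on the right-hand side, towards positive deviation in sales.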

As the boxplot above shows, the median sale is around $1000, and the remaining outliers lie mostly above the rest of the data. The distribution is right-skewed, which means that the range of sales above the median is wider than the range below it.

Step 4: Now that the errors and outliers have been corrected, we calculate summary statistics for the essential features.
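
The summary statistics can be produced with `DataFrame.describe`, as in the sketch below. The column names in `KEY_COLUMNS` are assumptions about the CSV header, and the demo values are made up.

```python
import pandas as pd

# Assumed column names; adjust them to match the actual CSV header.
KEY_COLUMNS = ("Sales", "Quantity", "Discount", "Profit")


def summarize(df, columns=KEY_COLUMNS):
    """Return count, mean, std, min, quartiles and max for each column."""
    return df[list(columns)].describe()


# Made-up values, used only to show the shape of the result.
demo = pd.DataFrame(
    {
        "Sales": [100.0, 200.0, 300.0],
        "Quantity": [1, 2, 3],
        "Discount": [0.0, 0.1, 0.2],
        "Profit": [10.0, -5.0, 25.0],
    }
)
summary = summarize(demo)
```

The `std` row of `summary` is the one to compare against the `mean` row when judging how widely each metric varies.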

### Findings from the summary

Here we have calculated summary statistics for essential features such as sales, quantity, discount, and profit. The summary shows high variation in the sales figures, with a large standard deviation across all metrics.

## Plotting The Data

Step 5: Plotting the Data

Now that the data is clean and ready for further analysis, we can visualize it by plotting the essential features and examining the relationships between the variables being compared.
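
One possible way to draw these plots is sketched below with matplotlib. The chart types (a bar chart of total profit per segment and a sales-versus-profit scatter plot), the column names 'Segment', 'Sales' and 'Profit', and the demo rows are assumptions for illustration, not necessarily the exact figures used in this report.

```python
import pandas as pd
import matplotlib.pyplot as plt


def profit_by_segment(df):
    """Total profit per segment, largest first."""
    return df.groupby("Segment")["Profit"].sum().sort_values(ascending=False)


def plot_overview(df):
    """Bar chart of profit by segment next to a sales-vs-profit scatter plot."""
    fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(11, 4))

    totals = profit_by_segment(df)
    ax_bar.bar(totals.index, totals.values)
    ax_bar.set_title("Total profit by segment")
    ax_bar.set_ylabel("Profit")

    ax_scatter.scatter(df["Sales"], df["Profit"], alpha=0.5)
    ax_scatter.set_title("Sales vs. profit")
    ax_scatter.set_xlabel("Sales")
    ax_scatter.set_ylabel("Profit")

    fig.tight_layout()
    return fig


# Made-up rows, used only to show the calls.
demo = pd.DataFrame(
    {
        "Segment": ["Consumer", "Consumer", "Corporate", "Corporate"],
        "Sales": [500.0, 300.0, 400.0, 200.0],
        "Profit": [120.0, 60.0, 50.0, -20.0],
    }
)
totals = profit_by_segment(demo)
```

Calling `plot_overview(df)` on the cleaned data would produce both panels in a single figure.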

### Interpreting Visualization

Fig 1 - This graph shows that the Consumer segment is the most profitable of all the business segments. This kind of information can help a company's management decide how to allocate the available resources. For instance, the company could invest more in the Consumer segment because it is the most profitable.

Fig 2 - The scatter plot shows a positive relationship between sales and profit: as sales grow, profit tends to rise as well. However, some outliers do not follow this trend. For example, there is a data point that represents high sales but low profit. This may be due to reasons such as a high cost of goods sold or larger discounts.

## Data Modelling

Step 6: Modeling Strategy Selection and Implementation

### k-means Clustering

Using k-means clustering, we identify the customer segments where the company is most profitable and the areas that need further improvement.
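
A minimal sketch of such a clustering is given below, using scikit-learn. It assumes the clusters are formed on the 'Sales' and 'Profit' columns with four clusters, and it standardizes the features first; the library choice, the scaling step, the function name and the demo data are all assumptions rather than the confirmed setup of this analysis.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_sales_profit(df, n_clusters=4, random_state=0):
    """Return a copy of `df` with a 'Cluster' label for each row."""
    features = df[["Sales", "Profit"]]
    # Standardize so that neither column dominates the distance calculation.
    scaled = StandardScaler().fit_transform(features)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(scaled)
    return df.assign(Cluster=labels)


# Made-up data: four clearly separated groups of three rows each.
demo = pd.DataFrame(
    {
        "Sales": [100, 110, 105, 1000, 1010, 990, 100, 110, 105, 900, 910, 905],
        "Profit": [10, 12, 11, 300, 310, 305, -200, -210, -205, -300, -310, -295],
    }
)
clustered = cluster_sales_profit(demo)
```

The numeric labels assigned by k-means are arbitrary, so a summary such as `clustered.groupby("Cluster")[["Sales", "Profit"]].mean()` is needed to tell which cluster is the profitable one and which is loss-making.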

### Findings:

Cluster 3 (in blue) appears to have the highest sales and the highest profit. Cluster 1 (in green) appears to have lower sales and lower profits than Cluster 3 but is still profitable. Cluster 2 (in red) appears to have low sales and high losses. Cluster 4 (in yellow) appears to cover a range of sales figures, all at a loss. Hence, the customer segments in Cluster 3 are the most profitable, and the customer segments in Cluster 4 need immediate attention.

### Linear Regression Analysis

Linear regression provides insights by defining the relationship between variables, which can then be used to forecast future trends (Lee, Wei-Meng). We fit the model and examine its coefficients, which indicate the strength and direction of these relationships.
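
The regression could be set up along the lines of the sketch below, using scikit-learn's `LinearRegression`. It assumes profit is the target and sales, quantity and discount are the predictors; the column names, the function name and the demo data are illustrative assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_profit_model(df, predictors=("Sales", "Quantity", "Discount"), target="Profit"):
    """Fit an ordinary least squares model and return it with its coefficients."""
    X = df[list(predictors)]
    y = df[target]
    model = LinearRegression().fit(X, y)
    coefficients = pd.Series(model.coef_, index=list(predictors))
    return model, coefficients


# Made-up data that follows Profit = 0.2*Sales + 1.0*Quantity - 50*Discount + 3.
demo = pd.DataFrame(
    {
        "Sales": [100.0, 200.0, 150.0, 400.0, 250.0, 300.0],
        "Quantity": [1, 3, 2, 5, 4, 2],
        "Discount": [0.0, 0.1, 0.2, 0.0, 0.3, 0.1],
    }
)
demo["Profit"] = 0.2 * demo["Sales"] + 1.0 * demo["Quantity"] - 50.0 * demo["Discount"] + 3.0
model, coefficients = fit_profit_model(demo)
```

The sign of each entry in `coefficients` gives the direction of the relationship with profit, and its magnitude gives the change in profit per unit change in that predictor with the others held fixed. Because the predictors are on different scales, the magnitudes are not directly comparable unless the features are standardized first.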

## Results And Conclusions

We used a combination of linear regression and k-means clustering because together they help the company identify how sales, quantity, and discounts impact profitability. Through k-means, we find the clusters that need immediate attention, and through regression, we assess the strength and direction of each factor's relationship with profit.

To summarize, the analysis yields insights into the company's sales data, profitability factors, customer segments, and improvement areas. This is achieved through data cleaning, visualization, clustering, and regression techniques. Our recommendations focus on leveraging insights from the customer segments and addressing the areas that require immediate attention in order to drive profitability improvements.

### References

Raschka, Sebastian. Python Machine Learning. Google Books, Packt Publishing Ltd, 23 Sept. 2015, books.google.co.uk/books?hl=en&lr=&id=GOVOCwAAQBAJ&oi=fnd&pg=PP1&dq=machine+learning+dataset+python&ots=Ne8vO9TSVK&sig=mRvBu23qqmjE1mhJakg7bqklsik#v=onepage&q=machine%20learning%20dataset%20python&f=false. Accessed 17 Mar. 2024.

Lee, Wei-Meng. Python Machine Learning. Google Books, John Wiley & Sons, 4 Apr. 2019, books.google.co.uk/books?hl=en&lr=&id=9FOQDwAAQBAJ&oi=fnd&pg=PP2&dq=machine+learning+dataset+python&ots=p-llArTSxC&sig=ru8FSh8_LfUQGLHbL7EjZdognQc#v=onepage&q=machine%20learning%20dataset%20python&f=false. Accessed 17 Mar. 2024.