# Coursework 2: C11BD

Chirag Lamba

H00454315

Big data analytics (C11BD)

## Introduction

In this assignment, we are given a dataset, Superstore.csv, a sample of data provided by a company that aims to improve its profits through data analytics. We explore big data analytics techniques to derive insights that can be applied to increase profitability. Each entry in the dataset is a distinct transaction within a particular time period, which lets us investigate patterns, trends, and focus areas.

## Dataset Analysis

This notebook analyzes the Superstore.csv data to understand factors impacting profit.

First, let's start by importing the necessary libraries and loading the dataset.
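The original code cell is not reproduced here, so the following is only a minimal sketch of the loading step, assuming pandas. The helper `load_superstore`, the inline `SAMPLE_CSV` rows, and the column names are hypothetical stand-ins so that the example runs without the real file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for Superstore.csv; the column names are
# assumptions about the real file, and the values are invented.
SAMPLE_CSV = """Order ID,Segment,Category,Sales,Quantity,Discount,Profit
CA-1,Consumer,Furniture,261.96,2,0.0,41.91
CA-2,Corporate,Office Supplies,14.62,2,0.0,6.87
CA-3,Home Office,Technology,957.58,5,0.45,-383.03
"""


def load_superstore(source):
    """Read transactions from a file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


df = load_superstore(io.StringIO(SAMPLE_CSV))
print(df.shape)  # (rows, columns)
```

With the real data, the call would be `load_superstore("Superstore.csv")`. If the file is not UTF-8 encoded, `pd.read_csv` accepts an `encoding` argument.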

### Explanation

## Data Cleaning

Next we take a look at the first few rows of the dataset to understand its structure and identify any potential data entry errors.
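A first look at the structure could be sketched as follows, assuming pandas. The small frame is a hypothetical stand-in for the loaded dataset, and its column names are assumptions.

```python
import pandas as pd

# Hypothetical rows standing in for the loaded dataset (column names assumed).
df = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Home Office", "Consumer"],
    "Sales": [261.96, 14.62, 957.58, 22.37],
    "Profit": [41.91, 6.87, -383.03, 2.52],
})

preview = df.head(3)  # first three rows
print(preview)
df.info()             # column names, non-null counts, and dtypes
```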

Having understood the structure of the dataset, we proceed with data cleaning. This includes detecting and correcting missing values, outliers, and other inconsistencies in the data.

First, we check for missing values and handle any that we find.
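One way to sketch this check, assuming pandas, is shown below. The data is hypothetical, and the handling strategy (median fill for numeric gaps, dropping rows without a category) is an illustrative assumption rather than the approach confirmed by the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with deliberate gaps (column names assumed).
df = pd.DataFrame({
    "Category": ["Furniture", None, "Technology", "Furniture"],
    "Sales": [261.96, 14.62, np.nan, 22.37],
    "Profit": [41.91, 6.87, -383.03, 2.52],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

cleaned = df.copy()
# Assumed strategy: fill numeric gaps with the column median ...
cleaned["Sales"] = cleaned["Sales"].fillna(cleaned["Sales"].median())
# ... and drop rows that have no category at all.
cleaned = cleaned.dropna(subset=["Category"])
```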

### Explanation

Next, we'll identify and handle any data entry errors. This might include typos, inconsistencies, or incorrect data types. Understanding data types is crucial for working with data effectively.
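As a sketch of what fixing data types and inconsistent entries can look like in pandas, the example below coerces text columns to dates and numbers and normalizes category labels. The rows and column names are hypothetical.

```python
import pandas as pd

# Hypothetical raw rows with typical entry problems (column names assumed).
df = pd.DataFrame({
    "Order Date": ["2016-11-08", "2016-06-12", "not a date"],
    "Segment": [" consumer", "Corporate", "CONSUMER "],
    "Sales": ["261.96", "14.62", "abc"],
})

# Unparseable entries become NaT / NaN instead of raising an error.
df["Order Date"] = pd.to_datetime(df["Order Date"], format="%Y-%m-%d", errors="coerce")
df["Sales"] = pd.to_numeric(df["Sales"], errors="coerce")

# Normalize whitespace and capitalization so identical labels match.
df["Segment"] = df["Segment"].str.strip().str.title()

print(df.dtypes)
```

Entries that could not be converted show up as missing values, so they can be handled in the same way as other gaps.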

### Identifying Outliers

Outliers are data points that significantly differ from other observations in the dataset. We can identify outliers by visualizing the data using histograms, box plots, or scatter plots.
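Alongside the plots, a numeric rule can flag outliers. The sketch below applies the common 1.5 × IQR fences to a hypothetical profit column. The multiplier 1.5 is the conventional default and an assumption here.

```python
import pandas as pd

# Hypothetical profit values with two obvious extremes.
profit = pd.Series([12.0, 15.0, 14.0, 10.0, 13.0, 400.0, -350.0, 11.0], name="Profit")

q1 = profit.quantile(0.25)
q3 = profit.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (profit < lower) | (profit > upper)

filtered = profit[~is_outlier]
print(f"fences: [{lower}, {upper}], removed {is_outlier.sum()} values")
```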

### Explanation

The above histogram plot shows the distribution of profit after removing outliers using the IQR method. From the visualization we derive:

### Interpretation:

Here we have removed records with negative profit. The boxplot shows the distributions of three variables: sales, quantity, and discount.

For additional analysis, we calculate the average profit for each product category and visualize the results.
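The per-category average can be sketched with a pandas group-by. The rows below are hypothetical and do not reflect the real dataset's figures.

```python
import pandas as pd

# Hypothetical rows (column names assumed, values invented for illustration).
df = pd.DataFrame({
    "Category": ["Furniture", "Furniture", "Technology", "Technology", "Office Supplies"],
    "Profit": [10.0, 30.0, 100.0, 50.0, 5.0],
})

# Mean profit per category, highest first.
avg_profit = df.groupby("Category")["Profit"].mean().sort_values(ascending=False)
print(avg_profit)
```

With matplotlib installed, `avg_profit.plot(kind="bar")` would draw the corresponding bar chart.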

### Explanation:

After cleaning the data, it's important to validate the changes made and ensure that the dataset is now ready for further analysis.
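A few post-cleaning checks could be sketched as below. `validate_cleaned` is a hypothetical helper, and the three checks it runs are assumptions about what "ready for analysis" means here.

```python
import pandas as pd


def validate_cleaned(df, numeric_cols):
    """Run simple sanity checks on a cleaned frame and return the results."""
    return {
        # No missing values anywhere in the frame.
        "no_missing": not df.isna().any().any(),
        # No fully duplicated rows.
        "no_duplicates": not df.duplicated().any(),
        # The listed columns hold numeric data.
        "numeric_ok": all(pd.api.types.is_numeric_dtype(df[c]) for c in numeric_cols),
    }


# Hypothetical cleaned rows (column names assumed).
df = pd.DataFrame({
    "Sales": [261.96, 14.62, 22.37],
    "Quantity": [2, 2, 3],
    "Profit": [41.91, 6.87, 2.52],
})

report = validate_cleaned(df, ["Sales", "Quantity", "Profit"])
print(report)
```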

### Explanation:

We have successfully imported and cleaned the dataset. Now we can proceed to modelling the data for further analysis.

## Exploratory Data Analysis

## Explanation and Results

From the results, the Consumer segment has the highest total profit and the highest sales volume, followed by the Corporate and Home Office segments. This is intuitive, as most companies focus primarily on consumers and allocate their resources accordingly.
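A segment comparison of this kind can be sketched with a group-by over sales and profit. The numbers below are invented to make the example run and are not the dataset's actual totals. The column names are assumptions.

```python
import pandas as pd

# Hypothetical transactions (values invented for illustration only).
df = pd.DataFrame({
    "Segment": ["Consumer", "Consumer", "Corporate", "Home Office"],
    "Sales": [100.0, 200.0, 150.0, 80.0],
    "Profit": [20.0, 30.0, 25.0, 10.0],
})

# Total sales and profit per segment, most profitable first.
totals = (
    df.groupby("Segment")[["Sales", "Profit"]]
    .sum()
    .sort_values("Profit", ascending=False)
)
print(totals)
```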

A positive correlation is observed: the general trend is upward, showing that profit tends to grow as sales grow. However, the data points are also widely scattered, showing that profit does not always vary in proportion to sales. Some data points show high sales but low profit, and others show low sales but high profit.
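The strength of such a relationship can be quantified with the Pearson correlation coefficient. The sketch below uses hypothetical sales and profit values, chosen so that the trend is upward but not perfectly proportional.

```python
import pandas as pd

# Hypothetical values: profit generally rises with sales, with some scatter.
df = pd.DataFrame({
    "Sales": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Profit": [2.0, 3.0, 7.0, 6.0, 12.0],
})

# Pearson correlation: +1 is a perfect upward line, 0 is no linear relation.
r = df["Sales"].corr(df["Profit"])
print(f"Pearson r = {r:.3f}")
```

A value close to but below 1 matches the picture of an upward trend with scattered points around it.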

## Data Modelling

### Linear Regression Analysis

We use linear regression to analyze the factors that affect profitability. Linear regression is a suitable approach here because our target variable (profit) is continuous.
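The fitting step can be sketched with ordinary least squares in NumPy. This is a swapped-in illustration, not the notebook's original cell, which may have used a library such as scikit-learn. The predictors (`Sales`, `Quantity`, `Discount`) are assumed, and the profit column is generated from a known formula so that the recovered coefficients can be checked.

```python
import numpy as np
import pandas as pd

# Hypothetical predictors (column names assumed).
df = pd.DataFrame({
    "Sales": [100.0, 250.0, 80.0, 400.0, 150.0, 300.0],
    "Quantity": [2, 5, 1, 7, 3, 4],
    "Discount": [0.0, 0.2, 0.0, 0.5, 0.1, 0.3],
})
# Synthetic target built from known coefficients, for illustration only.
df["Profit"] = 0.2 * df["Sales"] + 1.0 * df["Quantity"] - 100.0 * df["Discount"] + 5.0

features = ["Sales", "Quantity", "Discount"]
# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(len(df)), df[features].to_numpy(dtype=float)])
y = df["Profit"].to_numpy(dtype=float)

# Ordinary least squares: minimize ||X @ coef - y||^2.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept = coef[0]
slopes = dict(zip(features, coef[1:]))

# R^2: share of the variance in profit explained by the fitted model.
pred = X @ coef
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(intercept, slopes, r2)
```

Each slope estimates the change in profit for a one-unit change in that predictor with the others held fixed, which is how a negative discount coefficient would be read.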

### Coefficients:

## Interpretation and Conclusion

### k-means Clustering

### Conclusion and Findings

We use k-means clustering to group transactions with similar sales and profit so that decisions can be targeted at each group. Cluster 1 appears to have high sales and high profit, while cluster 2 has high sales and low profit. Cluster 0 seems to have low sales and low profit. The company therefore needs to allocate resources to the low-profit and low-sales areas while sustaining the profitable ones.
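To illustrate the clustering step, here is a minimal hand-written k-means (Lloyd's algorithm) in NumPy on hypothetical (sales, profit) pairs. It is a sketch, not the notebook's original cell, which may have used a library such as scikit-learn. The helpers `init_farthest` and `kmeans`, the deterministic farthest-point initialization, and the z-score scaling are assumptions made to keep the example reproducible.

```python
import numpy as np


def init_farthest(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        chosen = np.array(centroids)
        dists = np.linalg.norm(points[:, None, :] - chosen[None, :, :], axis=2)
        centroids.append(points[dists.min(axis=1).argmax()])
    return np.array(centroids)


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = init_farthest(points, k)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids


# Hypothetical (Sales, Profit) pairs forming three visible groups.
data = np.array([
    [50, 5], [60, 8], [40, 4],               # low sales, low profit
    [900, 300], [950, 320], [1000, 350],     # high sales, high profit
    [920, -50], [980, -20], [940, -40],      # high sales, low profit
], dtype=float)

# Standardize so sales and profit contribute on the same scale.
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
labels, centroids = kmeans(scaled, k=3)
print(labels)
```

Cluster numbers are arbitrary identifiers, so each cluster should be interpreted from its average sales and profit rather than from its label.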