# HERIOT WATT UNIVERSITY

# COURSE CODE :C11BD

# COURSE NAME : Big Data Analytics

# STUDENT NAME: Shreya Parakh

# STUDENT ID : H00417178

# COURSEWORK 2 -INDIVIDUAL ASSIGNMENT

## Disclaimer: I verify that the presented work is entirely my own. I have not claimed ownership of anyone else's thoughts, words, or creations. All sources in this work are appropriately cited, and a list of references is provided below.

## Executive Summary

Superstore employed us to analyse their data, identify hidden opportunities to maximise profits, and uncover undetected deficiencies that may hamper the company's growth and profits. To carry out this analysis we used the Python programming language on the Deepnote platform. We first imported the client's data into Deepnote so that we could run algorithms on the dataset provided. Data received cannot be assumed to be perfect and error-free, so the quality of the data must be verified. This is done by cleaning the data, a procedure that deals with errors, misinformation and outliers. Once the data had been cleaned and any inconsistencies treated, it was ready for use. The variables were classified into two types, categorical and continuous; these were identified and presented diagrammatically, and then the linear regression modelling technique was deployed for further data analysis.

## 1. Introduction

In current times, with the growth of the internet and the Internet of Things, big data analysis has become highly significant. Business and technology have been changing hand in hand. Most companies have therefore opted to collect, store and process huge amounts of structured and unstructured data to identify trends and patterns that were previously unknown. This helps them increase their profitability and provide better goods and services to their customers. Superstore, being a retailer, has asked for data analysis with the aim of maximising its profits. The following steps were undertaken and modelling strategies adopted to do so.

## 2. Data Analysis and Strategies

We chose the Python programming language on the Deepnote platform to execute the algorithms needed to analyse the dataset provided by the client, the retailer Superstore. Several other languages are available, such as R, SQL (Structured Query Language) and SAS (Statistical Analysis System); however, we chose Python because it is known for being simple and readable, making it easy for beginners to learn and use. As data analysts we can concentrate on the analysis rather than become overly involved in the coding. Python is also excellent for manipulating and analysing data: it offers strong libraries such as NumPy and pandas that provide effective, practical capabilities for data manipulation, enabling analysts to carry out difficult operations like data cleaning, transformation and aggregation rapidly and simply. In addition, it has good cross-platform interoperability, which is advantageous for artificial intelligence and machine learning. (Payne, 2022)

We follow the steps below to analyse the data and identify ways in which the client, Superstore, can increase profitability.

### 2.1 Import the data

This is the very first step. Importing data in Python generally involves identifying the data source, setting up the required libraries, importing them into the script, reading the data, processing and transforming it, and visualising the outcomes.

The code below imports the dataset. The Superstore dataset was already provided to us, so we simply had to import it.

After importing the necessary libraries, we read the data from the source file using the appropriate function, as shown below. The pd.read_csv() function reads a CSV file into a pandas DataFrame.
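The original notebook cell is not preserved here; the sketch below illustrates the import step. A small in-memory sample (with illustrative sales and quantity values) stands in for the client's CSV file, whose name is not shown in the report; the real call was simply `pd.read_csv()` on that file.

```python
import io

import pandas as pd

# Small in-memory sample standing in for the client's CSV file; the order IDs
# appear in the report's preview, the numeric values are illustrative only
csv_text = io.StringIO(
    "Order ID,Sales,Quantity\n"
    "CA-2017-165204,261.96,2\n"
    "CA-2017-167626,731.94,3\n"
)

# Read CSV data into a pandas DataFrame
dataset = pd.read_csv(csv_text)

# Preview the first rows to confirm the file was read correctly
print(dataset.head())
```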

```
0    3783    CA-2017-165204
1    7322    CA-2017-167626
2    1709    CA-2017-123491
3    2586    CA-2015-121041
4     356    CA-2016-138520
```

After instructing the machine to import and read the data, the code `dataset.shape` is used to verify that the entire dataset has been accurately uploaded and read. This attribute returns the (rows, columns) count, confirming that all the data exists.
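A minimal sketch of the shape check, assuming the imported data lives in a DataFrame named `dataset` (a tiny stand-in frame is used here):

```python
import pandas as pd

# Stand-in DataFrame; in the report `dataset` holds the imported Superstore data
dataset = pd.DataFrame({"sales": [261.96, 731.94, 14.62],
                        "quantity": [2, 3, 2]})

# .shape is a (rows, columns) tuple confirming how much data was loaded
rows, cols = dataset.shape
print(rows, cols)
```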

### 2.2 Cleaning the Data

Data cleaning is one of the most critical steps in any analysis, because the data we receive from a client may be unstructured, inconsistent or incomplete. This could result in misleading reports being generated, limiting the benefit that can be accrued; here, it would limit Superstore's ability to maximise its profits. Data cleaning can be defined as the process of detecting and rectifying or eliminating errors, inconsistencies and inaccuracies in datasets. Its primary objective is to increase the reliability and usefulness of data by enhancing its quality and correctness. (Porritt, 2023)

The following techniques can be used to clean data (as covered in the lecture slides):

Removing duplicates: Eliminating duplicates by deleting similar rows or entries.

Handling missing values: locating and dealing with inaccurate or insufficient data, whether by deleting the faulty rows or columns or by imputing missing values using inferential analysis.

Standardizing data: transforming data into a consistent format so it can be benchmarked, which facilitates comparison and analysis.

Correcting errors: finding and fixing dataset errors such as typos, spelling mistakes and inaccurate data entries.

Handling outliers: Managing outliers entails locating and dealing with uncommon or abnormal values in the dataset, either by deleting them or by altering them using statistical methods.

Handling inconsistent data: Managing inconsistent data comprises locating and dealing with data that does not follow logic or business principles, or that is inconsistent with other data sources.

The dataset.info() code provides a concise summary of a DataFrame: the total number of rows and columns, the data type of each column, and the number of non-null values in each column. This is a crucial step while cleaning or pre-processing data. When dealing with a huge dataset, this method is extremely helpful for understanding its structure. (Database Concepts and Structures, n.d.)

It provides detailed information about the DataFrame:

Offers a brief explanation of the DataFrame to assist in understanding of the dataset's structure.

Reveals every column's data types, assisting you in finding possible data type issues or discrepancies.

Displays the count of non-null values in each column, which helps in detecting missing or insufficient data.

Helps you optimize the efficiency of your code by displaying the memory use.

Even though this code gives us information about the DataFrame, it has a few inherent limitations:

it fails to show the real data,

For really big datasets with several columns, the result of dataset.info() can be intimidating.

There is a chance that the function fails to detect every possible data problem, such as poor data quality or formatting mistakes.
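The call itself is a single line; a minimal sketch, assuming the DataFrame is named `dataset` (a tiny stand-in frame is used here):

```python
import pandas as pd

# Stand-in frame; in the report this is the full Superstore dataset
dataset = pd.DataFrame({"sales": [261.96, 731.94], "returned": [False, True]})

# info() prints the index range, column dtypes, non-null counts and memory usage
dataset.info()
```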

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row ID 9994 non-null int64
1 Order ID 9994 non-null object
2 Order Date 9994 non-null object
3 Ship Date 9994 non-null object
4 Ship Mode 9994 non-null object
5 Customer ID 9994 non-null object
6 Customer Name 9994 non-null object
7 Customer_no 9994 non-null int64
8 Segment 9994 non-null object
9 Segment_no 9994 non-null int64
10 Country 9994 non-null object
11 City 9994 non-null object
12 State 9994 non-null object
13 State_no 9994 non-null int64
14 Postal Code 9994 non-null int64
15 Region 9994 non-null object
16 Region_no 9994 non-null int64
17 Product ID 9994 non-null object
18 Category 9994 non-null object
19 Category_no 9994 non-null int64
20 Sub-Category 9994 non-null object
21 Sub-Category_no 9994 non-null int64
22 Product Name 9994 non-null object
23 Product Name_no 9994 non-null int64
24 Sales 9994 non-null float64
25 Quantity 9994 non-null int64
26 Discount 9994 non-null float64
27 Profit 9994 non-null float64
28 Returned 9994 non-null bool
dtypes: bool(1), float64(3), int64(10), object(15)
memory usage: 2.1+ MB
```

In the code below we have renamed the columns, to avoid confusion in later coding.
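The renaming cell is not preserved; a minimal sketch using `rename()`, assuming a mapping from the original headings to the short names (`row_id`, `order_id`, `sub_category`) that appear in the later output:

```python
import pandas as pd

# Empty frame carrying only the original column headings
dataset = pd.DataFrame(columns=["Row ID", "Order ID", "Sub-Category"])

# Map the original headings to short, consistent names used in later code
dataset = dataset.rename(columns={
    "Row ID": "row_id",
    "Order ID": "order_id",
    "Sub-Category": "sub_category",
})
print(list(dataset.columns))
```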

Identifying Missing Values

The data received might contain missing values, which would limit its quality. Missing values can result from various factors such as human data-entry errors, non-response to questionnaires, data loss or corruption, data-processing errors, and outliers. (Guide to Data Cleaning, n.d.)

The code below identifies any missing values within the dataset and reports the number of missing values in each column. This is important because, for instance, if there were three columns A, B and C and a value were missing in B, the machine could not detect the gap on its own and might effectively treat the value from column C as belonging to column B when the data is processed, producing irrelevant and misleading results.
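The missing-value check can be sketched as follows, assuming the DataFrame is named `dataset` (a small frame with one deliberately missing value is used here):

```python
import numpy as np
import pandas as pd

# Stand-in frame with one missing sales value for demonstration
dataset = pd.DataFrame({"sales": [261.96, np.nan, 14.62],
                        "quantity": [2, 3, 2]})

# isnull() flags every missing cell; sum() then counts them per column
missing_per_column = dataset.isnull().sum()
print(missing_per_column)
```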

```
row_id 0
order_id 0
order_date 0
ship_date 0
ship_mode 0
customer_id 0
customer_name 0
Customer_no 0
segment 0
Segment_no 0
country 0
city 0
state 0
State_no 0
postal_code 0
region 0
Region_no 0
product_id 0
category 0
Category_no 0
sub_category 0
Sub-Category_no 0
product_name 0
Product Name_no 0
sales 0
quantity 0
discount 0
profit 0
returned 0
dtype: int64
```

Since all the columns show a value of 0, we can conclude that there are no missing values within the dataset provided.

Duplication

Another factor to consider while cleaning data is duplicate values. Data analysts frequently employ data deduplication, which entails locating and deleting identical data entries. By removing duplicates, analysts can guarantee the precision, consistency and effectiveness of data analysis. (Guide to Data Cleaning, n.d.)

The code below identifies and deletes any rows that are duplicated in the dataset.

Afterwards we print the data shape to verify whether there were any changes in the rows and columns.
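The deduplication step can be sketched as follows, assuming the DataFrame is named `dataset` (a small frame with one duplicated row is used here):

```python
import pandas as pd

# Stand-in frame containing one exact duplicate row
dataset = pd.DataFrame({"order_id": ["A1", "A1", "B2"],
                        "sales": [10.0, 10.0, 20.0]})

# drop_duplicates() removes rows that are identical across every column
dataset = dataset.drop_duplicates()

# An unchanged shape would mean no duplicates were found
print(dataset.shape)
```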

```
(9994, 29)
```

Since the data shape is the same as before, we can conclude that the data is free from duplication.

Change the type of columns

It is crucial to modify the type of a column in a dataset so that we can manipulate the data mathematically and conduct insightful analysis. For instance, changing the Age column from a string data type to an integer data type enables us to compute summary statistics like the mean and standard deviation of ages, which are not feasible when the column is kept as a string.

Furthermore, some machine learning and statistical algorithms need particular data types as inputs. For instance, if a dataset contains a categorical variable saved as a string, it would need to be changed to a numeric data type in order to be used in a linear regression model since those models only accept numeric input data.

As a result, altering the type of a column in a dataset can aid in more accurate data analysis and modeling, as well as better decision-making based on the analysis's findings.

The code below enables changing the data types for further analytics .
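A minimal sketch of the conversion step, covering three of the columns; the target dtypes (string, datetime64[ns], category) match those shown in the summary that follows:

```python
import pandas as pd

# Stand-in frame with one row; values are illustrative
dataset = pd.DataFrame({"order_id": ["CA-2017-165204"],
                        "order_date": ["2017-11-08"],
                        "segment": ["Consumer"]})

# Convert each column to a type suited to later analysis
dataset["order_id"] = dataset["order_id"].astype("string")
dataset["order_date"] = pd.to_datetime(dataset["order_date"])
dataset["segment"] = dataset["segment"].astype("category")
print(dataset.dtypes)
```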

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 row_id 9994 non-null int64
1 order_id 9994 non-null string
2 order_date 9994 non-null datetime64[ns]
3 ship_date 9994 non-null datetime64[ns]
4 ship_mode 9994 non-null category
5 customer_id 9994 non-null string
6 customer_name 9994 non-null string
7 Customer_no 9994 non-null int64
8 segment 9994 non-null category
9 Segment_no 9994 non-null int64
10 country 9994 non-null string
11 city 9994 non-null string
12 state 9994 non-null string
13 State_no 9994 non-null int64
14 postal_code 9994 non-null Int64
15 region 9994 non-null category
16 Region_no 9994 non-null int64
17 product_id 9994 non-null string
18 category 9994 non-null category
19 Category_no 9994 non-null int64
20 sub_category 9994 non-null category
21 Sub-Category_no 9994 non-null int64
22 product_name 9994 non-null string
23 Product Name_no 9994 non-null int64
24 sales 9994 non-null float64
25 quantity 9994 non-null int64
26 discount 9994 non-null float64
27 profit 9994 non-null float64
28 returned 9994 non-null bool
dtypes: Int64(1), bool(1), category(5), datetime64[ns](2), float64(3), int64(9), string(8)
memory usage: 1.8 MB
```

Outliers

Data points known as outliers vary significantly from other observations in a dataset. Such observations may have values that are exceptionally high or low compared to the rest of the dataset. Exceptional occurrences, measurement mistakes, data-entry errors and other factors can all lead to outliers.

The analysis of data and modeling with statistics can be greatly influenced by outliers because their presence may distort the results and produce false conclusions. For instance, if an outlier is substantially higher than the rest of the values, this might substantially increase the mean value and give the impression that the data is more dispersed than it actually is. Regression analysis can produce incorrect forecasts when an outlier has an exaggerated effect on the regression line.(Suresh, 2020)

Outliers can be found and dealt with in a number of ways. A frequent practice is to use visualization tools such as box plots or scatter plots to find unusual values that differ noticeably from other observations. Another approach is to use statistical methods such as the Z-score or interquartile range (IQR) to find observations that are significantly different from the mean or median of the dataset.

There are numerous approaches for treating outliers once they have been found. One method is to eliminate them from the dataset. Another is to replace them with a more representative value, such as the dataset's median or mean. If the outliers are valid data points representing exceptional occurrences or extreme events, it may occasionally be preferable to leave them in the dataset.

To treat outliers I have opted for the IQR method.

The interquartile range (IQR) is a statistical measure of the dispersion or variability of a dataset. It is defined as the difference between the dataset's upper and lower quartiles.

A quartile is a form of quantile that splits a dataset into four equal groups, each containing an equal number of observations. The quartiles are identified as Q1, Q2 and Q3.

The first quartile, Q1 or the 25th percentile, denotes the value below which 25% of the data fall and above which 75% lie. Q2, also known as the second quartile or the 50th percentile, represents the median of the data: 50% of the data fall below it and 50% above. The third quartile, Q3 or the 75th percentile, is the point below which 75% of the data fall and above which 25% lie.

Mathematically, the IQR can be represented as:

IQR = Q3 - Q1

Observations that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are regarded as outliers when using the IQR method to find outliers. Depending on the study objective and the type of data, this threshold can be changed.

In contrast to other measures of spread like the range or standard deviation, the IQR is a robust measure that is less susceptible to outliers. Because it is based on the middle 50% of the data, it is less affected by extreme values and gives a more accurate representation of the typical spread of the data.

The IQR is a helpful indicator of a dataset's variability that may be utilised in summary statistics or visualisations like box plots. It offers details about the middle 50% of the data and can be used to compare the dispersion of different datasets.

The code below uses the IQR to treat outliers and then displays box plots to enable visual analysis.
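The IQR filtering step can be sketched as follows; a single illustrative `profit` column with one extreme value stands in for the report's numeric columns:

```python
import pandas as pd

# Illustrative numeric column; in the report this step runs on the cleaned dataset
dataset = pd.DataFrame({"profit": [5.0, 6.0, 7.0, 8.0, 9.0, 100.0]})

q1 = dataset.quantile(0.25)
q3 = dataset.quantile(0.75)
iqr = q3 - q1  # interquartile range per numeric column
print(iqr)

# Keep only rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
inside = ~((dataset < (q1 - 1.5 * iqr)) | (dataset > (q3 + 1.5 * iqr))).any(axis=1)
dataset = dataset[inside]
print(dataset.shape)  # a smaller shape means outliers were removed

# dataset.boxplot(column="profit")  # draws the box plot (requires matplotlib)
```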

```
sales 192.66000
quantity 3.00000
discount 0.20000
profit 27.63525
dtype: float64
```

In conclusion, we can observe that the dataset's shape has changed. This indicates that outliers were present and that the code has successfully identified and eliminated them from the columns sales, discount, profit and quantity.

Fig 1 box plot graph

Interpreting the four graphs above (Fig 1): the box plot is an efficient instrument to illustrate how data behaves both at the centre and at the extremities of the distribution. The box illustrates how values cluster around the median, a measure of central tendency. Whiskers extending to 1.5 times the interquartile range mark the extreme ends of the distribution; values that fall beyond this range are outliers, or unexpected values given the distribution's shape.

Here, for instance, in sales we can observe a large number of outliers above the upper end of Q3 (the bold black dotted points). This reduces the quality of the information we possess for sales and discourages its use for further analysis; we would therefore focus on other variables.

Removing outliers using Standard Deviation

Another method of treating outliers uses the standard deviation, provided the data follows a normal distribution, which is true in this case.

A technique frequently used to eliminate outliers from a dataset is standard deviation. The fundamental goal of utilizing standard deviation is to eliminate data values that substantially deviate from the dataset's mean.

The following steps are used:

Determine the dataset's mean and standard deviation.

Set an appropriate threshold for outliers that is typically 2 or 3 standard deviations from the mean.

Determine which data points deviate from the usual pattern and label them as outliers.

From the dataset, eliminate the outliers.

When following these steps, analysts must keep in mind that this approach assumes the data follows a normal distribution. Other techniques, such as the interquartile range (IQR) or median absolute deviation (MAD), may be more suitable if the data are not normally distributed. Another consideration is that eliminating outliers can have a big effect on the dataset's statistical characteristics, such as the mean, standard deviation and correlation coefficients. It is therefore crucial to take into account how eliminating outliers will affect the data analysis and interpretation. In some circumstances it might be preferable to keep the outliers or to employ different techniques to lessen their impact.

In the code below we ask the machine to identify the columns that hold numerical values, since the standard deviation can only be applied to such columns.

After selecting the numerical columns, we define the threshold, coded below as -3 and +3 standard deviations from the mean. Any data point falling outside this range is considered an outlier.

Once the threshold has been defined and the outliers identified, the code reports which columns in the dataset contain outliers, marking them "True" and outlier-free columns "False". The output also shows the number of outliers per column. This gives the analyst clarity as to which variables require further information-gathering and analytical work to gain deeper insights into increasing profitability.
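The z-score screening can be sketched as follows; a single illustrative `profit` column with one extreme value stands in for the report's numeric columns:

```python
import pandas as pd

# Illustrative numeric column with one extreme value (stand-in for the dataset)
dataset = pd.DataFrame({"profit": list(range(1, 21)) + [1000]})

# Select numeric columns only; z-scores are undefined for text columns
numeric = dataset.select_dtypes(include="number")

# Flag values more than 3 standard deviations from the column mean
z_scores = (numeric - numeric.mean()) / numeric.std()
outliers = z_scores.abs() > 3

print(outliers.any())  # True for columns containing outliers
print(outliers.sum())  # number of outliers per column
```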

```
Removing Outliers using 3 Standard Deviations

Outlier (Previously)  Outliers  Count  Column            Lower Limit  Upper Limit
False                 False            Customer_no          -305.173     1007.78
False                 False            Segment_no          -0.632653     3.94274
False                 False            State_no             -23.8815     69.4582
False                 False            Region_no            -0.44042      5.4956
False                 False            Category_no          0.230114     3.75028
False                 False            Sub-Category_no      -6.56334     20.0401
True                  True       194   sales                -200.955     352.359
False                 False            quantity             -2.23044     9.04445
True                  True         5   discount            -0.222397     0.41585
True                  True        87   profit               -38.5699     64.5687
```

From the output table above we can conclude that outliers are present in the columns sales, discount and profit. Sales has the highest number of outliers, discouraging us from using this factor when devising strategies to increase profits, as its data can be considered unreliable. In contrast, we could still use discount as a factor, as it has few discrepancies.

### 2.3 Summary Statistics of the Cleaned data

Summary statistics, commonly referred to as descriptive statistics, are a set of statistical indicators that provide a brief description of the key characteristics of a dataset. These metrics are employed to describe a dataset's central tendency, dispersion and shape. (1. Data Display and Summary | The BMJ, 2020)

Typical summaries of statistics include:

Mean - the dataset's arithmetic average.

Median - the midway value of the data arranged in ascending or descending order.

Mode - the most frequently occurring value in the dataset.

Range - the difference between a dataset's highest and lowest values.

Standard deviation - a measurement of how widely a dataset deviates from the mean.

Variance - an indicator of a dataset's dispersion.

Summary statistics are useful because they provide a brief summary of the data and can be used to spot correlations or interconnections between variables. By condensing the data into a small number of critical figures, analysts can communicate its major aspects to others. By conveying a dataset's properties, summary statistics can also aid decision-making; for instance, if a dataset's mean differs greatly from a reference value, it might indicate that the method used to produce the data was inaccurate. Summary statistics can likewise serve as a starting point for further investigation: a dataset's mean and standard deviation, for instance, can be used to compute confidence intervals, run hypothesis tests, or build predictive models.

The code below runs the summary-statistics function on the dataset.
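The call is pandas' `describe()`; a minimal sketch with a small stand-in frame (illustrative values):

```python
import pandas as pd

# Stand-in frame; in the report describe() runs on the cleaned dataset
dataset = pd.DataFrame({"quantity": [2, 3, 2, 5, 3],
                        "profit": [41.9, 219.6, 6.9, 2.5, 14.2]})

# describe() reports count, mean, std, min, the quartiles and max per column
summary = dataset.describe()
print(summary)
```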

```
count    7140.000000    7140.000000
mean     4986.187955     351.305882
std      2883.074634     218.826138
min         1.000000       1.000000
25%      2475.750000     164.000000
50%      4995.500000     337.000000
75%      7446.750000     534.000000
max      9993.000000     791.000000
```

Interpreting the table above, we compare the median, mode and mean of each column to understand whether these are in line with one another. This shows whether the data follows a normal distribution and whether the curve is unusual. If the mean varies a lot compared with the mode or median, there is a high possibility that the data is skewed.

We can also use the standard deviation to understand the dataset. A large standard deviation indicates that the data is spread widely around the mean, while a small one indicates values clustered close to it.

Similarly, the range can be employed for analysis. Range is the difference between the maximum and minimum values of a variable; the greater the range, the higher the variation within the dataset. For instance, in the given dataset both sales and profit have a high range, which may indicate higher variability within these variables.

### 2.4 Visually plotting the data

Data can be classified into two types: continuous and categorical.

Continuous data is quantifiable or numeric data in which the outputs can be any figure inside a given range, such as temperature, weight and height. Histograms, scatter plots and box plots are frequently used to visualise continuous data. (Types of Data in Statistics [with Example], n.d.)

Categorical data is qualitative or countable data whose results can be divided into distinct groups, such as gender, eye colour and race. Bar charts, pie charts and frequency tables are frequently used to display categorical data. (Types of Data in Statistics [with Example], n.d.)

All four variables (profit, sales, discount and quantity) can be used as continuous data for analysis. Both discount and quantity can be a reliable basis for our profit-maximisation strategies, in contrast to sales, which has a number of outliers and other issues limiting the validity of its information. We now proceed to use sales volume, also known as quantity, and present a histogram for it.

The code below first imports Plotly Express as px, which is used to create visual tools such as scatter graphs, bar charts and histograms.
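The notebook cell is not preserved; the aggregation behind the histogram (Fig 2) can be sketched with pandas, with the Plotly Express call shown as a comment (the quantity and profit values here are illustrative):

```python
import pandas as pd

# Illustrative rows standing in for the report's dataset
dataset = pd.DataFrame({"quantity": [2, 2, 3, 3, 3, 5],
                        "profit":   [40, 35, 60, 55, 50, 10]})

# Total profit per quantity sold; this is what the histogram displays
profit_by_quantity = dataset.groupby("quantity")["profit"].sum()
print(profit_by_quantity)

# With Plotly Express installed, the same view is drawn with:
#   import plotly.express as px
#   px.histogram(dataset, x="quantity", y="profit", histfunc="sum").show()
```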

Fig 2 - histogram: x = quantity, y = sum of profits

In the histogram above (Fig 2) the x axis shows sales quantity (volume) and the y axis the sum of profits, to understand in what quantities products were sold to increase profits. From the graph we can see that the highest revenue is earned from quantity bundles of 2 to 3. Going back to the information provided, we can observe that products such as stationery, home appliances and electronics, which were mostly sold in bundles of 2 or 3, are evidently profitable. Hence Superstore can focus on building greater stock of these to ensure availability, so that consumers do not shift to competitors in the event of a stockout. Superstore could also aim to keep a larger variety of these products and create promotional bundles, for example "buy 2 get 1 free", to tempt customers to buy them, which would increase Superstore's revenue.

For categorical data we have selected location, that is "state", and "segment" as the basis for our profitability analysis; below is the code for the scatter diagram.
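The aggregation behind the scatter plot (Fig 3) can be sketched with pandas, with the Plotly Express call shown as a comment; the state/segment profit figures below are illustrative stand-ins, not the client data:

```python
import pandas as pd

# Hypothetical state/segment profits standing in for the report's dataset
dataset = pd.DataFrame({
    "state":   ["Colorado", "Colorado", "Idaho"],
    "segment": ["Consumer", "Corporate", "Home Office"],
    "profit":  [64000, 62000, 61000],
})

# Total profit per state and segment; the scatter plot colours these values
profit_by_state_segment = dataset.groupby(["state", "segment"])["profit"].sum()
print(profit_by_state_segment)

# With Plotly Express installed, the chart is drawn with:
#   import plotly.express as px
#   px.scatter(dataset, x="state", y="segment", color="profit").show()
```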

Fig 3 - scatter graph (categorical data)

Interpreting the diagram (Fig 3), we have three factors to consider: segment, state and profit. The x axis shows the different US states in which Superstore operates, and the y axis the three product segments offered within those states. Different colours show various ranges of profit; for instance, blue shows a loss above 20K and yellow shows profits of more than 60K.

Visually analyzing this graph

In the Home Office segment the most profitable state is Idaho, earning a profit greater than 60K, followed by New York with 46K and Louisiana with 41K. In the Corporate segment the three most profitable states are Colorado with earnings of 64K, South Carolina with 62K and Arkansas with 61K. Similarly, in the Consumer segment: Colorado with 64K, Nebraska with 50K and Connecticut with 40K. This analysis provides insight into which state has the highest demand for a particular segment. It can guide Superstore to the most profitable states for each segment and encourage it to focus on increasing revenue from those states, for instance by opening more branches or stores for the identified segment.

Also, just by looking at the graph and locating the yellow scatter points (profit of more than 60K), one can clearly see that the Corporate segment is the most popular of the three. Similarly, Colorado is the state in which Superstore earns the highest profits (64K in the Consumer segment and 62K in the Corporate segment) compared with the other states. Using this information, Superstore could open more branches within this state, or dig further to understand the rationale behind its performance and try to deploy the same business strategy in branches in other states to increase profits. It could also look at the worst-performing branch within a state and take measures either to improve its performance or to shut it down and redirect that money into other investments.

Visual analysis of these graphs may not provide detailed information, but it helps determine the areas to explore for further insight, saving management time and work. Graphs also make it easier to explain to managers who do not have the same background or expertise as a data analyst why a certain strategy is proposed.

### 2.5 Modelling Strategy

A modelling strategy is a methodical approach to developing and improving predictive models. The kind of problem we are trying to address, the data at our disposal and the objective of the model all play a role in selecting an effective modelling technique. Some popular tools are:

Linear regression: a linear regression model can be employed to forecast a continuous target variable from one or more input variables. It works well in situations where there is a linear relationship between the input variables and the target variable.

Logistic regression: a logistic regression model forecasts a binary target variable from one or more input variables. It is appropriate for problems with a binary target, such as determining whether or not a customer will churn.

Decision trees: a decision tree is a flowchart-like model that makes judgements based on a set of rules. It works well in situations where the target variable is categorical or continuous and the input variables are numerical. (Gowani, 2022)

The issue in question and the data that are accessible must be carefully considered before selecting the best modelling technique.

For our analysis we could use the K-means clustering method, an unsupervised machine learning technique that groups similar data points together based on their commonalities. The procedure of dividing the data into K clusters is iterative; the user chooses the number of clusters based on domain expertise or experimentation. (Python, n.d.)

Benefits of K-means:

It is straightforward and simple to use.

It is a scalable method that can effectively handle massive amounts of data.

It can reveal information about the data's structure and locate logical groups.

K-means clustering is an effective tool for exploratory data analysis, but before applying it to a particular problem it is crucial to consider its constraints and underlying assumptions:

It requires the user to select the number of clusters, which can be difficult to determine.

It can be sensitive to how the centroids are initially positioned, which may lead to different solutions.

The clusters are assumed to be spherical and to have a constant variance.

It may not perform as expected on datasets that contain outliers or complicated structures.

K-means is a good method, but we are using linear regression for this report.
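The regression step itself is not shown in this chunk; a minimal sketch fits a linear model of profit on discount and quantity using ordinary least squares via NumPy (scikit-learn's `LinearRegression` would give the same fit). The predictor choice and the values below are illustrative assumptions, not the client data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: profit modelled against discount and quantity
dataset = pd.DataFrame({
    "discount": [0.0, 0.1, 0.2, 0.3, 0.4],
    "quantity": [2, 3, 2, 4, 3],
    "profit":   [50.0, 42.0, 30.0, 25.0, 10.0],
})

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones(len(dataset)),
                     dataset["discount"],
                     dataset["quantity"]])
y = dataset["profit"].to_numpy()

# Solve for [intercept, discount coefficient, quantity coefficient]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```

In this illustrative fit, a negative discount coefficient would suggest that heavier discounting is associated with lower profit, which is the kind of relationship the report aims to quantify.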