HERIOT WATT UNIVERSITY
COURSE CODE: C11BD
COURSE NAME: Big Data Analytics
STUDENT NAME: Shreya Parakh
STUDENT ID: H00417178
COURSEWORK 2 - INDIVIDUAL ASSIGNMENT
Disclaimer: I verify that the presented work is all mine. I have not claimed ownership of someone else's thoughts, words, or creations by making them my own. All sources in this work are appropriately cited, and a list of references has been provided below.
Executive Summary
Superstore has engaged us to analyse their data, identify hidden opportunities to maximise profits, and uncover undetected deficiencies that may hamper the company's growth and profitability. To carry out this analysis we used the programming language Python on the Deepnote platform. We first imported the client's data into Deepnote so that we could run algorithms on the dataset provided. The data received cannot be assumed to be perfect and free from error, so we must verify its quality. This is done by cleaning the data, a procedure for dealing with errors, misinformation and outliers. Once the data has been cleaned and any inconsistencies treated, it is eligible for use. The data can be classified into two types, categorical and continuous; these were identified and presented diagrammatically, and then the modelling technique of linear regression was deployed for further analysis.
1. Introduction
In current times, with the growth of the internet and the internet of things, big data analysis has become highly significant. The dynamics of business and technology have been changing hand in hand. Therefore, most companies have opted to collect, store and process huge amounts of structured and unstructured data to identify trends and patterns that were previously unknown. This helps them increase their profitability and provide better goods and services to their customers. Superstore, being a retailer, has asked for data analysis with the aim of maximising its profits. To do so, the following steps were undertaken and modelling strategies adopted.
2. Data Analysis and Strategies
We chose the coding language Python on the Deepnote platform so that the machine could execute the numerous algorithms needed to analyse the dataset provided by the retailer Superstore, as requested by the client. Several other languages are available, such as R, SQL (Structured Query Language) and SAS (Statistical Analysis System), but we chose Python because it is known for being simple and easy to read, making it easy for beginners to learn and use. As data analysts, we can therefore concentrate on the analysis rather than become overly involved in the coding. Python is also excellent for manipulating and analysing data: it offers strong libraries such as NumPy and Pandas that provide effective and practical capabilities for data manipulation and analysis, allowing analysts to carry out difficult operations like data cleaning, transformation and aggregation rapidly and simply. In addition, it has good cross-platform interoperability, which is advantageous for artificial intelligence and machine learning. (Payne, 2022)
We followed the steps below to analyse the data and identify ways in which the client, Superstore, can increase profitability.
2.1 Import the data
This is the very first step. Importing data in Python generally involves identifying the data source, setting up the required libraries, importing those libraries into the script, reading the data, processing and altering it, and visualising the outcomes.
The code below sets up the library needed to load the dataset. The Superstore dataset was already provided to us, so we simply had to import it into the workspace.
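A minimal sketch of this step is shown below, assuming pandas is the library used (the original Deepnote cell is not reproduced here):

```python
# Import pandas, the data-manipulation library used throughout this analysis
import pandas as pd
```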
After importing the necessary libraries, we used the appropriate function to read the data from the source file, as shown in the code below. This code asks the machine to use the pd.read_csv() function to read the CSV file into a pandas DataFrame.
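A sketch of the read step, assuming the provided file is a CSV named superstore.csv (the actual file name and path may differ):

```python
# Read the client's CSV file into a pandas DataFrame called "dataset"
# (the file name below is an assumption; replace it with the actual path)
dataset = pd.read_csv("superstore.csv")
```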
After instructing the machine to import and read the data, the code dataset.shape is used to verify that the entire dataset has been accurately uploaded and read. This function returns the number of (rows, columns), confirming that all of the data is present.
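A sketch of the verification step, assuming the DataFrame is named dataset as above:

```python
# .shape returns a (rows, columns) tuple, confirming how many records and fields were loaded
print(dataset.shape)
```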
2.2 Cleaning the Data
Data cleaning is emphasised as one of the most critical steps in carrying out an analysis. This is because much of the data we receive from a client may be unstructured, contain inconsistencies or be incomplete. This could result in misleading reports being generated, limiting the benefit that can be gained; here, it would limit Superstore's ability to maximise its profits. Data cleaning can be defined as the process of detecting and rectifying or eliminating errors, inconsistencies and inaccuracies in datasets. Its primary objective is to increase the reliability and usefulness of data by enhancing its quality and correctness. (Porritt, 2023)
In order to clean the data, the following checks and techniques were used.
The dataset.info() code provides a concise summary of a DataFrame. It gives information about the DataFrame, including the total number of rows and columns, the data type of each column, and the number of non-null values in each column. This is a crucial step when cleaning or pre-processing data. When dealing with a huge dataset, this method is extremely helpful for understanding its structure. (Database Concepts and Structures, n.d.)
It also provides detailed information about the DataFrame by listing each column's name, its data type and its non-null count, together with the DataFrame's overall memory usage.
Even though this code gives us information about the DataFrame, it has a few inherent limitations: it does not provide any statistical summaries of the values, it does not reveal problems such as duplicate rows or outliers, and for very wide DataFrames its output may be truncated.
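A sketch of the summary call, assuming the DataFrame is named dataset:

```python
# Print a concise summary: column names, non-null counts, data types and memory usage
dataset.info()
```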
In the code below we have simply renamed the columns to avoid confusion when coding further.
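A sketch of the renaming step; the column mapping shown is illustrative rather than the exact mapping used in the original notebook:

```python
# Rename columns to short, consistent lower-case names to simplify later code
# (the mapping below is an assumption; the dataset's real column names may differ)
dataset = dataset.rename(columns={
    "Sales": "sales",
    "Quantity": "quantity",
    "Discount": "discount",
    "Profit": "profit",
})
```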
Identifying Missing Values
The data received might contain missing values, which could limit its quality. Missing values can result from various factors, such as human data-entry errors, non-response to questionnaires, data loss or corruption, data-processing errors, and outliers. (Guide to Data Cleaning, n.d.)
The code below is used to identify any missing values within the dataset and reports the number of missing values in each column wherever a cell is blank or a value is absent. This is important because, for instance, if there were three columns A, B and C and a value was missing in B, the machine might misalign the data and effectively place a value intended for column C into column B during processing, leading to irrelevant and misleading results.
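A sketch of the missing-value check, assuming the DataFrame is named dataset:

```python
# Count missing (NaN) values in every column; a non-zero count flags a column needing treatment
print(dataset.isnull().sum())
```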
Here, since all the columns show a value of 0, we can conclude that there are no missing values within the dataset provided.
Duplication
Another factor to be taken into consideration while cleaning data is duplicate values . Data analysts frequently employ methods like data deduplication, which entails locating and deleting identical data entries, to resolve duplication in a dataset. Analysts can guarantee the precision, consistency, and effectiveness of data analysis by deleting duplicate data.(Guide to Data Cleaning, n.d.)
The code below aims to identify and delete any rows that have been duplicated in the dataset.
After this, we print the data shape to verify whether there were any changes in the number of rows and columns.
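A sketch of the deduplication and verification steps, assuming the DataFrame is named dataset:

```python
# Remove exact duplicate rows, keeping the first occurrence of each
dataset = dataset.drop_duplicates()

# Print the shape again; if it matches the earlier shape, no duplicates were present
print(dataset.shape)
```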
Since the data shape is the same as before, we can conclude that the data is free from any duplication.
Change the type of columns
It is crucial to modify the type of a column in a dataset so that we can manipulate the data mathematically and conduct insightful analysis. For instance, changing the Age column from a string data type to an integer data type enables us to compute summary statistics like the mean and standard deviation of ages, which are not feasible when the column is kept as a string.
Furthermore, some machine learning and statistical algorithms need particular data types as inputs. For instance, if a dataset contains a categorical variable saved as a string, it would need to be changed to a numeric data type in order to be used in a linear regression model since those models only accept numeric input data.
As a result, altering the type of a column in a dataset can aid in more accurate data analysis and modeling, as well as better decision-making based on the analysis's findings.
The code below changes the data types to enable further analytics.
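A sketch of the type-conversion step; which columns actually required conversion is an assumption here:

```python
# Convert columns stored as text into numeric types so mathematical operations can be applied
dataset["sales"] = pd.to_numeric(dataset["sales"], errors="coerce")
dataset["quantity"] = pd.to_numeric(dataset["quantity"], errors="coerce")

# If the dataset contains an order-date column, it can be converted to a datetime type
# dataset["order_date"] = pd.to_datetime(dataset["order_date"])
```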
Outliers
Data points known as outliers vary significantly from other observations in a dataset. Such findings may have values that are exceptionally high or low compared to the rest of the dataset's data points. Exceptional occurrences, measurement mistakes, entry of information errors, and other factors can all lead to outliers.
The analysis of data and modeling with statistics can be greatly influenced by outliers because their presence may distort the results and produce false conclusions. For instance, if an outlier is substantially higher than the rest of the values, this might substantially increase the mean value and give the impression that the data is more dispersed than it actually is. Regression analysis can produce incorrect forecasts when an outlier has an exaggerated effect on the regression line.(Suresh, 2020)
Outliers can be found and dealt with in a number of ways. A frequent practice is to use visualisation tools such as box plots or scatter plots to find unusual values that differ noticeably from other observations. Another approach is to use statistical methods such as the Z-score or the interquartile range (IQR) to find observations that are significantly different from the mean or median of the dataset.
There are numerous approaches for treating outliers once they have been found. One method is to eliminate them from the dataset. Another strategy is to swap them out for a more representative value, such as the dataset's median or mean. If the outliers are valid data points that represent exceptional occurrences or severe events, it may occasionally be desirable to leave them in the dataset.
In order to treat outliers, I have opted for the IQR method.
The interquartile range (IQR) is a statistical depiction of the dispersion or variability of a dataset. It is defined as the difference between the dataset's upper and lower quartiles.
A quartile is a form of quantile used in statistics that splits a dataset into four equal groups, each containing the same number of observations. The quartiles are identified as Q1, Q2 and Q3.
The first quartile, Q1 or the 25th percentile, denotes the value below which 25% of the data lie and above which 75% lie. Q2, also known as the second quartile or the 50th percentile, is the median of the data: 50% of the data points fall below it and 50% above it. The third quartile, Q3 or the 75th percentile, is the point below which 75% of the data lie and above which 25% lie.
Mathematically, the IQR can be represented as:
IQR = Q3 - Q1
Observations that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are regarded as outliers when using the IQR method to find outliers. Depending on the study objective and the type of data, this threshold can be changed.
In contrast to other measures of spread such as the range or standard deviation, the IQR is a robust measure that is less susceptible to outliers. Because it is based on the middle 50% of the data, which is less affected by extreme values, it provides a more accurate representation of the typical spread of the data.
The IQR is a helpful indicator of a dataset's variability that may be utilised in summary statistics or visualisations like box plots. It offers details about the data's middle 50% and can be used to compare the dispersion of other datasets.
The code below uses the IQR to treat outliers and then displays box plots to enable visual analysis.
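A sketch of the IQR filtering and box plots, assuming the renamed column names sales, discount, profit and quantity:

```python
import matplotlib.pyplot as plt

# Columns treated for outliers
cols = ["sales", "discount", "profit", "quantity"]

for col in cols:
    # First and third quartiles and the interquartile range of the column
    q1 = dataset[col].quantile(0.25)
    q3 = dataset[col].quantile(0.75)
    iqr = q3 - q1

    # Keep only rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; everything outside is treated as an outlier
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    dataset = dataset[(dataset[col] >= lower) & (dataset[col] <= upper)]

# Shape after filtering; a smaller row count confirms that outliers were removed
print(dataset.shape)

# Box plots of the four variables for visual inspection (fig 1)
dataset[cols].plot(kind="box", subplots=True, layout=(2, 2), figsize=(10, 8))
plt.tight_layout()
plt.show()
```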
In conclusion, we can observe that the dataset's shape has changed. This indicates that outliers were present within the dataset and that the code has successfully identified and eliminated them from the respective columns: sales, discount, profit and quantity.
Fig 1: Box plot graphs
Interpreting the above four graphs (fig 1) using this methodology: the box plot is an efficient instrument for illustrating how data behave both at the centre and at the extremes of the distribution. The box illustrates how values cluster around the median, a measure of central tendency. The whiskers extend to values within 1.5 times the interquartile range, representing the edges of the distribution. Values that fall beyond this range are outliers, or unexpected values given the distribution's shape.
Here, for instance, in sales we can observe a large number of outliers beyond the upper end of Q3 (shown as bold black dots). This reduces the quality of the information we possess for sales and discourages its use for further analysis; therefore, we would focus on other variables for our analysis.
Removing outliers using Standard Deviation
Another method of treating outliers uses the standard deviation, provided the data follow a normal distribution, which is true in this case.
A technique frequently used to eliminate outliers from a dataset is standard deviation. The fundamental goal of utilizing standard deviation is to eliminate data values that substantially deviate from the dataset's mean.
The following steps are used: select the numeric columns, compute the mean and standard deviation of each column, standardise every data point as a z-score, and flag as outliers any points whose z-scores fall outside a chosen threshold (typically three standard deviations from the mean).
When following these steps, analysts must keep in mind that this approach assumes the data follow a normal distribution. Other techniques, such as the interquartile range (IQR) or median absolute deviation (MAD), may be better suited if the data are not normally distributed. Another factor to consider is that eliminating outliers can have a big effect on the dataset's statistical characteristics, such as the mean, standard deviation and correlation coefficients. Therefore, it is crucial to take into account how eliminating outliers would affect the data analysis and interpretation. In some circumstances, it might be preferable to retain the outliers or employ different techniques to lessen their impact.
In the code below we ask the machine to identify the columns that contain numerical values, since standard deviation can only be applied to such columns.
After selecting the numerical columns, we define the threshold for the normal distribution as the z-score range of -3 to +3, as coded below. Any data point that falls outside this range is considered an outlier.
Once the threshold has been defined and outliers identified, the code below reports which columns in the dataset contain outliers, returning "True" for columns with outliers and "False" for columns free of them. The output also shows the number of outliers per column. This gives the analyst clarity as to which variables require further information and analytical work to gain the deeper insights needed to increase profitability.
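A sketch of the z-score check, assuming the DataFrame is named dataset and a threshold of three standard deviations:

```python
import numpy as np

# Select only the numeric columns, since z-scores are undefined for text columns
numeric_cols = dataset.select_dtypes(include=np.number).columns

# Standardise each numeric column: z = (x - mean) / standard deviation
z_scores = (dataset[numeric_cols] - dataset[numeric_cols].mean()) / dataset[numeric_cols].std()

# Flag as an outlier any data point whose z-score lies outside the -3 to +3 range
outliers = z_scores.abs() > 3

# Report, per column, whether any outliers exist (True/False) and how many there are
print(outliers.any())
print(outliers.sum())
```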
From the output table above we can conclude that outliers are present in the sales, discount and profit columns. Sales has the highest number of outliers, discouraging us from using this factor when trying to find strategies to increase profits, as its data can be considered less reliable. In contrast, we could still use discount as a factor, as it has few discrepancies.
2.3 Summary Statistics of the Cleaned data
A set of statistical indicators called summary statistics, commonly referred to as descriptive statistics, provides a brief explanation of the key characteristics of a dataset. These metrics are employed to define a dataset's central tendency, dispersion, and shape.(1. Data Display and Summary | the Bmj, 2020)
Typical summary statistics include measures of central tendency (mean, median and mode), measures of dispersion (range, variance, standard deviation and interquartile range), and measures of shape such as skewness.
Summary statistics are useful because they provide a brief overview of the data and can be used to spot correlations or interconnections between variables. By condensing the data into a small number of critical figures, analysts can communicate its major aspects to others. By conveying a dataset's properties, summary statistics can also aid decision-making: for instance, if a dataset's mean differs greatly from a reference value, it might indicate that the method used to produce the data was inaccurate. Summary statistics can also serve as a starting point for further investigation; a dataset's mean and standard deviation, for instance, might be used to compute confidence intervals, run hypothesis tests or build predictive models.
The code below runs the summary-statistics function on the cleaned dataset.
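A sketch of the call, assuming the pandas describe() method is the summary-statistics function used:

```python
# Count, mean, standard deviation, min, quartiles and max for each numeric column
print(dataset.describe())
```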
Interpreting the tables above, we need to compare the median, mode and mean of each column to understand whether they are in line with one another. This indicates whether the data follow a normal distribution and whether the curve is unusual. If the mean varies a lot compared to the mode or median, there is a high possibility that the data are skewed.
We can also use the standard deviation to understand the dataset. A large standard deviation indicates that the data are spread widely around the mean, whereas a small standard deviation indicates that the values cluster close to the mean.
Similarly, the range can be employed for analysis. The range is the difference between the maximum and minimum values of a variable: the greater the range, the higher the variation within the dataset. For instance, in the given dataset both sales and profit have a high range, which may indicate higher variability within these variables.
2.4 Visually plotting the data
Data can be classified into two types: continuous and categorical.
Continuous data is quantifiable or numeric data whose values can be any figure within a given range, such as temperature, weight and height. Histograms, scatter plots and box plots are frequently used to visualise continuous data. (Types of Data in Statistics [with Example], n.d.)
Categorical data is qualitative or countable data that can be labelled, with results that can be divided into distinct groups such as gender, eye colour and race. Graphs such as bar charts, pie charts or frequency tables are frequently used to display categorical data. (Types of Data in Statistics [with Example], n.d.)
All four variables, profit, sales, discount and quantity, can be used as continuous data for analysis. Both discount and quantity can be reliable bases for our profit-maximisation strategies, in contrast to sales, which has a number of outliers and other issues that limit the validity of its information. We now proceed to use sales volume, also known as quantity, and present a histogram for it.
The code below first imports Plotly Express as px, which is used to create visual tools such as scatter graphs, bar charts and histograms.
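A sketch of the histogram code, assuming Plotly Express and the renamed columns quantity and profit:

```python
# Import Plotly Express, the high-level plotting interface used for the charts in this report
import plotly.express as px

# Histogram of total profit within each sales-quantity bucket (fig 2)
fig = px.histogram(dataset, x="quantity", y="profit", histfunc="sum",
                   title="Sum of profit by sales quantity")
fig.show()
```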
Fig 2: Histogram (x = quantity, y = sum of profits)
In the histogram above (fig 2) the x-axis represents sales quantity (volume) and the y-axis the sum of profits, to understand in what quantities products were sold to increase profits. From the graph we can see that the highest revenue is earned from bundles of 2 to 3 units. Going back to the information provided, we can observe that products such as stationery, home appliances and electronics, which were mostly sold in bundles of 2 or 3, are evidenced to be profitable. Hence Superstore can focus on building greater stock of these items to ensure availability, so that consumers do not shift to competitors in the event of a stockout. Superstore could also aim to keep a larger variety of these products and create promotional bundles, for example "buy 2 get 1 free", to tempt customers to buy them and thereby increase revenue.
For categorical data we have selected the location variable "state" and "segment" to base our profitability analysis on; below is the code for the scatter diagram.
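A sketch of the scatter code, assuming the column names state, segment and profit; aggregating profit per state and segment is an assumption about how the figure was built:

```python
# Total profit for each state and segment combination
profit_by_state = dataset.groupby(["state", "segment"], as_index=False)["profit"].sum()

# Scatter plot of state (x) against segment (y), coloured by total profit (fig 3)
fig = px.scatter(profit_by_state, x="state", y="segment", color="profit",
                 title="Total profit by state and segment")
fig.show()
```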
Fig 3: Scatter graph (categorical data)
To interpret the diagram (fig 3), we have three factors to consider: segment, state and profit. The x-axis shows the different US states in which Superstore operates, and the y-axis the three segments of products offered within those states. Different colours are used to show various ranges of profit; for instance, blue shows a loss above 20K and yellow shows profits of more than 60K.
Visually analyzing this graph
In the home and office segment the most profitable state is Idaho, earning a profit greater than 60K, followed by New York with 46K and Louisiana with 41K. In the corporate segment the three most profitable states are Colorado with earnings of 64K, South Carolina with 62K and Arkansas with 61K. Similarly, in the consumer segment the leaders are Colorado with 64K, Nebraska with 50K and Connecticut with 40K. This analysis provides insight into which state has the highest demand for a particular segment. This information can guide Superstore towards the most profitable states for each segment and encourage it to focus on increasing revenue from those states, for example by opening more branches or stores for the identified segment.
Also, just by looking at the graph and locating the yellow scatter points (profit of more than 60K), one can clearly see that the corporate segment is the most popular of the three. Similarly, Colorado is the state in which Superstore earns the highest profits (64K in the consumer segment and 62K in the corporate segment) compared to the other states. Using this information, Superstore could open more branches within this state, or dig further to understand the rationale behind its success and try to deploy the same business strategy in branches in other states to increase profits. It could also look at the worst-performing branch within a state and take measures either to improve its performance or to shut it down and use the money for other investments.
Visual analysis of these graphs may not provide detailed information, but it helps determine which areas to explore for further insight, thus saving management time and effort. Graphs also make it easier to explain to managers who do not share a data analyst's background or expertise why a certain strategy was proposed.
2.5 Modelling Strategy
A modelling strategy is a methodical approach to developing and improving predictive models. The kind of issue we are trying to address, the data at our disposal and the objective of the model all play a role in selecting an effective modelling technique. Popular techniques include linear regression, logistic regression, decision trees, clustering methods such as K-means, and neural networks.
The issue in question and the data that are accessible must be carefully considered before selecting the best modelling technique.
For our analysis we could use the K-means clustering method. K-means clustering is an unsupervised machine-learning method that groups similar data points together based on their commonalities. The procedure iteratively divides the data into K clusters, where the user chooses the number of clusters based on domain expertise or experimentation. (Python, n.d.)
Benefits of K-means include its simplicity, its computational speed and ability to scale to large datasets, and the ease with which its cluster assignments can be interpreted.
K-means clustering is an effective tool for exploratory data analysis, but before applying it to a particular problem it is crucial to thoroughly analyse its constraints and underlying assumptions: the number of clusters K must be specified in advance, the clusters are assumed to be roughly spherical and of similar size, and the results are sensitive to outliers and to the initial placement of the cluster centres.
K-means is a good method to use, but we are using linear regression for this report.
Linear regression
Linear regression is a statistical technique for modelling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. The coefficients of this equation are estimated so that the difference between the observed values and the values predicted by the line is as small as possible, which allows the analyst to quantify how changes in the independent variables are associated with changes in the dependent variable.
An effective statistical technique for data analysis, linear regression has the following advantages: it is simple to implement and easy to interpret, it is computationally efficient even on large datasets, and its coefficients directly quantify the strength and direction of each variable's effect on the outcome. (Satyavishnumolakala, 2020)
Limitations of linear regression
The linearity of the connection between the independent and dependent variables is an assumption made by linear regression. This indicates that linear regression may not be the optimal method if the relationship between the variables is not linear. Outliers can significantly affect the regression line and the predictions because linear regression is sensitive to them. Before applying linear regression, outliers must be found and dealt with. Overfitting: If the model is overly complicated, it may result in a model that fits training data quite well but does not translate well to fresh data. (Satyavishnumolakala, 2020)
Below we have run the code for the correlation analysis and linear regression, deploying a seaborn scatter plot.
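A sketch of the plotting code, assuming seaborn and the renamed numeric columns; the exact styling used in the original notebook may differ:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric variables, displayed as a heatmap (fig 4)
corr = dataset[["sales", "quantity", "discount", "profit"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Scatter plot of sales against profit with a fitted regression line (fig 5)
sns.regplot(x="sales", y="profit", data=dataset,
            scatter_kws={"alpha": 0.3}, line_kws={"color": "red"})
plt.show()
```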
Fig 4: Correlation heatmap
Above is the correlation heatmap (fig 4), which shows the correlation between any two factors. A value of positive one denotes perfect positive correlation: if one independent variable, x = quantity, and a dependent variable, y = profit, are such that y increases whenever x increases, the relationship is positively correlated. On the other hand, if x = discount and y = revenue and the correlation value is negative one, there is a perfect negative correlation between the two variables, meaning that as discounts increase, revenue earned reduces. If the correlation is zero, there is no linear relationship between the two variables.
From the chart presented above we can see a positive correlation of +0.4 between sales and profit. This tells us that we should dig deeper and gather more data about sales, as this will help with the aim of maximising profitability.
Fig 5: Correlation scatter graph with regression line
The graph above (fig 5) visually presents the positive linear relationship between sales and profit, as shown by the diagonal red line. The scattered dots represent the various data points within the dataset; the majority are clustered around a certain area, mostly near the mean, which indicates that the data are sufficiently reliable to be used and that the results are not greatly distorted. We do see a few scatter points located away from the cluster; these can be identified as outliers and then treated.
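The regression table discussed below can be produced with an ordinary least squares fit. The sketch below uses statsmodels; the explanatory variables chosen (and any encoding of categorical columns such as segment or a returns flag) are assumptions, since the exact model specification from the original notebook is not reproduced here:

```python
import statsmodels.api as sm

# Explanatory variables (assumed) and the dependent variable, profit
X = dataset[["sales", "quantity", "discount"]]
X = sm.add_constant(X)          # add an intercept term to the model
y = dataset["profit"]

# Fit an ordinary least squares regression of profit on the chosen variables
model = sm.OLS(y, X).fit()

# The summary prints the regression table: coefficients, standard errors,
# t-values, p-values, R-squared and the F-statistic
print(model.summary())
```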
The statistical findings of a linear regression study are presented in regression tables, which also provide estimates of the coefficients, their standard errors, t-values, p-values, and other statistics. A regression table can be interpreted as follows:
Firstly, the table identifies the dependent variable in the dataset; in Superstore's case, this is profit.
Next we interpret the coefficients listed in the table. The coefficients, also known as regression coefficients, beta coefficients or weights, show how much the dependent variable changes when an independent variable changes by one unit. A positive coefficient indicates that an increase in one variable is associated with an increase in the other, while a negative coefficient denotes an inverse relationship (an increase in one variable corresponds to a decrease in the other). The magnitude of the coefficient indicates the intensity of the association. Here a strong positive relationship is noticed for segment, at +1.32: as the quantity of sales within a segment increases, profit increases. Similarly, a negative relationship exists for returned goods, at -1.76, which shows that as returns increase, profit reduces. This analysis helps us understand the strength of the relationship between variables and the intensity with which each variable can affect profit.
Then we interpret the t-values. The standard errors evaluate the uncertainty of the estimated coefficients, and the t-values are computed by dividing each coefficient estimate by its standard error. The higher the absolute magnitude of the t-value, the greater the evidence against the null hypothesis that the coefficient is zero (i.e., the stronger the evidence of a significant relationship between the independent and dependent variables). Here sales has the highest t-value, 55, which indicates that it is the most important factor among the variables and can strongly affect profit; this is information Superstore's managers must pay attention to.
Next, we interpret the p-values. If the null hypothesis is correct, the p-value indicates the likelihood of obtaining a t-value as extreme as the one observed. A coefficient is typically considered statistically significant when its p-value is less than 0.05. Here the highest p-values are for returns (0.80) and quantity (0.90); since these are far above the 0.05 threshold, the corresponding coefficients are not statistically significant, meaning the evidence that these variables affect profit is weak.
Analysing R-squared: the coefficient of determination, or R-squared, measures how much of the variation in the dependent variable can be accounted for by the independent variables. A value of 0 denotes that no variation is explained, whereas a value of 1 denotes complete explanation of the variance. Higher R-squared values typically suggest a better fit between the model and the data. This was already analysed earlier in the report.
Finally, we interpret the F-statistic. The F-statistic evaluates the regression model's overall significance: a large F-statistic (and a small p-value) indicates that the dependent variable and at least one independent variable are significantly related.
The regression table summarizes the outcomes of the linear regression analysis and makes it possible to decipher the connections between the independent and dependent variables.
3. Conclusion
In summary, from the analysis above we were able to identify and visually present the various factors that affect the profitability of the client, Superstore. We defined which variables are related and the strength of those relationships, to understand what the client must focus on and where we as analysts could dig deeper to get better insights. As noted, Superstore should aim to increase sales of products in the consumer segment in states such as Colorado and New York, which are earning the highest profits among the various branches. These insights were developed using advanced modelling strategies and tools; given further and more detailed information, a better plan of action could be proposed.
4. References
Payne, J. (2022, April 6). Benefits of Python programming language. Developer.com. https://www.developer.com/languages/python/python-benefits/
Porritt, S. (2023, March 21). Data cleaning: Definition, methods & steps. TechnologyAdvice. https://technologyadvice.com/blog/information-technology/data-cleaning/
Database concepts and structures: The elements that make up a database - video & lesson transcript. (n.d.). Study.com. Retrieved March 27, 2023, from https://study.com/academy/lesson/database-concepts-and-structures-the-elements-that-make-up-a-database.html
Guide to data cleaning: Definition, benefits, components, and how to clean your data. (n.d.). Tableau. Retrieved March 27, 2023, from https://www.tableau.com/learn/articles/what-is-data-cleaning
Suresh, A. (2020, December 1). How to remove outliers for machine learning? Analytics Vidhya. https://medium.com/analytics-vidhya/how-to-remove-outliers-for-machine-learning-24620c4657e8
1. Data display and summary. (2020, October 28). The BMJ. https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/1-data-display-and-summary
Types of data in statistics [with example]. (n.d.). Retrieved March 27, 2023, from https://www.knowledgehut.com/blog/data-science/types-of-data-in-statistics
Gowani, I. (2022, June 24). Data modeling techniques and best practices. DATAVERSITY. https://www.dataversity.net/data-modeling-techniques-and-best-practices/
Python, R. (n.d.). K-means clustering in Python: A practical guide. Real Python. Retrieved March 27, 2023, from https://realpython.com/k-means-clustering-python/
Satyavishnumolakala. (2020, June 12). Linear regression - pros & cons. Medium. https://medium.com/@satyavishnumolakala/linear-regression-pros-cons-62085314aef0