C11BD-Big Data Analytics (Individual Course Work 2)
Prepared by: H00441931
Prepared on: 18/03/2024
Introduction
This report examines ways to increase the profit of a superstore using big data analytics (BDA) methods, implemented in Python in a Deepnote notebook. Businesses use BDA to make strategic decisions by converting unorganized data into practical outcomes that provide operational or business intelligence (Niu, et al., 2021). BDA can be carried out on the Superstore dataset through the data analytics process of data cleansing, data exploration and data mining, so that the data is analyzed efficiently from every dimension to support the business growth strategy.
Importing Big Data
Step 1: Create a new notebook in Deepnote and upload the CSV file through the Files section in the left sidebar so that it can be read by the software.
Step 2: Add a code block by clicking the Code tab, import all the required libraries, and use the as keyword to give each library a short alias.
• pandas is imported to read the CSV file and assign its contents to the data frame named data. It is a powerful collection of built-in functions for data analysis and manipulation in the Python programming language and is used for working with data sets. With the pandas library, operations can be performed on data regardless of its type (Chen & Betancourt, 2019). In addition, pandas can convert raw data into a data frame using the DataFrame constructor.
Step 3: After uploading the CSV file, run the command in the terminal to obtain the row and column counts of the data.
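The steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's original cell: the three-row sample and its column names are assumptions standing in for the uploaded Superstore file, which would normally be read with pd.read_csv and the file's path.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the uploaded Superstore CSV.
# The column names are assumptions, not the real file's schema.
SAMPLE_CSV = """Category,Sales,Discount,Profit
Furniture,261.96,0.0,41.91
Technology,907.15,0.2,90.72
Office Supplies,14.62,0.0,6.87
"""

# In the notebook, the argument would be the path of the uploaded CSV file.
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

# shape is a (rows, columns) tuple.
n_rows, n_cols = data.shape
print(f"rows={n_rows}, columns={n_cols}")
print(data.head())
```

The shape attribute gives the row and column counts directly, without scanning the data again.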
Data Cleaning
The data cleaning process generally involves converting messy data into organized data by identifying missing data, outliers and inconsistent data with appropriate statistical and programming functions (Brownlee, 2020).
Benefits of Data Cleaning
• Accurate insights can be formed for efficient decisions.
• Time is saved and productivity improves.
• Errors and inconsistent data can be removed at the beginning of the data analysis.
Process in Data cleaning
• Remove noisy data
• Remove outliers
• Fill in missing values
• Correct inconsistencies in data (Tsai, et al., 2016)
Using Python, data cleaning can be carried out more promptly and efficiently than with conventional data cleaning methods (Bharathi, 2022). For instance, missing values and outliers can be analyzed with little effort using Python's built-in functions as well as custom functions. Missing values can be identified with the isnull() method, which flags the null values within the dataset, and the sum() method can then be used to count the missing values in each column. In the notebook, missing values were analyzed using data.isnull().sum(), which showed that there is no missing data in the Superstore dataset.
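As a small illustration of the missing-value check described above, the sketch below applies isnull().sum() to a made-up data frame. The values and column names are assumptions chosen so that some cells are missing; the real Superstore data reportedly has none. The median fill at the end is one possible way to fill in missing values, not necessarily the one used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up data with deliberate gaps (assumption, not the Superstore data).
data = pd.DataFrame({
    "Sales": [100.0, np.nan, 250.0, 80.0],
    "Profit": [20.0, 5.0, np.nan, np.nan],
    "Region": ["East", "West", "East", "South"],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = data.isnull().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)

# One possible fix: fill numeric gaps with the column median.
filled = data.fillna(data.median(numeric_only=True))
print(filled)
```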
An outlier, on the other hand, is a value that is unusually high or low compared with the rest of the data. Such a value can be replaced with the mean or median during the analysis. One way to check for outliers is a box plot, also called a box-and-whisker plot, which visualizes the distribution of the data and the significance of its components.
The outlier-detection function employs the Interquartile Range (IQR) method to identify outliers in a dataset. It takes two parameters: the dataframe containing the data and the feature(s) for which outliers are to be detected. The function calculates the 25th percentile (Q1) and 75th percentile (Q3) for each feature in the dataset. Q1 is the value below which 25% of the data points lie, while Q3 is the value below which 75% of the data points lie. The difference between Q3 and Q1 is then calculated to obtain the IQR. To define the threshold for outliers, the function computes the outlier step, which is 1.5 times the IQR; values lying more than one outlier step below Q1 or above Q3 are treated as outliers. The function uses a Counter object to tally the occurrences of each outlier index.
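The description above could be implemented along the following lines. This is a sketch under stated assumptions: the function name detect_outliers, the min_hits parameter and the sample values are illustrative and are not taken from the original notebook.

```python
from collections import Counter

import numpy as np
import pandas as pd


def detect_outliers(df, features, min_hits=1):
    """Return row indices flagged by the IQR rule in at least min_hits features.

    min_hits is a hypothetical parameter added for illustration.
    """
    hits = Counter()
    for col in features:
        q1 = np.percentile(df[col], 25)  # 25th percentile
        q3 = np.percentile(df[col], 75)  # 75th percentile
        step = 1.5 * (q3 - q1)           # outlier step = 1.5 * IQR
        mask = (df[col] < q1 - step) | (df[col] > q3 + step)
        hits.update(df.index[mask])      # tally each flagged row index
    return [idx for idx, count in hits.items() if count >= min_hits]


# Made-up values: the last row is extreme in both columns.
data = pd.DataFrame({
    "Sales": [10, 12, 11, 13, 12, 11, 200],
    "Profit": [1, 2, 1, 2, 1, 2, -50],
})

outlier_rows = detect_outliers(data, ["Sales", "Profit"])
print(outlier_rows)
```

Tallying indices with a Counter makes it possible to require that a row is extreme in several features before it is treated as an outlier, for example by passing min_hits=2.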
A box plot was created to illustrate the distribution of sales values, which highlights the presence of outliers within the dataset.
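A box plot of this kind can be drawn with Matplotlib as sketched below. The sales values are made up for illustration, and the Agg backend is selected only so that the example runs without a display; in a notebook the figure would be rendered inline.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up sales values with one extreme point (assumption).
sales = [10, 12, 11, 13, 12, 11, 200]

fig, ax = plt.subplots()
# By default the whiskers extend to 1.5 * IQR beyond the quartiles;
# points outside the whiskers are drawn individually as fliers.
parts = ax.boxplot(sales)
ax.set_ylabel("Sales")
ax.set_title("Distribution of sales")

# The flier markers hold the values drawn as outliers.
flier_values = list(parts["fliers"][0].get_ydata())
print(flier_values)
```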
When the outlier-detection function is invoked with the dataframe and the features as arguments, the resulting table summarizes the data distribution, with the IQR percentage, for all variables in the dataset.
Data Summary Statistics and Modification
Generally, summary statistics give a quick abstract of varied data structures in the form of the mean, median, mode and quartiles. They outline a large amount of data in a concise format and so make the data easier to understand. In Python, the describe() function can be used to summarize data; it provides a concise overview of the statistical properties of the CSV data and gives the distribution of the data without extensive calculations or visualization (Downey, 2019).
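A brief sketch of describe() on a made-up data frame is given here; the numbers are assumptions chosen so that the output is easy to check by hand.

```python
import pandas as pd

# Made-up numeric columns (assumption, not the Superstore data).
data = pd.DataFrame({
    "Sales": [100.0, 200.0, 300.0, 400.0],
    "Profit": [10.0, 20.0, 30.0, 40.0],
})

# describe() reports count, mean, std, min, the quartiles and max
# for each numeric column.
summary = data.describe()
print(summary)

# Individual statistics can be read by row label and column name.
median_sales = summary.loc["50%", "Sales"]
```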
Data Visualization
Data visualization enhances the readability of data. Data can be visualized in different ways, using various charts and plots, to highlight important information or to explore hidden patterns. In Python, the Matplotlib and seaborn libraries can be used for data visualization. Matplotlib is a library for creating a broad variety of plots, including line charts, bar charts, scatter plots, histograms and more, whereas seaborn builds on top of Matplotlib and is designed specifically for creating statistical graphs. As the goal is to increase the profit of the business, the variables that directly influence increases and decreases in profit are of particular interest. To visualize the relationships between these variables, histograms, bar charts and scatter plots can be used (Mukhiya & Ahmed, 2020).
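The kinds of plot mentioned above can be produced as in the following sketch, which uses Matplotlib only. The data values and column names are assumptions for illustration, and the Agg backend is selected so that the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows with assumed column names.
data = pd.DataFrame({
    "Sales": [50, 120, 300, 450, 800, 950],
    "Discount": [0.0, 0.1, 0.2, 0.0, 0.4, 0.3],
    "Profit": [8, 20, 45, 90, -30, 60],
})

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: relationship between two continuous variables.
ax_scatter.scatter(data["Sales"], data["Profit"])
ax_scatter.set_xlabel("Sales")
ax_scatter.set_ylabel("Profit")

# Histogram: distribution of a single variable.
ax_hist.hist(data["Discount"], bins=5)
ax_hist.set_xlabel("Discount")
ax_hist.set_ylabel("Count")

fig.tight_layout()
```

In a notebook, the figure is displayed after the cell runs; seaborn offers comparable functions with statistical defaults.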
Since both features are continuous, a scatter plot was used to show the relationship between them. The figure above shows a positive, upward trend between profit and sales. Similarly, the relationship between discount and profit was plotted using a scatter plot, and the figure above indicates an irregular relationship between profit and discount.
The figure above shows the relationship among Category, Returned and Profit. It can be noticed that returned items carry the higher profit, which in turn means that overall profit is reduced when many products in a category are returned.
Data Modelling
For modelling, each categorical feature is assigned a numerical value. The Returned column is of Boolean type and is related to increases and decreases in the Profit column, so it is converted into 1s and 0s. The astype(int) method, which is part of the pandas library, converts the series into integer values. Linear regression and random forest models were then used to examine the factors that influence company profitability.
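The conversion described above might look like the sketch below. The sample rows are made up, and the use of category codes for the Category column is an assumption shown as one possible way to assign numerical values; the report does not state which encoding was used.

```python
import pandas as pd

# Made-up rows with assumed column names.
data = pd.DataFrame({
    "Returned": [True, False, True],
    "Category": ["Technology", "Furniture", "Technology"],
})

# Boolean to integer: True becomes 1 and False becomes 0.
data["Returned"] = data["Returned"].astype(int)

# Assumed encoding for a text column: pandas category codes,
# assigned in alphabetical order of the category labels.
data["Category_code"] = data["Category"].astype("category").cat.codes

print(data)
```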
Linear Regression
Linear regression is a fundamental statistical technique for modelling the relationship between a dependent variable and one or more independent variables. Here, profit is treated as dependent on the other variables, such as sales and discount. After the model was trained and tested in the notebook, the relationship between sales and profit was plotted and described using the coefficients, the intercept and the mean absolute error.
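A minimal sketch of this workflow with scikit-learn is shown here. The data is synthetic and noiseless, generated from an assumed linear rule, so that the fitted coefficients can be checked against known values; it does not reproduce the Superstore results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic, noiseless data from an assumed rule:
# profit = 0.2 * sales - 100 * discount + 5
rng = np.random.default_rng(0)
sales = rng.uniform(10, 1000, size=200)
discount = rng.uniform(0, 0.5, size=200)
profit = 0.2 * sales - 100 * discount + 5

X = np.column_stack([sales, discount])
y = profit

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))

print("coefficients:", model.coef_)  # change in profit per unit of each feature
print("intercept:", model.intercept_)
print("mean absolute error:", mae)
```

Each coefficient is read as the change in predicted profit for a one-unit change in that feature with the others held fixed, which is how a flat or falling reference line can be interpreted.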
The response of the dependent variable, profit, to each of the other independent variables was also obtained. The reference line was flat, implying no change in profit, except for the relationship between discount and profit, where there was a slight decrease in profit, as shown in the figure above.
Random Forest
Random forest is a powerful ensemble method that relies on aggregating the results of an ensemble of simpler estimators. It is better suited than conventional modelling techniques to understanding the profitability of a company based on various features (Anand, et al., 2019).
First, import train_test_split:
from sklearn.model_selection import train_test_split
Then create the train and test sets:
X_toTrain, X_toTest, y_toTrain, y_toTest = train_test_split(X, y, train_size=fraction)
where fraction is the fraction of the data used for training.
An important metric in random forests is the oob_score, or out-of-bag score. It is a way of validating the model on the rows left out of each tree's bootstrap sample: for a classifier it is the proportion of out-of-bag rows predicted correctly, while scikit-learn's regressor reports R² instead. After fitting with oob_score=True, the value is read from:
random_forest.oob_score_
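The split and the out-of-bag score can be put together as in the following sketch. It assumes a RandomForestRegressor, since profit is a continuous target, and uses synthetic data; the report does not state which estimator class or settings were used, so the score printed here is unrelated to the value reported for the Superstore data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data from an assumed rule with some noise.
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([rng.uniform(10, 1000, n), rng.uniform(0, 0.5, n)])
y = 0.2 * X[:, 0] - 100 * X[:, 1] + rng.normal(0, 5, n)

X_toTrain, X_toTest, y_toTrain, y_toTest = train_test_split(
    X, y, train_size=0.8, random_state=0
)

# oob_score=True asks the forest to score each training row using only
# the trees whose bootstrap sample did not contain that row.
random_forest = RandomForestRegressor(
    n_estimators=100, oob_score=True, random_state=0
)
random_forest.fit(X_toTrain, y_toTrain)

oob = random_forest.oob_score_                       # R^2 on out-of-bag rows
test_r2 = random_forest.score(X_toTest, y_toTest)    # R^2 on held-out rows
print("OOB score:", oob)
print("test R^2:", test_r2)
```

Comparing the out-of-bag score with the score on the held-out test set gives a quick check that the two estimates of performance on unseen data agree.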
The OOB score typically lies between 0 and 1, with values close to 1 indicating better performance. A score of 0.6074, as shown in figure 25, suggests that the random forest model performs moderately well on unseen data.
Feature importance: random forests provide a way to assess the importance of features. In feature_importances_, importances are computed as the mean and standard deviation of the accumulated impurity decrease within each tree. It can be observed from figure 26 that sales contributes most to profit, followed in order by discount, region, segment and category.
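Feature importances of this kind can be read from a fitted forest as sketched below. The data is synthetic, with an assumed rule in which Sales has the strongest effect and a pure-noise column has none; the feature names and the resulting ranking are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: Sales dominates, Discount matters less,
# and Noise is unrelated to the target (all assumptions).
rng = np.random.default_rng(0)
n = 300
feature_names = ["Sales", "Discount", "Noise"]
X = np.column_stack([
    rng.uniform(10, 1000, n),
    rng.uniform(0, 0.5, n),
    rng.normal(size=n),
])
y = 0.2 * X[:, 0] - 100 * X[:, 1] + rng.normal(0, 5, n)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Mean impurity-based importance across the trees (sums to 1).
importances = forest.feature_importances_
# Spread of the per-tree importances, useful as an error bar.
spread = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```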
Conclusion
This report identifies which variables play an important role in increasing profit. From the data preprocessing and the data modelling, and in particular from the random forest feature importance, it can be inferred that the independent variable sales plays a much larger role in profit than the other variables. However, there are less prominent variables, such as region, discount, returned and category, that can also help to increase profit. In particular, higher discounts and returned products (especially technology-related products) strongly affect the profitability of the company. On the other hand, technology-related products and the home office customer segment contribute more to the sales and profitability of the company. A viable option would therefore be to create a strategy that lowers the return rate and reduces the discounts on technology products, since these products already have higher sales. In addition, price-sensitive offers and other marketing programs should be introduced to focus on the consumer segment.
References
Anand, V., Brunner, R., Ikegwu, K. & Sougiannis, T., 2019. Predicting Profitability Using Machine Learning. Market Intelligence, p. 64.
Bharathi, N. V., 2022. Data Cleaning Techniques Using Python. AKNU Journal of Science and Technology, 1(1), pp. 11-21.
Brownlee, J., 2020. Data Preparation for Machine Learning. s.l.:s.n.
Chen, S. & Betancourt, R., 2019. pandas Library. Python for SAS Users, pp. 65-109. https://doi.org/10.1007/978-1-4842-5001-3_3.
Downey, A. B., 2019. Think Python. 2 ed. s.l.:Green Tea Press.
G, P. V. A., K, A. K. & Varadarajan, V., 2021. Estimating Software Development Efforts Using a Random Forest-Based Stacked Ensemble Approach. Electronics, 10(10), p. 1195. https://doi.org/10.3390/electronics10101195.
Larivière, B. & Poel, D. V. d., 2005. Predicting customer retention and profitability by using random forests and regression forests techniques. Expert Systems with Applications, 29(2), pp. 472-283.
Mukhiya, S. K. & Ahmed, U., 2020. Hands-On Exploratory Data Analysis with Python: Perform EDA techniques to understand, summarize, and investigate your data. 2 ed. s.l.:Packt.
Niu, Y. et al., 2021. Organizational business intelligence and decision making using big data analytics. Information Processing & Management, 58(6).
Sonu, S. B. & Suyampulingam, A., 2021. Linear Regression Based Air Quality Data Analysis and Prediction using Python. IEEE, pp. 1-7; doi.
Tsai, C.-W., Lai, C.-F., Chao, H.-C. & Vasilakos, A. V., 2016. Big Data Analytics. Big Data Technologies and Applications, pp. 13-52.