C11BD-Big Data Analytics (Individual Course Work 2)

Prepared by: H00441931

Prepared on: 18/03/2024

Link: https://deepnote.com/app/hwu-9db7/SANGEETH-ESWARANs-Coursework-2titled-project-be11eda5-c24b-430e-9ef6-fcd60e526103

Introduction

This report examines how a superstore can increase its profit by applying big data analytics (BDA) methods in Python, using the Deepnote notebook environment. Businesses use BDA to support strategic decisions by converting unorganized data into practical outcomes that yield operational or business intelligence (Niu, et al., 2021). BDA can be applied to the Superstore dataset through the data analytics process of data cleansing, data exploration and data mining, so that the data is analyzed from every dimension relevant to the business growth strategy.

Importing Big Data

Step 1: Create a new notebook in Deepnote and upload the csv file through the Files section in the left sidebar so that the notebook can read it.

Step 2: Add a code block by clicking the Code tab, import all the required libraries, and use the as keyword to give each library a short alias.

• Pandas is imported to read the csv file and assign its contents to the dataframe named data. It is a Python library with a large collection of built-in functions for data analysis and manipulation, and it is the main tool used here for working with the dataset. With pandas, operations can be applied to data of many different types (Chen & Betancourt, 2019). In addition, pandas can convert raw data into a dataframe through its DataFrame constructor.

Step 3: After uploading the csv file, run the code to read the data and obtain details such as the number of rows and columns.

# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from collections import Counter
from sklearn.ensemble import RandomForestRegressor

# Read the dataset
data = pd.read_csv('dataset_Superstore.csv')
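The row and column counts mentioned in Step 3 can be read from the dataframe's shape attribute. The sketch below is only an illustration: it parses a tiny in-memory csv whose columns and values are hypothetical stand-ins for dataset_Superstore.csv, so that it runs without the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for dataset_Superstore.csv (illustrative columns and values only)
csv_text = """Segment,Region,Category,Sales,Discount,Profit,Returned
Consumer,West,Furniture,261.96,0.0,41.91,False
Corporate,East,Technology,731.94,0.2,-219.58,True
Home Office,South,Office Supplies,14.62,0.0,6.87,False
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape
print("Rows:", n_rows, "Columns:", n_cols)

# dtypes shows how pandas interpreted each column
print(sample.dtypes)
print(sample.head())
```

With the real file, the same attributes would be read from the data dataframe created above.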

Data Cleaning

The data cleaning process generally involves converting messy data into organized data by identifying missing data, outliers and inconsistent data with appropriate statistical and programming functions (Brownlee, 2020).

Benefit of Data Cleaning

• Accurate insights can be formed for efficient decisions.
• Time is saved and productivity improves.
• Errors and inconsistent data can be removed at the beginning of data analysis.

Process in Data cleaning

• Remove noisy data
• Remove outliers
• Fill in missing values
• Correct inconsistencies in data (Tsai, et al., 2016)

With Python, data cleaning can be carried out more quickly and efficiently than with conventional data cleaning methods (Bharathi, 2022). For instance, missing values and outliers can be analyzed using built-in as well as custom functions. Missing values can be identified with the isnull() function, which flags the null entries in the dataset, and sum() then counts the missing values in each column. In the code below, missing values were analyzed using data.isnull().sum(), and the result shows that there is no missing data in the Superstore dataset.

# Data Cleaning
# Checking for missing values
missing_values = data.isnull().sum()
print(missing_values)
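Because the Superstore data has no missing values, the "fill in missing values" step is not exercised in this report. As a minimal sketch of what that step could look like, the toy dataframe below (hypothetical values) contains two gaps that are counted with isnull().sum() and then filled with the column median. Using the median is an assumption made here for illustration, not a choice taken from the coursework.

```python
import numpy as np
import pandas as pd

# Toy data with two deliberately missing entries (hypothetical values)
toy = pd.DataFrame({
    "Sales": [100.0, np.nan, 250.0, 80.0],
    "Discount": [0.0, 0.2, np.nan, 0.1],
    "Profit": [20.0, -5.0, 40.0, 10.0],
})

# Count missing values per column, as done for the Superstore data
missing_counts = toy.isnull().sum()
print(missing_counts)

# Fill each gap with the median of its own column
filled = toy.fillna(toy.median(numeric_only=True))

# Total number of missing values left after filling
remaining = int(filled.isnull().sum().sum())
print("Missing values after filling:", remaining)
```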

An outlier, on the other hand, is a value that is unusually high or low compared with the rest of the data. Such a value can be replaced with the mean or median during the analysis. One way to check for outliers is a box plot, also called a box-and-whisker plot, which visualizes the distribution of the data and its key components such as the median, the quartiles and the extreme values.

# Use Box Plot for Sales to check for outliers
plt.boxplot(data["Sales"])
plt.xlabel("Sales")
plt.ylabel("Value")
plt.title("Boxplot of Sales")
plt.show()

The outlier-detection function employs the Interquartile Range (IQR) method to identify outliers in a dataset. It takes two parameters: the dataframe containing the data and the feature(s) for which outliers are to be detected. For each feature, the function calculates the 25th percentile (Q1) and the 75th percentile (Q3). Q1 is the value below which 25% of the data points lie, while Q3 is the value below which 75% of the data points lie. The difference between Q3 and Q1 gives the IQR. To define the threshold for outliers, the function computes the outlier step, which is 1.5 times the IQR; values below Q1 minus the step or above Q3 plus the step are flagged. A Counter object tallies how often each row index is flagged, and the rows flagged in more than two features are returned.

A boxplot was created to illustrate the distribution of sales values, effectively highlighting the presence of outliers within the dataset.

# Checking for data outliers function.
def detect_outliers(data, features):
    outlier_indices = []
    for f in features:
        # 1st quartile
        Q1 = np.percentile(data[f], 25)
        # 3rd quartile
        Q3 = np.percentile(data[f], 75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # Detect outlier and their indices
        outlier_list_col = data[(data[f] < Q1 - outlier_step) | (data[f] > Q3 + outlier_step)].index
        # Store indices
        outlier_indices.extend(outlier_list_col)
    # Count how many features flagged each row index
    outlier_indices = Counter(outlier_indices)
    # Keep the rows flagged in more than two features
    multiple_outliers = list(i for i, v in outlier_indices.items() if v > 2)
    return multiple_outliers

# Check outliers
outliers = data.loc[detect_outliers(data, ["Sales", "Profit", "Discount"])]

Invoking the detect_outliers function with the dataframe and the features Sales, Profit and Discount returns the indices of the rows that the IQR rule flags in all three features, and the resulting table lists those rows.
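To make the IQR arithmetic concrete, the sketch below applies the same 1.5 × IQR rule to a single small series and then replaces the flagged value with the median, one of the replacement options mentioned earlier. The helper name iqr_bounds and the toy values are hypothetical and are not part of the coursework notebook.

```python
import numpy as np
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for one feature."""
    q1, q3 = np.percentile(values, [25, 75])
    step = k * (q3 - q1)
    return q1 - step, q3 + step


# Toy sales values with one obviously extreme entry (hypothetical data)
toy_sales = pd.Series([10, 12, 11, 13, 12, 11, 10, 500], dtype=float)

lower, upper = iqr_bounds(toy_sales)
print("Lower fence:", lower, "Upper fence:", upper)

# A value is an outlier when it falls outside the fences
is_outlier = (toy_sales < lower) | (toy_sales > upper)
print("Flagged values:", toy_sales[is_outlier].tolist())

# Replace flagged values with the median of the series
cleaned = toy_sales.where(~is_outlier, toy_sales.median())
print(cleaned.tolist())
```

For these values Q1 is 10.75 and Q3 is 12.25, so the IQR is 1.5, the outlier step is 2.25, and only the value 500 falls outside the fences of 8.5 and 14.5.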

Data Summary Statistics and Modification

Summary statistics give a quick overview of varied data in the form of the mean, median, mode and quartiles. They condense a large amount of data into a concise format that is easier to comprehend. In Python, the describe() function summarizes the data, giving a concise overview of the statistical properties of the csv data and of its distribution without extensive calculation or visualization (Downey, 2019).

# Data Modification and Summary Statistics
# Modify the 'Returned' column to 1 and 0
data['Returned'] = data['Returned'].astype(int)
print(data)
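As a small illustration of these two steps, the sketch below converts a Boolean column to integers and then calls describe() on a toy dataframe. The values are hypothetical; the point is only to show which rows describe() returns and what astype(int) does to True and False.

```python
import pandas as pd

# Toy data with a Boolean 'Returned' column (hypothetical values)
toy = pd.DataFrame({
    "Sales": [100.0, 200.0, 300.0, 400.0],
    "Returned": [True, False, False, True],
})

# True becomes 1 and False becomes 0
toy["Returned"] = toy["Returned"].astype(int)

# describe() reports count, mean, std, min, quartiles and max per numeric column
summary = toy.describe()
print(summary)

# Individual statistics can be looked up by row label
print("Mean sales:", summary.loc["mean", "Sales"])
print("Median sales:", summary.loc["50%", "Sales"])
```

After the conversion, the mean of the Returned column is the share of returned rows, which is one practical reason to store the flag as 1s and 0s.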

Data Visualization

Data visualization enhances the readability of data. Data can be visualized with various charts and plots to highlight important information or to explore hidden patterns. In Python, the matplotlib and seaborn libraries can be used for data visualization. Matplotlib is a library for creating a broad variety of plots, including line charts, bar charts, scatter plots, histograms and more, whereas seaborn builds on top of Matplotlib and is specifically designed for creating statistical graphics. As the goal is to increase the profit of the business, the variables that directly influence increases or decreases in profit are of particular interest. To visualize the relation between these variables, histograms, bar charts and scatter plots can be used (Mukhiya & Ahmed, 2020).

# Data Summary
print(data.describe())
# Data Visualization using bar chart and scatter plot
# Bar chart of different segments
plt.hist(data['Segment'], density=True)
plt.xlabel('Different Segments')
plt.ylabel('Density')
plt.show()

# Bar chart of sales by segment
plt.bar(data['Segment'], data['Sales'])
plt.xlabel('Segment')
plt.ylabel('Sales')
plt.title('Sales distribution by Segment')
plt.show()

# Bar chart of profit by segment
plt.figure(figsize=[12, 8])
sns.barplot(x="Segment", y="Profit", data=data, palette="Greens")
plt.show()

# Bar chart of sales by region
plt.figure(figsize=[12, 8])
sns.barplot(x="Region", y="Sales", hue="Category", data=data, palette="Greens")
plt.show()

# Bar chart of profit by region
plt.figure(figsize=[12, 8])
sns.barplot(x="Region", y="Profit", hue="Category", data=data, palette="Greens")
plt.show()

# Bar chart of Category vs. Discount vs. returned
plt.figure(figsize=[12, 8])
sns.barplot(x="Category", y="Discount", hue="Returned", data=data, palette="Greens")
plt.show()

# Scatter plot of sales vs. profit
plt.scatter(data['Sales'], data['Profit'])
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.title('Relationship between Sales and Profit')
plt.show()

# Scatter plot of discount vs. profit
plt.scatter(data['Discount'], data['Profit'])
plt.xlabel('Discount')
plt.ylabel('Profit')
plt.title('Relationship between Discount and Profit')
plt.show()

Since both features are continuous, a scatter plot was used to show the relation between them. The first scatter plot above shows a positive, upward trend between profit and sales. Similarly, the relation between discount and profit was plotted using a scatter plot, and the second plot shows an irregular relationship between profit and discount.
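A visual impression such as "upward trend" or "irregular relationship" can be complemented with a correlation coefficient. The sketch below computes Pearson correlations with pandas on a toy dataframe; the numbers are hypothetical and were chosen only so that sales and profit move together while discount does not follow a clear pattern. It is an optional cross-check, not a step taken in the coursework notebook.

```python
import pandas as pd

# Toy data (hypothetical values): profit rises with sales, discount is erratic
toy = pd.DataFrame({
    "Sales": [100, 200, 300, 400, 500],
    "Profit": [10, 25, 28, 45, 50],
    "Discount": [0.5, 0.0, 0.4, 0.1, 0.2],
})

# corr() returns the pairwise Pearson correlation matrix by default
corr = toy.corr()
print(corr.round(2))

# A value near +1 indicates a strong positive linear relation
sales_profit_r = corr.loc["Sales", "Profit"]
discount_profit_r = corr.loc["Discount", "Profit"]
print("Sales vs Profit:", round(sales_profit_r, 2))
print("Discount vs Profit:", round(discount_profit_r, 2))
```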

# Bar chart of Category vs. Profit vs. returned
plt.figure(figsize=[12, 8])
sns.barplot(x="Category", y="Profit", hue="Returned", data=data, palette="Oranges")
plt.show()

The figure above shows the relation among category, returned status and profit. It can be noticed that returned orders tend to carry a higher profit, which in turn means that overall profit is reduced when many of these products are returned.
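By default, each bar drawn by sns.barplot is the mean of the y variable within a group, so the numbers behind such a chart can be reproduced with a groupby. The sketch below does this for category and returned status on a toy dataframe with hypothetical values; it illustrates the aggregation only and does not reproduce the Superstore results.

```python
import pandas as pd

# Toy orders (hypothetical values); Returned is already encoded as 1/0
toy = pd.DataFrame({
    "Category": ["Technology", "Technology", "Technology",
                 "Furniture", "Furniture", "Furniture"],
    "Returned": [1, 1, 0, 1, 0, 0],
    "Profit": [100.0, 140.0, 60.0, 30.0, 10.0, 30.0],
})

# Mean profit per (Category, Returned) pair, the default bar height in sns.barplot
mean_profit = (
    toy.groupby(["Category", "Returned"])["Profit"]
    .mean()
    .unstack("Returned")  # one column per Returned value (0 and 1)
)
print(mean_profit)
```

Reading the table row by row gives the same comparison as the grouped bars: the mean profit of returned versus non-returned orders within each category.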

Data Modelling

For modelling, categorical features are assigned numerical values. Since the Returned column is of Boolean type and is related to increases and decreases in profit, it is converted into 1s and 0s. The conversion uses the astype(int) function, which is part of the pandas library and casts the series to integer values, as in the data modification step. Linear regression and random forest models were then used to examine the factors influencing company profitability.
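The modelling code refers to numeric columns named Segment_no, Region_no and Category_no. One possible way to derive such columns is sketched below using pandas category codes. This is an assumption for illustration: the toy values are hypothetical, and the actual mapping used in the notebook may differ.

```python
import pandas as pd

# Toy categorical data (hypothetical values)
toy = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Home Office", "Consumer"],
    "Region": ["West", "East", "South", "West"],
    "Category": ["Furniture", "Technology", "Office Supplies", "Technology"],
})

# Assumed encoding: categories are sorted alphabetically and numbered from 0
for col in ["Segment", "Region", "Category"]:
    toy[f"{col}_no"] = toy[col].astype("category").cat.codes

print(toy)

# The mapping can be recovered for reporting purposes
segment_mapping = dict(enumerate(toy["Segment"].astype("category").cat.categories))
print(segment_mapping)
```

Integer codes impose an arbitrary order on the categories. Tree-based models are fairly tolerant of this, whereas a linear regression on such a code treats the numbering as a real scale, which is worth keeping in mind when reading the fitted coefficients.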

# Data Modeling using Linear Regression and Random Forest
# Training and Testing the model using Linear Regression
def train_linear_regression(X, y):
    # Hold out the last 20% of rows for testing (no shuffling)
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
    model = LinearRegression()
    model.fit(x_train, y_train)
    prediction = model.predict(x_test)
    intercept = model.intercept_
    coefficients = model.coef_
    mae = mean_absolute_error(y_test, prediction)
    return intercept, coefficients, mae

# Relation between target and features
def plot_relation(X, y, intercept, coefficients):
    # Fitted line for a single-feature model
    prediction_y = coefficients * X + intercept
    plt.scatter(X, y)
    plt.plot(X, prediction_y, color='red')
    plt.xlabel(X.columns[0])
    plt.ylabel('Profit')
    plt.title(f'Relation between {X.columns[0]} and Profit using Linear Regression')
    plt.show()

Linear Regression

Linear regression is a fundamental statistical technique for modelling the relation between a dependent variable and one or more independent variables. Here profit is treated as dependent on other variables such as sales and discount. After training and testing the model, as indicated in the code below, the relation between each feature and profit is plotted using the fitted coefficient and intercept, and the mean absolute error is reported.

# Model and evaluate linear regression
X_features = ['Segment_no', 'Region_no', 'Category_no', 'Discount', 'Sales']
for feature in X_features:
    X = data[[feature]]
    y = data['Profit']
    intercept, coefficients, mae = train_linear_regression(X, y)
    print("Feature:", feature)
    print("Intercept:", intercept)
    print('Coefficients:', coefficients)
    print('Mean Absolute Error:', mae)
    plot_relation(X, y, intercept, coefficients)

The fitted lines show how the dependent variable, profit, responds to each independent variable. The regression line was almost flat for most features, implying little change in profit, except for discount, where profit decreased slightly as discount increased, as shown in the figure above.
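To show what the intercept, coefficient and mean absolute error represent, the sketch below fits a single-feature linear regression on synthetic data generated from a known line (slope 0.15, intercept -5, both chosen arbitrarily for illustration) and recomputes the MAE by hand. The data is synthetic and is not the Superstore data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: profit = 0.15 * sales - 5 plus random noise (assumed values)
rng = np.random.default_rng(0)
sales = rng.uniform(10, 1000, size=200)
profit = 0.15 * sales - 5 + rng.normal(0, 10, size=200)

X = pd.DataFrame({"Sales": sales})
y = pd.Series(profit, name="Profit")

# Same split style as the report: last 20% of rows held out, no shuffling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# MAE is the average absolute gap between actual and predicted profit
mae = mean_absolute_error(y_test, pred)
manual_mae = np.mean(np.abs(y_test.to_numpy() - pred))

print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_[0])
print("MAE (sklearn):", mae, "MAE (manual):", manual_mae)
```

The fitted coefficient should land close to the slope used to generate the data, and the MAE is expressed in the same units as profit, which makes it straightforward to interpret.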

Random Forest

Random forest is a powerful ensemble method that aggregates the results of an ensemble of simpler estimators (decision trees). According to Anand, et al. (2019), it is better suited than conventional modelling techniques for understanding company profitability from several features.
First, import train_test_split:
from sklearn.model_selection import train_test_split
Then create the train and test sets:
X_toTrain, X_toTest, y_toTrain, y_toTest = train_test_split(X, y, train_size=fraction)
where fraction is the fraction of rows used for training.
An important metric in random forests is the out-of-bag (OOB) score. It validates the model using, for each tree, the rows that were left out of that tree's bootstrap sample. For a regressor such as RandomForestRegressor, scikit-learn reports it as the R² of the out-of-bag predictions rather than as a count of correctly predicted rows. It is enabled by passing oob_score=True when constructing the model and is read after fitting from:
random_forest.oob_score_

# Using Random Forest to model the data
X = data[['Segment_no', 'Category_no', 'Returned', 'Region_no']]
y = data['Profit']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
random_forest = RandomForestRegressor(oob_score=True, max_depth=3, random_state=0)
random_forest.fit(X_train, y_train)

The OOB score is at most 1, and a value close to 1 indicates better performance; because it is an R² value for a regressor, it can also fall below 0 for a poor model. A score of 0.6074, as shown in figure 25, suggests that the Random Forest model performs moderately well on unseen data.
Feature importance: random forests also indicate how important each feature is. In the feature_importances_ attribute, importances are computed as the mean of the accumulated impurity decrease within each tree. It can be observed from figure 26 that sales contributes most to profit, followed by discount, region, segment and category.

# Out of bag score
print("Out of Bag Score:", random_forest.oob_score_)
# Feature importance
importance_df = pd.DataFrame({
    "Feature": X_train.columns,
    "Importance": random_forest.feature_importances_
})
importance_df = importance_df.sort_values(by="Importance", ascending=False)
plt.figure(figsize=[10, 6])
sns.barplot(x="Importance", y="Feature", data=importance_df)
plt.title("Feature Importances")
plt.show()
print("Reading CSV file...")
data = pd.read_csv('dataset_Superstore.csv')
print("CSV file read successfully.")
print("Data shape:", data.shape)
print("Data head:")
print(data.head())
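The behaviour of the OOB score and the feature importances can be checked on data where the true drivers are known. The sketch below builds synthetic data in which the target depends strongly on a Sales column, less on Discount and not at all on a Noise column, and then fits a RandomForestRegressor with the same max_depth and oob_score settings. All column names, coefficients and values here are assumptions for illustration and are unrelated to the Superstore results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with known drivers (assumed coefficients)
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Sales": rng.uniform(10, 1000, n),
    "Discount": rng.uniform(0, 0.8, n),
    "Noise": rng.normal(size=n),  # unrelated to the target
})
y = 0.2 * X["Sales"] - 100 * X["Discount"] + rng.normal(0, 5, n)

# oob_score=True asks the forest to score itself on out-of-bag rows
forest = RandomForestRegressor(
    n_estimators=100, oob_score=True, max_depth=3, random_state=0
)
forest.fit(X, y)

# For a regressor, oob_score_ is an R^2 value (at most 1)
print("Out of Bag Score:", forest.oob_score_)

# Impurity-based importances are normalised to sum to 1
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

In this setup the Sales column should rank first and the Noise column should receive an importance close to zero, which is the pattern one would look for when judging whether an importance ranking is plausible.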

Conclusion

This report identifies which variables play an important role in increasing profit. From the data preprocessing and data modelling, and in particular from the random forest feature importance, it can be inferred that the independent variable sales plays a much larger role in profit than the other variables. However, there are less obvious variables such as region, discount, returned status and category that can also help to increase profit. In particular, higher discounts and returned products (especially technology-related products) reduce the profitability of the company considerably. On the other hand, technology-related products and the home office customer segment contribute more to the sales and profitability of the company. Thus, a viable option would be a strategy that lowers the return rate and reduces discounts on technology products, since these products already generate more sales. In addition, price-sensitive offers and other marketing programmes should be introduced to target the consumer segment.

References

Anand, V., Brunner, R., Ikegwu, K. & Sougiannis, T., 2019. Predicting Profitability Using Machine Learning. Market Intelligence, p. 64.
Bharathi, N. V., 2022. Data Cleaning Techniques Using Python. AKNU Journal of Science and Technology, 1(1), pp. 11-21.
Brownlee, J., 2020. Data Preparation for Machine Learning. s.l.:s.n.
Chen, S. & Betancourt, R., 2019. pandas Library. Python for SAS Users, pp. 65-109; https://doi.org/10.1007/978-1-4842-5001-3_3.
Downey, A. B., 2019. Think Python. 2 ed. s.l.:Green Tea Press.
G, P. V. A., K, A. K. & Varadarajan, V., 2021. Estimating Software Development Efforts Using a Random Forest-Based Stacked Ensemble Approach. Electronics, 10(10), p. 1195; https://doi.org/10.3390/electronics10101195.
Larivière, B. & Poel, D. V. d., 2005. Predicting customer retention and profitability by using random forests and regression forests techniques. Expert Systems with Applications, 29(2), pp. 472-484.
Mukhiya, S. K. & Ahmed, U., 2020. Hands-On Exploratory Data Analysis with Python: Perform EDA techniques to understand, summarize, and investigate your data. 2 ed. s.l.:Packt.
Niu, Y. et al., 2021. Organizational business intelligence and decision making using big data analytics. Information Processing & Management, 58(6).
Sonu, S. B. & Suyampulingam, A., 2021. Linear Regression Based Air Quality Data Analysis and Prediction using Python. IEEE, pp. 1-7;doi.
Tsai, C.-W., Lai, C.-F., Chao, H.-C. & Vasilakos, A. V., 2016. Big Data Analytics. Big Data Technologies and Applications, pp. 13-52.
