C11BD Big Data Analytics 2023-2024: Individual Coursework 2

Prepared by: H00440922

Date: 18/03/2024

Introduction

This study applies big data analysis to a dataset from a superstore model, with the objective of identifying insights that can improve its profitability. According to Yeng et al. (2022), the dataset covers a wide range of information, including identifiers for rows, orders, customers, and products. The analysis covers dispatch dates, shipping mode, customer and product details grouped by category, sales figures, discount levels, and return status. The dataset was examined thoroughly, beginning with acquiring and cleaning the data to correct errors and remove anomalies (Hussain and Kalaimani, 2022). Careful cleaning is essential to the accuracy of the analysis, and the approach taken to any irregularities is explained and justified. The analysis concludes with a modelling approach, such as regression, k-means clustering, or classification, to identify the factors that most affect the company's profitability.

Data Importing

The required libraries are imported first. pandas is used for data manipulation and analysis; it provides data structures and operations for working with tabular and time series data. matplotlib.pyplot and seaborn are used for data visualisation and offer a wide range of plot types that make the graphs more informative and readable. scipy.stats supplies the z-scores used later to detect outliers. From sklearn.model_selection, the train_test_split function divides the data into training and testing sets, which is essential for evaluating the model on data it has not seen. StandardScaler from sklearn.preprocessing standardises the features, and LogisticRegression from sklearn.linear_model is the classification model. The model is evaluated with classification_report, accuracy_score, confusion_matrix, roc_curve, and auc from the sklearn.metrics package, which compare the predicted and actual classes.

# Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import matplotlib.cm as cm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc

The code below loads the supplied dataset into a pandas DataFrame. The file "dataset_Superstore.csv" is read with the pandas.read_csv function. The dataset is a table of sales orders containing details such as order ID, ship date, customer name, and product category. Analyses of superstore data of this kind have been used to examine customer acquisition cost and to inform model design; here the data are loaded in order to analyse the business structure of the superstore model.

# Read the dataset into a DataFrame
df = pd.read_csv("dataset_Superstore.csv")

df.head()

Checking for missing values

The code below prints the number of null values in each column of the DataFrame. The customer orders are cleaned and checked so that the data are reliable and usable for the superstore analysis. Every column, including "Order ID", "Order Date", "Ship Date", and the customer details, is examined for missing values. The isnull().sum() method returns the number of missing values per column. The result shows 10 missing values in the "Customer Name" column, meaning that no customer name was recorded for 10 orders. Missing values can be handled by imputation, which replaces them with substitute values such as the median or mean of the column. Alternatively, more advanced techniques such as machine learning algorithms can predict missing values from the other data points. Outliers, in turn, can be addressed either by removing the values judged to be erroneous or by using robust statistical procedures.

print(df.isnull().sum())
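
The imputation options described above can be sketched on a small stand-in table. This is only an illustration under stated assumptions: the frame `toy`, its values, and the placeholder label "Unknown" are hypothetical, and the missing "Sales" value is planted to show median imputation (the coursework dataset reports missing values only in "Customer Name").

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the superstore data (values are illustrative only).
toy = pd.DataFrame({
    "Customer Name": ["Ann Lee", None, "Raj Patel", None],
    "Sales": [120.0, np.nan, 80.0, 100.0],
})

# Count missing values per column, as df.isnull().sum() does above.
missing_before = toy.isnull().sum()

cleaned = toy.copy()
# A name cannot be estimated from other rows, so flag it explicitly.
cleaned["Customer Name"] = cleaned["Customer Name"].fillna("Unknown")
# Fill numeric gaps with the column median, which is less sensitive
# to extreme values than the mean.
cleaned["Sales"] = cleaned["Sales"].fillna(cleaned["Sales"].median())

print(missing_before)
print(cleaned)
```

Whether to flag, impute, or drop the 10 affected rows depends on whether customer names are needed by the later analysis; the model in this report does not use them as a feature.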

Counts of Outliers in Numerical Columns

The code below counts the outliers in each numerical column and uses Matplotlib to draw a bar plot of the counts. A value is treated as an outlier when the absolute value of its z-score exceeds a threshold of 3. The plotting begins by creating a colormap ('rainbow') with as many colours as there are numerical columns. The figure is then created and a bar chart is drawn from the outlier counts, with each bar assigned a colour from the colormap. Additional plot elements include a title, labels for the x and y axes, rotated x-axis labels to improve readability, grid lines along the y-axis, and an annotation above each bar showing the outlier count. Finally, the layout is tightened and the plot is displayed.

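numerical_cols = ['Sales', 'Quantity', 'Discount', 'Profit']

# Z-scores per column; wrapping the result in a DataFrame keeps the column
# labels whether stats.zscore returns an array or a DataFrame
z_scores = pd.DataFrame(stats.zscore(df[numerical_cols]), columns=numerical_cols, index=df.index)
threshold = 3
outlier_mask = z_scores.abs() > threshold
outlier_counts = outlier_mask.sum(axis=0)

# One colour per column from the 'rainbow' colormap
# (plt.get_cmap is used because cm.get_cmap is unavailable in recent Matplotlib releases)
colors = plt.get_cmap('rainbow', len(numerical_cols))

plt.figure(figsize=(10, 6))
bar_plot = outlier_counts.plot(kind='bar', color=colors(range(len(numerical_cols))))
plt.title('Counts of Outliers in Numerical Columns')
plt.xlabel('Numerical Columns')
plt.ylabel('Number of Outliers')
plt.xticks(rotation=45)
plt.grid(axis='y')

# Annotate each bar with its outlier count
for index, value in enumerate(outlier_counts):
    plt.text(index, value + 0.1, str(value), ha='center', va='bottom')

plt.tight_layout()
plt.show()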
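
As one possible "robust statistical procedure" for the outliers counted above, the interquartile range (IQR) rule can be sketched as follows. This is an alternative technique, not the z-score method used in this report; the helper `iqr_outlier_mask`, the multiplier `k=1.5`, and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Illustrative values only; one extreme value is planted on purpose.
toy = pd.DataFrame({"Profit": [10.0, 12.0, 11.0, 13.0, 9.0, 12.0, 500.0]})


def iqr_outlier_mask(series, k=1.5):
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


mask = iqr_outlier_mask(toy["Profit"])
outlier_count = int(mask.sum())

# Winsorising: clip extremes to the fences instead of dropping the rows.
q1, q3 = toy["Profit"].quantile([0.25, 0.75])
fence_low = q1 - 1.5 * (q3 - q1)
fence_high = q3 + 1.5 * (q3 - q1)
toy["Profit_clipped"] = toy["Profit"].clip(lower=fence_low, upper=fence_high)

print(outlier_count)
print(toy)
```

Quartiles are barely affected by a few extreme values, whereas the mean and standard deviation behind a z-score are, so the two rules can flag different rows on skewed columns such as sales or profit.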

The following code prints summary statistics for the numerical columns, which support the data cleaning step. Understanding the distribution, central tendency, and variability of the data is crucial. The count row gives the number of non-null entries in each column. For the variable "Customer_no" the count is 9994, indicating a total of 9994 customer entries in the dataset.

print(df.describe())

Data Visualization

The following code computes and plots the mean profit per product category in the superstore dataset. The data are grouped by the 'Category' column with pandas, and the mean profit of each group is stored in the variable average_profit_by_category. Seaborn's barplot function then draws a bar chart from average_profit_by_category, with the product category on the x-axis and the average profit on the y-axis.

import seaborn as sns
import matplotlib.pyplot as plt

# Calculate average profit by category
average_profit_by_category = df.groupby('Category')['Profit'].mean().reset_index()

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Category', y='Profit', data=average_profit_by_category)
plt.title('Average Profit by Product Category')
plt.xlabel('Product Category')
plt.ylabel('Average Profit')
plt.xticks(rotation=45)
plt.show()
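
Because a mean can be pulled by a few very large or very negative orders, it may help to view it alongside other per-category summaries. The sketch below assumes a toy frame with hypothetical category names and profit values; only the 'Category' and 'Profit' column names come from the analysis above.

```python
import pandas as pd

# Hypothetical categories and profits, for illustration only.
toy = pd.DataFrame({
    "Category": ["Furniture", "Furniture", "Technology", "Technology", "Office Supplies"],
    "Profit": [-50.0, 150.0, 200.0, 400.0, 30.0],
})

# Several summaries per category in one pass, sorted by mean profit.
summary = (
    toy.groupby("Category")["Profit"]
    .agg(["count", "mean", "median", "sum"])
    .sort_values("mean", ascending=False)
)

print(summary)
```

A category with a high mean but a low count or low total contributes less to overall profit than the bar chart alone might suggest.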

The code below generates a scatterplot of the relationship between sales and profit in the dataset. The x-axis shows "Sales" and the y-axis shows "Profit", labelled with the plt.xlabel and plt.ylabel functions respectively, and the title "Sales vs. Profit" is set with the plt.title function. In the plot, sales range from 0 to 20000 and profit ranges from -5000 to 7500.

# Sales vs. profit as a scatter plot of two continuous variables
plt.figure(figsize=(5, 3))
sns.scatterplot(x='Sales', y='Profit', data=df)
plt.title('Sales vs. Profit')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.show()
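
The visual impression from a scatterplot can be complemented by a single number. As a minimal sketch, the Pearson correlation between the two columns can be computed with pandas; the values below are invented for illustration and are not taken from the superstore data.

```python
import pandas as pd

# Invented values, for illustration only.
toy = pd.DataFrame({
    "Sales": [100.0, 200.0, 300.0, 400.0, 500.0],
    "Profit": [10.0, 25.0, 20.0, 45.0, 40.0],
})

# Pearson correlation: +1 is a perfect increasing linear relationship,
# 0 means no linear relationship, -1 a perfect decreasing one.
pearson_r = toy["Sales"].corr(toy["Profit"])

print(round(pearson_r, 3))
```

The coefficient measures only linear association and is sensitive to outliers, so it should be read together with the plot rather than instead of it.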

Modelling Strategy

This code applies logistic regression to predict whether a superstore order is profitable and prints a classification report. A binary target, 'Is_Profitable', is set to 1 when 'Profit' is greater than zero, and five predictor variables are selected from the data frame: 'Sales', 'Quantity', 'Discount', 'Category_no', and 'Segment_no'. The dataset is separated into the features and the target variable, and both are split into training (80%) and testing (20%) sets. The features are standardised with a scaler fitted on the training set only. The classification report gives the precision, recall, f1-score, and support for each class, and the model achieves an accuracy of 0.93 on the test set.

# Binary target: 1 if the order made a profit, 0 otherwise
df['Is_Profitable'] = (df['Profit'] > 0).astype(int)

features = ['Sales', 'Quantity', 'Discount', 'Category_no', 'Segment_no']
X = df[features]
y = df['Is_Profitable']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy)
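
Two checks help put an accuracy figure in context: a majority-class baseline and the sign of the fitted coefficients. The sketch below runs on synthetic data, so its numbers say nothing about the superstore results; the labelling rule (heavy discounts make an order unprofitable), the sample size, and the use of a pipeline are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the superstore dataset).
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "Sales": rng.uniform(10, 1000, n),
    "Discount": rng.uniform(0.0, 0.8, n),
})
# Assumed rule for the synthetic label: heavy discounts make an order unprofitable.
toy["Is_Profitable"] = (toy["Discount"] + rng.normal(0, 0.05, n) < 0.3).astype(int)

X = toy[["Sales", "Discount"]]
y = toy["Is_Profitable"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling and the classifier in one pipeline, so the scaler is fitted
# on the training rows only.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# Baseline: accuracy obtained by always predicting the most common class.
majority_share = y_test.value_counts(normalize=True).max()
test_accuracy = pipe.score(X_test, y_test)

# Standardised coefficients: sign gives direction, magnitude gives relative weight.
coefs = pd.Series(pipe.named_steps["logisticregression"].coef_[0], index=X.columns)

print("Baseline:", round(majority_share, 3), "Model:", round(test_accuracy, 3))
print(coefs)
```

If most orders in a dataset are profitable, a high accuracy can partly reflect class imbalance, which is why the comparison with the baseline is informative.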

The code below computes a confusion matrix, a standard tool in machine learning for evaluating a classifier. The rows represent the true values of the target variable and the columns represent the predicted values. The cells on the diagonal count the correct predictions, whereas the cells off the diagonal count the incorrect predictions. With the classes 0 (not profitable) and 1 (profitable), the top-left cell gives the number of true negatives and the bottom-right cell the number of true positives; the top-right and bottom-left cells give the false positives and false negatives respectively.

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
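
To make the cell layout concrete, the four counts of a binary confusion matrix can be unpacked and turned into precision and recall by hand. The label vectors below are made up for illustration; `y_true` and `y_hat` are hypothetical names, not variables from the analysis above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = profitable, 0 = not profitable.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_hat = np.array([0, 1, 0, 1, 1, 0, 1, 1])

# For binary labels, ravel() returns the cells row by row:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(tn, fp, fn, tp)
print(precision, recall)
```

These hand-computed values agree with `precision_score` and `recall_score` for the positive class, which are the same quantities the classification report prints.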

The following code plots the Receiver Operating Characteristic (ROC) curve for the superstore model's predictions using the sklearn.metrics package. The roc_curve function computes the false positive rate (fpr), true positive rate (tpr), and thresholds from the probabilities that the model predicts for the test dataset, and the auc function computes the area under the curve (AUC). The ROC curve, drawn with matplotlib.pyplot, illustrates the model's performance by showing the trade-off between the true positive rate and the false positive rate, with the AUC value shown in the legend. The x-axis corresponds to the false positive rate and the y-axis to the true positive rate; the dashed diagonal marks the performance of a random classifier. The area under the ROC curve of 0.94 suggests that the classifier separates profitable and unprofitable orders well.

# ROC curve from the predicted probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test_scaled)[:, 1])
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(7, 7))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
# Diagonal reference line: a classifier that guesses at random
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
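
The thresholds returned by roc_curve can also be used to choose an operating point. The sketch below uses made-up scores and picks the threshold that maximises Youden's J statistic (tpr minus fpr); that criterion is one common choice assumed here for illustration, not something this report applies.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)

# Youden's J: the point on the curve furthest above the diagonal.
best = np.argmax(tpr - fpr)
best_threshold = thresholds[best]

print("AUC:", roc_auc)
print("Threshold maximising tpr - fpr:", best_threshold)
```

The area computed from the curve matches `roc_auc_score(y_true, scores)`, which is a shorter route when only the AUC value is needed.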

Model Interpretation

The big data analysis followed a systematic methodology aimed at improving the profitability of a superstore, covering data exploration, cleaning, visualisation, and modelling of the superstore dataset. The data were retrieved, cleaned, visualised, and modelled with several Python libraries and methods in order to uncover the factors that affect profitability (Ghosh and Neha, 2020). According to Machado (2023), using libraries such as pandas for data manipulation, matplotlib and seaborn for data visualisation, and sklearn for machine learning provides a robust framework for data analysis. These libraries are essential for managing large datasets, visualising patterns, and applying statistical models. The analysis loaded the superstore data from a CSV file containing essential information such as order ID, delivery date, customer name, and product category. Data of this kind are vital for understanding sales patterns (Zamil et al., 2020). Big data analysis can raise the profitability of the superstore by prioritising lucrative product categories, optimising discount strategies, and understanding customer preferences. The logistic regression model proved effective in identifying factors that contribute to profitability.

Conclusion

This study has shown how the superstore model can strengthen profitability through careful data analysis, data cleaning, data visualisation, and modelling, supported by big data analysis. The analysis was carried out in Python, with pandas used for data manipulation and analysis, matplotlib and seaborn for data visualisation, and sklearn for machine learning. The organisation can improve its marketing and sales strategies by focusing on the most lucrative product categories. This may include inventory adjustments, improved promotional strategies, and targeted advertising to drive sales within this business model. The effective use of logistic regression underscores the need to make decisions based on evidence. To boost profitability further, it is recommended that the company continue to use analytics in other operational areas, including customer segmentation, inventory management, and demand forecasting.

References

Yeng, L.J., Rani, M.N.A. and Radzuan, N.F.M., 2022. Data Analytics Model for Home Improvement Store. In Advances on Smart and Soft Computing: Proceedings of ICACIn 2021 (pp. 185-198). Springer Singapore.

Hussain, S. and Kalaimani, G., 2022. Predictive Analysis for Big Mart Sales Using ML Algorithms.

Ghosh, S. and Neha, K., 2020. Sales Analysis and Performance of Super Store Using Qlik GeoAnalytics. In Advances in Computational Intelligence and Informatics: Proceedings of ICACII 2019 (pp. 151-157). Springer Singapore.

Machado, J.D.F.U., 2023. New Challenges in Official Statistics: Big Data Analytics and Multi-level Product Classification of Web Scraped Data.

Zamil, A.M.A., Al Adwan, A. and Vasista, T.G., 2020. Enhancing customer loyalty with market basket analysis using innovative methods: a python implementation approach. International Journal of Innovation, Creativity and Change, 14(2), pp.1351-1368.
