C11BD: BIG DATA ANALYTICS 2023-2024 - INDIVIDUAL COURSEWORK 2

Introduction

The performance of a retail business is generally evaluated by analyzing its sales and profit margins over time or by using predictive statistical models such as regression. This report provides analytic insight into a retail supermarket’s current business position. The analysis has been performed in Python using linear regression and time-series visualization. The dataset contains sales records for the timeframe 2014-01-03 to 2017-12-30. During exploratory data analysis (EDA), a categorical bar graph and a continuous scatterplot are presented along with a statistical description of the data. For forecasting, a linear regression model is employed and its predicted outcomes are shown in a time-series line plot.

Aim and Objectives

Aim: This analytic report aims to analyze the data features stored in a sales record to forecast sales for the timeframe 2017-12-31 to 2021-12-27 using linear regression, and to visualize the predicted outcomes using a time-series line plot.

Objectives:
● To improve data quality by removing outliers during pre-processing of the data.
● To show the significance of a few features by means of statistical and visual interpretation.
● To forecast sales using a linear regression model and visualize the forecast through a time-series plot.

Research Questions

Research Question 1: Detect the outliers in the data to improve its quality.
Research Question 2: Provide feature significance by statistical description and by plotting a categorical bar graph and a continuous scatterplot.
Research Question 3: Forecast sales for the supermarket during the timeframe 2017-12-31 to 2021-12-27 using linear regression. Visualize the predicted outcomes using a time-series line plot to show whether the predicted sales are increasing or decreasing.

Background

Mathematical and statistical models are useful tools for analyzing the market position of a retail business. Linear regression is a statistical model that is commonly used for continuous data and for sales prediction. A sales forecast indicates the future market position of a company in terms of business growth. Business growth depends on operational sustainability, which in turn relies on market demand (Ma and Fildes, 2021). For manufacturing businesses, sustainability is said to be maintained when the rate of production of a high-demand product exceeds its rate of consumption. According to Teoh and Rong (2022), for a retail business such as a supermarket, sustainability is maintained only if the business manages to balance “economic”, “environmental”, and “social concerns” without disrupting the chain of market demand and the selling of products. Preserving this sustainability requires thorough knowledge of the current market state. To introduce stability in business operations, retail businesses often employ data analysts to provide information on the current situation and on the expected growth or contraction of the business. Analysts use their employers’ business records to describe the current business state and, on that basis, show whether the business is likely to grow or not (Bauer et al., 2022). Based on such analyses, business strategists plan how to improve the current situation.

Methods

This research provides such an analysis for the organization that generated the “Superstore.csv” record of its sales performance. The approach begins with pre-processing of the data to remove outliers from the dataset. The analysis then presents a statistical description of the data to show the measures of central tendency of each variable involved. It further produces a pair of graphs: a categorical bar plot showing the distribution of a categorical column in the dataset, and a continuous scatterplot showing any dependency between two indicators of the organization’s current market status. Sales are forecast with a linear regression model, and the predicted sales are examined with a time-series line plot.

The entire analysis has been performed in Python using a few of its pre-defined functions. Python provides diverse functionality for prediction models through the “scikit-learn” library. “Pandas”, on the other hand, is a library that makes data handling more flexible and structured, with a tabular dataframe structure and many functions for data type conversion and for operations like merging and joining (Fan et al., 2020). The library “scikit-learn” provides regression models that are pre-defined in the “sklearn.linear_model” module (Prell et al., 2020). The performance metrics for regression models include the “mean absolute error”, “mean squared error”, and “root mean squared error”, which quantify the error margin of a regression model (Smiti, 2020). A lower error margin indicates greater model accuracy.

The linear regression model applied to forecast sales for the period 2017-12-31 to 2021-12-27 has the following general form.

y = mx + c

Here, “y” is the predicted sales (the response variable), “x” is the predictor variable, “m” is the slope of the line, and “c” is the constant term (intercept). For a known set of values of x and y, “m” and “c” are calculated first. Then, for each new value of the predictor, the fitted m and c are used to produce a value of “y” as the predicted sales.
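The calculation of “m” and “c” can be illustrated with a minimal sketch. It uses the ordinary least-squares formulas on hypothetical toy data; the function name fit_line and the sample values are assumptions made for illustration and are not part of the report's implementation, which relies on scikit-learn.

```python
import numpy as np

def fit_line(x, y):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    c = y_mean - m * x_mean
    return m, c

# Hypothetical toy data generated from y = 2x + 5
days = [0, 1, 2, 3, 4]
sales = [5, 7, 9, 11, 13]

m, c = fit_line(days, sales)
predicted = m * np.array([5, 6]) + c  # forecast for the next two day indices
print(m, c, predicted)
```

On this toy data the fitted line recovers the generating slope and intercept, and the forecast simply extends that line to new values of x.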

Figure: Methodical implementation plan (Source: designed in draw.io)

The implementation plan thus includes the removal of outliers, feature analysis, and forecasting using a linear regression model, with the aim of predicting the future market position of the company.

Implementation, Results and Discussion

import numpy as np
import pandas as pd
import geopandas as gpd
import warnings
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
warnings.filterwarnings('ignore')

pip install geopandas

pip install contextily

df = pd.read_csv('dataset_Superstore.csv')

df.head(5)

The dataset has been loaded into the programming environment with the “read_csv()” function. The function takes the file name, including its extension, as a positional argument, and accepts optional parameters that specify the file encoding, the data separator, the presence of headers, and so on. The loaded data is held in the dataframe structure provided by the “Pandas” library.
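As a small illustration of the optional parameters, the sketch below reads a hypothetical two-row, semicolon-separated sample from memory. The sample contents and the name sample_df are assumptions for illustration only; the report itself calls read_csv() with the file name alone.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for a real file
sample = io.StringIO(
    "Order Date;Sales;Profit\n"
    "2014-01-03;16.45;5.55\n"
    "2014-01-04;288.06;-65.99\n"
)

sample_df = pd.read_csv(
    sample,
    sep=";",                     # field separator (the default is ",")
    header=0,                    # row 0 holds the column names
    parse_dates=["Order Date"],  # convert this column to datetime on load
)
print(sample_df.dtypes)
```

Parsing the date column at load time avoids a separate to_datetime() conversion later.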

def describe_data(dataset):
    """Print the date range, shape, null count, data types and structure of the dataset."""
    order_dates = pd.to_datetime(dataset['Order Date'])
    dtype_counts = dataset.dtypes.value_counts()
    print(f"The dataset contains sales record within the timeframe {order_dates.min()} - {order_dates.max()}.")
    print(f"The dataset has {dataset.shape[0]} rows, {dataset.shape[1]} columns and {(dataset.shape[0] - 1) * dataset.shape[1]} observations.")
    print(f"It has a total of {(dataset.isna().sum()).sum()} null values.")
    print(f"It contains data of {dtype_counts.count()} different data types - {dtype_counts.index[0]}, {dtype_counts.index[1]}, {dtype_counts.index[2]}, and {dtype_counts.index[3]}.")
    print("\nSummary of its structure is:")
    print(dataset.info())

describe_data(df)

The dataframe has 9994 rows, 29 columns and 289797 observations with no null values. Four data types are present in the dataset: “bool”, “float64”, “int64”, and “object”, associated with 1, 3, 10, and 15 columns respectively. Among the columns, which include “Order Date”, “Customer Name”, and “Postal Code”, the column “Sales” is the target column.

def detect_outliers_z_score(data):
    outliers = []
    threshold = 3
    # Calculating the z-scores for each column
    z_scores = np.abs((data - data.mean()) / data.std())
    # Iterating through all columns
    for col in z_scores.columns:
        column_outliers = z_scores[col][z_scores[col] > threshold]
        outliers.extend(column_outliers.index.tolist())
    # Returning unique outliers
    return list(set(outliers))

# Selecting only numerical columns for outlier detection
numerical_columns = df.select_dtypes(include=['float64', 'int64'])
# Detecting outliers with the "detect_outliers_z_score()" function
outliers = detect_outliers_z_score(numerical_columns)
print("Number of outliers detected:", len(outliers))
print("Indices of outliers:", outliers)

After the dataframe was loaded and described, pre-processing was performed, beginning with the detection of outliers as the adjacent figure illustrates. The z-score identifies outliers, i.e. data points that lie at a great distance from the mean of the data (Alghushairy et al., 2020). The z-score of a data point is its distance from the column mean divided by the column’s standard deviation (SD), so it expresses that distance in units of SD. Data points whose absolute z-score exceeds the threshold of 3, i.e. those lying more than three SDs from the mean, are considered outliers. The user-defined function calculates the z-scores for all data points in the numerical columns and returns the indices of the rows that exceed the threshold in at least one column. The result of the adjacent block of code shows a total of 460 outliers in the dataset along with their indices. Using the Pandas “drop()” function, these rows are removed from the dataframe to reduce noise and improve data quality.
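The z-score rule can be seen on a small scale in the sketch below. The toy values and the names toy, flagged, and cleaned are hypothetical and chosen only for illustration; the threshold of 3 matches the one used in the report.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: twenty ordinary sales values and one extreme value
toy = pd.DataFrame({"Sales": [10.0] * 10 + [12.0] * 10 + [500.0]})

# z-score: distance from the column mean in units of the standard deviation
z = np.abs((toy - toy.mean()) / toy.std())

# Rows where any column exceeds the threshold of 3 are flagged as outliers
flagged = toy.index[(z > 3).any(axis=1)].tolist()

# Dropping the flagged rows and renumbering the index
cleaned = toy.drop(index=flagged).reset_index(drop=True)
print(flagged, len(cleaned))
```

Only the extreme value at index 20 exceeds three standard deviations, so a single row is dropped and twenty remain.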

df.loc[8247]

type(outliers)

outliers_data = []
for each_index in outliers:
    outliers_data.append(df.loc[each_index])

outliers_df = pd.DataFrame(outliers_data)

# outliers_df.head()

outliers_df.to_csv('outliers.csv', index = False)

cln_df = df.drop(outliers)

cln_df.reset_index(drop=True, inplace=True)

cln_df.head(5)

cln_df.dtypes

cln_df.drop(
    ['Row ID', 'Order ID', 'Customer ID', 'Customer_no', 'Segment_no', 'State_no',
     'Postal Code', 'Region_no', 'Product ID', 'Category_no', 'Sub-Category_no',
     'Product Name_no'],
    axis = 1, inplace = True)

cln_df.dtypes

cln_df['Order Date'] = pd.to_datetime(cln_df['Order Date'])
cln_df['Ship Date'] = pd.to_datetime(cln_df['Ship Date'])

#label_map = {'False': 0, 'True': 1}
#cln_df['Returned'] = cln_df['Returned'].map(label_map)
#cln_df['Returned'].value_counts()

#cln_df['Returned'].value_counts()

cln_df.to_csv('clean.csv', index = False)

cln_df.describe()

The statistical description of the dataframe uses only the numeric and date-time columns and reports summary statistics, including the mean, for each of them. The mean sales, quantity of goods sold, and profit are $180.53, 3.7, and $23.56 respectively. This indicates that, by selling an average of 3.7 products, the supermarket makes a sale of $180.53 and gains $23.56. The average order and shipping dates are “2016-04-30 17:09:28.659534336” and “2016-05-04 16:02:34.059156736”, a difference of about 3 days and 23 hours, so the supermarket ships goods to its customers approximately 4 days after the order date on average.
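The average shipping delay can also be computed directly per order, which gives the same result as subtracting the mean order date from the mean shipping date. The sketch below uses three hypothetical orders; only the column names “Order Date” and “Ship Date” come from the dataset.

```python
import pandas as pd

# Hypothetical orders; the column names match the dataset
orders = pd.DataFrame({
    "Order Date": pd.to_datetime(["2016-04-28", "2016-04-30", "2016-05-01"]),
    "Ship Date": pd.to_datetime(["2016-05-02", "2016-05-04", "2016-05-05"]),
})

# Per-order delay in whole days, then the average across orders
delay_days = (orders["Ship Date"] - orders["Order Date"]).dt.days
mean_delay = delay_days.mean()
print(mean_delay)
```

Keeping the per-order delays also makes it possible to inspect their spread, which a single difference of means would hide.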

# Counting the occurrences of each category
value_counts = cln_df['Returned'].value_counts()
colors = ['blue', 'red']
# Plotting the value counts of the 'Returned' column
ax = value_counts.plot(kind='bar', color=colors)
# Defining the labels
for i, v in enumerate(value_counts):
    ax.text(i, v + 50, str(v), ha='center', va='bottom')
plt.xlabel('Returned')
plt.ylabel('Frequency')
plt.title('Frequency of Returned Orders')
plt.show()

The “Returned” column stores binary data, “True” and “False”, indicating whether an order was returned by the customer. “False” means the order was not returned, while “True” marks a returned order. A count of the occurrences of each value therefore gives the total numbers of returned and non-returned orders. As the bar plot illustrates, of the 9534 records sold, 766 were returned by customers.

plt.figure(figsize=(10, 6))
plt.scatter(cln_df['Sales'], cln_df['Profit'], alpha=0.5, color='blue')
plt.title('Scatter Plot of Sales vs Profit')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.show()

The relationship between sales and profit is shown in the adjacent figure. With sales on the x-axis and profit on the y-axis, the data points form a cluster that fans out from the origin (0, 0). The cluster is densest near the origin, indicating that most records have both low sales and low profit. Most of the cluster lies above the x-axis (positive profit), while a smaller portion lies below it (negative profit). This implies that, besides making low profits, the organization also incurs losses on some orders. The data points situated far from this cluster are residual outliers still present in the data. A trace of linear dependency between the two variables is visible for sales in the range [0, 600), though the pattern loses its consistency around the point (1500, 300).
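A visual impression of linear dependency can be checked numerically with the Pearson correlation coefficient. The sketch below is not part of the report's analysis; it uses hypothetical values in which profit is an exact fraction of sales, so the coefficient reaches its maximum.

```python
import pandas as pd

# Hypothetical values: profit is exactly 15% of sales
demo = pd.DataFrame({"Sales": [100.0, 200.0, 300.0, 400.0]})
demo["Profit"] = 0.15 * demo["Sales"]

# Pearson correlation: close to +1 for an increasing linear relationship,
# close to 0 when there is no linear relationship
r = demo["Sales"].corr(demo["Profit"])
print(round(r, 4))
```

Applied to the real “Sales” and “Profit” columns, a coefficient well below 1 would be consistent with the scattered cluster described above.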

start_date = '2014-01-03'
end_date = '2017-12-30'
# .copy() makes df_filtered independent of cln_df, so adding a column below is safe
df_filtered = cln_df[(cln_df['Order Date'] >= start_date) & (cln_df['Order Date'] <= end_date)].copy()

# Number of days elapsed since the start date, used as the single predictor
df_filtered['Days'] = (df_filtered['Order Date'] - pd.to_datetime(start_date)).dt.days

X = df_filtered[['Days']]
y = df_filtered['Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = LinearRegression()
reg.fit(X_train, y_train)

Linear regression fits a straight line to the training data, modeling the relationship between the independent variables (features) and a dependent variable (target). It finds this line by minimizing the squared differences between the predicted and actual values. Here the code uses X_train (the “Days” column) as the feature input and y_train (the “Sales” column) as the target. Once trained, the model predicts the target variable for new data points. This method is fundamental for understanding and predicting relationships between variables in various fields, from economics to machine learning.

# Forecasting sales for the next timeframe of equal interval
# np.arange excludes the stop value, so this yields 1458 consecutive day indices
days_next = np.arange((df_filtered['Days'].max() + 1), (df_filtered['Days'].max() + 1459)).reshape(-1, 1)
sales_predicted = reg.predict(days_next)

The above implementation uses NumPy and the fitted scikit-learn model to forecast future sales (sales_predicted) from the historical data (df_filtered). It forecasts sales for the next 1458 days (days_next), because np.arange() excludes its stop value. The forecast serves as a gauge of the sales trend, which helps in anticipating future sales patterns and making informed decisions in business planning and strategy.

# Converting the days back to dates
dates_next = pd.to_datetime(start_date) + pd.to_timedelta(days_next.flatten(), unit='D')

predicted_df = pd.DataFrame({'Date': dates_next, 'Predicted Sales': sales_predicted})

predicted_df

The code produces a Pandas DataFrame named predicted_df. It contains two columns: 'Date', which ranges from December 31, 2017, to December 27, 2021, and 'Predicted Sales', which holds the forecast sales value for each date. The DataFrame contains 1458 rows.
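The length and date range of the forecast horizon follow from the day-index arithmetic, which the sketch below reproduces on its own. It assumes, as stated in the report, that the last observed order date is 2017-12-30; the variable last_day is introduced here for illustration.

```python
import numpy as np
import pandas as pd

start_date = pd.Timestamp("2014-01-03")
# Day index of the last observed order date (assumed to be 2017-12-30)
last_day = (pd.Timestamp("2017-12-30") - start_date).days

# np.arange excludes the stop value, so this yields 1458 day indices
days_next = np.arange(last_day + 1, last_day + 1459)

# Converting the day indices back to calendar dates
dates_next = start_date + pd.to_timedelta(days_next, unit="D")
print(len(days_next), dates_next.min().date(), dates_next.max().date())
```

The forecast therefore covers 1458 consecutive days, the same number of days as the observed period.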

predicted_df['Date'].max() - predicted_df['Date'].min()

# Comparing the first 1458 observed sales values with the 1458 forecast values, position by position
mae = mean_absolute_error(df_filtered['Sales'].iloc[0:1458], predicted_df['Predicted Sales'])
mse = mean_squared_error(df_filtered['Sales'].iloc[0:1458], predicted_df['Predicted Sales'])
rmse = np.sqrt(mse)

print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')

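The results give a mean absolute error (MAE) of 169.05 and a root mean squared error (RMSE) of 235.34. The code computes these by comparing the first 1458 observed sales values in df_filtered with the 1458 forecast values, position by position. Because the two series cover different dates, the metrics indicate the general scale of the error rather than accuracy on held-out data, and the test split (X_test, y_test) is not used in this calculation. Both values are large relative to the mean sales of $180.53, which suggests that the day index alone explains little of the variation in sales.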
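A common alternative is to score the model on the held-out test split, so that each prediction is paired with the actual value for the same row. The sketch below shows this pattern on hypothetical synthetic data; the names X_demo, y_demo, and model, and the generated values, are assumptions for illustration and do not reproduce the report's figures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: a weak downward trend plus noise
rng = np.random.default_rng(42)
X_demo = np.arange(500).reshape(-1, 1)
y_demo = 180.0 - 0.01 * X_demo.ravel() + rng.normal(0.0, 20.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)

# Scoring on rows the model has not seen during fitting
y_hat = model.predict(X_te)
mae_demo = mean_absolute_error(y_te, y_hat)   # mean of |actual - predicted|
mse_demo = mean_squared_error(y_te, y_hat)    # mean of (actual - predicted)^2
rmse_demo = np.sqrt(mse_demo)                 # same units as the target
print(f"MAE={mae_demo:.2f} MSE={mse_demo:.2f} RMSE={rmse_demo:.2f}")
```

In the report's notebook, the equivalent step would be to call reg.predict(X_test) and compare the result with y_test.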

plt.figure(figsize=(10, 6))
plt.plot(predicted_df['Date'], predicted_df['Predicted Sales'], label='Predicted Sales', color='red')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Predicted Sales Over Time')
plt.show()

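A line graph is used to illustrate "Predicted Sales Over Time" for the forecast period from December 31, 2017, to December 27, 2021. Because the model is a linear regression on a single feature (Days), the predicted values lie on a straight line: the trend is constant, its direction is given by the sign of the fitted slope, and the forecast itself contains no day-to-day variability.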

Conclusion

The analysis suggests that the company running the supermarket is likely to see declining sales during the forthcoming period of 2017-12-31 to 2021-12-27. The company gained small profits during the period 2014-01-03 to 2017-12-30, with a decreasing trend in sales. It purchases products from third parties and sells them at an average profit of $23.56. It also made a loss of $653.28 during the timeframe under observation. With an average delay of approximately 4 days between order and shipping, the company receives 799 returned orders while selling a total of 9,534 products. The predicted sales fall from $176 at the beginning of 2018 to below $166 at the end of 2021.

References

Alghushairy, O., Alsini, R., Soule, T. and Ma, X., 2020. A review of local outlier factor algorithms for outlier detection in big data streams. Big Data and Cognitive Computing, 5(1), p.1.
Bauer, J.M., Aarestrup, S.C., Hansen, P.G. and Reisch, L.A., 2022. Nudging more sustainable grocery purchases: behavioural innovations in a supermarket setting. Technological Forecasting and Social Change, 179, p.121605.
Fan, Y., Kou, J. and Liu, J., 2020, January. Research on the influencing factors of customer loyalty in offline supermarket under new retail model. In Proceedings of the 2020 4th International Conference on Management Engineering, Software Engineering and Service Sciences (pp. 216-220).
Ma, S. and Fildes, R., 2021. Retail sales forecasting with meta-learning. European Journal of Operational Research, 288(1), pp.111-128.
Prell, M., Zanini, M.T., Caldieraro, F. and Migueles, C., 2020. Sustainability certifications and product preference. Marketing Intelligence & Planning, 38(7), pp.893-906.
Smiti, A., 2020. A critical overview of outlier detection methods. Computer Science Review, 38, p.100306.
Teoh, T.T. and Rong, Z., 2022. Python for Data Analysis. In Artificial Intelligence with Python (pp. 107-122). Singapore: Springer Singapore.
