## BIG DATA COURSEWORK 2 - C11BD

Submitted by:- Parul Tripathi

Student ID:- H00445770

Date:- 18/03/24

## INTRODUCTION & BACKGROUND

Business analytics has become a crucial tool for organizations to gain insights from their data and make data-driven decisions to improve their performance and profitability. In the context of increasing profits, business analytics can help identify the factors that contribute to profitability, optimize pricing and promotional strategies, improve operational efficiency, and enhance customer segmentation and targeting (Shanks et al., 2010).

In this project, we have been provided with a dataset, Superstore.csv, from a company seeking to leverage data analytics to understand how to increase its profits. The dataset contains information about sales transactions, including product details, customer information, and order characteristics. Our objective is to analyze this data using Python and the Deepnote platform to gain insights into the factors driving profitability and provide recommendations to the company.

To achieve this, we will follow a structured approach that includes data importing, data cleaning, exploratory data analysis, and modeling. We will begin by importing the dataset into a Python notebook and performing necessary data cleaning steps to handle missing values, data entry errors, and outliers (Osborne, 2010). This will ensure the integrity and reliability of the data for subsequent analysis.

Next, we will calculate summary statistics of the cleaned data to gain an overview of the key variables and their distributions (Ott and Longnecker, 2020). This will provide a foundation for further exploratory analysis and modeling.

We will then conduct exploratory data analysis by creating visualizations of the data, including at least one categorical plot (bar chart) and one continuous plot (scatter plot). These visualizations will help us identify patterns, relationships, and potential drivers of profitability (Few, 2009).

Based on the insights gained from the exploratory analysis, we will select an appropriate modeling strategy to demonstrate the factors that most contribute to the profitability of the company. We will consider modeling techniques covered in the tutorials, such as regression. The choice of the modeling approach will be justified based on the nature of the data and the specific business problem at hand.

Throughout the modeling process, we will perform necessary data preparation steps and post-modeling diagnostics to ensure the validity and reliability of the results (Hair, 2010). This may include feature scaling, handling categorical variables, assessing model performance metrics, and conducting residual analysis.

Finally, we will interpret the results of the modeling exercise and provide meaningful insights and recommendations to the client. We will explain the key factors that impact profitability based on the model findings and suggest strategies to optimize these factors for increased profits.

## IMPORTING THE DATA


The first step imports and inspects the Superstore dataset with Python's pandas library. Using the read_csv() function, the "dataset_Superstore.csv" file is read into a DataFrame named df. The DataFrame is a structured and efficient way of storing and manipulating data. Displaying the DataFrame df in an interactive Python environment enables a quick inspection of the imported data, giving an overview of the columns and their values (McKinney, 2010). These steps lay the groundwork for future data exploration, cleaning, and analysis tasks utilising the pandas library's powerful functions and methods.
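As a minimal sketch of that step, the import boils down to the snippet below; an inline CSV string (with invented values) stands in for the real "dataset_Superstore.csv" file, which would be read identically by passing its path to read_csv().

```python
import io
import pandas as pd

# Inline stand-in for "dataset_Superstore.csv"; the real file is read the
# same way with pd.read_csv("dataset_Superstore.csv").
csv_text = """Order ID,Product ID,Customer ID,Sales,Quantity,Discount,Profit
CA-1001,P-01,C-01,261.96,2,0.0,41.91
CA-1002,P-02,C-02,731.94,3,0.2,-64.77
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)               # (rows, columns) of the imported frame
print(df.columns.tolist())    # quick look at the available columns
```

Displaying df in the notebook then renders the first rows for visual inspection.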

## DATA SUMMARY


Data summary is a crucial step in the data analysis process, as it provides a concise and informative overview of the main characteristics and statistical properties of the dataset. When working with the Superstore dataset, calculating summary statistics helps in understanding the distribution, central tendency, and variability of the variables, which is essential for making informed decisions and drawing meaningful conclusions (Mann, 2007).

The df.info() function displays a concise summary of the DataFrame df. It contains essential information about the DataFrame's structure and contents, such as column names, data types, non-null counts, and memory usage. This code is typically used as one of the first steps in exploratory data analysis to quickly grasp the dataset before moving on to more data preprocessing, cleaning, or analysis tasks.

### FORMATTING THE DATA


After calculating the summary statistics, data formatting is a crucial step in the data preparation phase, as it helps in transforming the raw data into a consistent, reliable, and analysis-ready format (Wickham, 2014). The provided code uses Python's Pandas library to perform data type conversions on specific columns in a DataFrame. The conversions include changing the 'Ship Mode', 'Category', and 'Sub-Category' columns to the categorical data type for efficient storage and computation, converting 'Order Date' and 'Ship Date' columns to datetime data type for easier manipulation and extraction of date-related information, and converting 'Order ID', 'Product ID', and 'Customer ID' columns to string data type to ensure they are treated as textual data rather than numeric values.
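The conversions described above can be sketched as follows; the toy frame (invented values) stands in for the real dataset, and the column names are those named in the text.

```python
import pandas as pd

# Toy frame with the same columns the notebook converts (invented data).
df = pd.DataFrame({
    "Ship Mode": ["Second Class", "Standard Class"],
    "Category": ["Furniture", "Office Supplies"],
    "Sub-Category": ["Bookcases", "Binders"],
    "Order Date": ["2016-11-08", "2016-11-09"],
    "Ship Date": ["2016-11-11", "2016-11-12"],
    "Order ID": [1001, 1002],
    "Product ID": [501, 502],
    "Customer ID": [9001, 9002],
})

# Categorical columns: compact storage and faster grouping operations.
for col in ["Ship Mode", "Category", "Sub-Category"]:
    df[col] = df[col].astype("category")

# Datetime columns: enable the .dt accessor (year, month, date arithmetic).
for col in ["Order Date", "Ship Date"]:
    df[col] = pd.to_datetime(df[col])

# Identifier columns: treat as text rather than numbers.
for col in ["Order ID", "Product ID", "Customer ID"]:
    df[col] = df[col].astype(str)

print(df.dtypes)
```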

### MISSING VALUES


Missing values can have an impact on both the analysis and model performance. Using df.isna().sum(), we can quickly assess the data's completeness and determine whether any columns have missing values that must be addressed. This call returns the number of missing values for each column, showing which columns require additional investigation or data-cleaning steps, such as dropping rows with missing values or imputing them using appropriate techniques (Allison, 2002).
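A minimal illustration of the check and one possible remedy, on an invented frame with a deliberate gap; filling 'Discount' with 0 is a judgment call, not the only valid choice.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate gaps (invented stand-in for the Superstore data).
df = pd.DataFrame({"Sales": [261.96, np.nan, 14.62],
                   "Discount": [0.0, 0.2, np.nan]})

missing = df.isna().sum()    # missing-value count per column
print(missing)

# One possible remedy: impute a neutral value for the discount.
df["Discount"] = df["Discount"].fillna(0)
print(df["Discount"].isna().sum())    # 0 after imputation
```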


In many cases, the 'Row ID' column is expected to have unique values, acting as a distinct identifier for each row in the DataFrame. Checking the uniqueness of values in this column ensures data integrity and identifies any potential issues with duplicated rows (Van den Broeck et al., 2005). If df['Row ID'].is_unique returns True, it means that all values in the 'Row ID' column are unique, and each row in the DataFrame has its own identifier. If it returns False, it means there are duplicate values in the 'Row ID' column that need to be investigated further or cleaned up.


This code selects the column 'Order ID' from the DataFrame df. The values of the 'Order ID' column can be retrieved as a Series object using df['Order ID']. This enables us to perform operations or analysis on the 'Order ID' column, such as determining its uniqueness, calculating statistics, or filtering the DataFrame based on specific order IDs.


Checking for unique values in the 'Customer ID' column is an important data validation step when working with customer-related data. Each customer should have a single identifier that distinguishes them from other customers, although the same customer may legitimately appear on many order rows, so duplicates here are not necessarily errors.

The code df['Customer ID'].is_unique checks whether all the values in the 'Customer ID' column of the DataFrame df are unique or if there are any duplicates. It returns True if all the values are unique, meaning there are no duplicate customer IDs, and False if there are any duplicate values in the 'Customer ID' column.


In most cases, each product should have a distinct identifier that sets it apart from other products.

This code determines whether all of the values in the 'Product ID' column of the DataFrame df are unique or if any are duplicates. It returns True if all of the values are unique, indicating that there are no duplicate product IDs, and False if there are any duplicate values in the 'Product ID' column.
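The three uniqueness checks described above can be illustrated together on one invented frame, where 'Row ID' is a true per-row identifier but the same customer appears on two rows.

```python
import pandas as pd

# Toy frame: 'Row ID' unique, 'Customer ID' repeats (invented data).
df = pd.DataFrame({
    "Row ID": [1, 2, 3],
    "Customer ID": ["C-01", "C-01", "C-02"],
    "Product ID": ["P-01", "P-02", "P-03"],
})

print(df["Row ID"].is_unique)        # True: a valid per-row identifier
print(df["Customer ID"].is_unique)   # False: one customer placed two orders
print(df["Product ID"].is_unique)    # True in this toy frame
```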


This code creates a new column named 'order_ship_date' in the DataFrame df. The values in this column are determined by comparing the 'Order Date' and 'Ship Date' columns. If the 'Order Date' is earlier than the 'Ship Date', the corresponding value in 'order_ship_date' will be True, indicating that the order was placed before the shipping date. Otherwise, if the 'Order Date' is later than or equal to the 'Ship Date', the value in 'order_ship_date' will be False.

This code selects a subset of rows from the DataFrame df based on two conditions using boolean indexing. The first condition (df['order_ship_date'] == False) selects rows where the value in the 'order_ship_date' column is False, meaning the 'Order Date' is not earlier than the 'Ship Date'. The second condition (df['Ship Mode'] != 'Same Day') selects rows where the value in the 'Ship Mode' column is not equal to 'Same Day'.
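Both steps can be sketched compactly on invented dates: one order ships four days after it is placed, the other is a legitimate same-day shipment.

```python
import pandas as pd

# Toy orders; the second row ships the same day it is ordered (invented data).
df = pd.DataFrame({
    "Order Date": pd.to_datetime(["2017-01-02", "2017-01-05"]),
    "Ship Date": pd.to_datetime(["2017-01-06", "2017-01-05"]),
    "Ship Mode": ["Standard Class", "Same Day"],
})

# Flag rows where the order genuinely precedes the shipment.
df["order_ship_date"] = df["Order Date"] < df["Ship Date"]

# Suspicious rows: not ordered before shipping, yet not marked 'Same Day'.
suspect = df[(df["order_ship_date"] == False) & (df["Ship Mode"] != "Same Day")]
print(len(suspect))   # 0 here: the only equal-date row is a same-day order
```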


This code selects a subset of rows from the DataFrame df based on two conditions using boolean indexing. The first condition (df['Sales'] == 0) selects rows where the value in the 'Sales' column is equal to 0, indicating that no revenue was generated for that particular order or product. The second condition (df['Quantity'] == 0) selects rows where the value in the 'Quantity' column is equal to 0, indicating that no units were sold or ordered.

### CHECKING DUPLICATE DATA


The first line of code creates a new Series called combined_columns by concatenating the values from the 'Order ID', 'Product ID', and 'Customer ID' columns. The astype(str) function is used to convert the values to strings before concatenation to ensure consistent data types.

The second line of code selects rows from the DataFrame df where the combined_columns have duplicate values. The duplicated() function is used to identify duplicate values in the combined_columns Series. The keep=False parameter ensures that all duplicate rows are selected, not just the first occurrence (Nauman and Herschel, 2022). The resulting DataFrame duplicate_records contains all the rows that have duplicate values based on the combination of 'Order ID', 'Product ID', and 'Customer ID'.

By selecting the duplicate records using duplicate_records = df[combined_columns.duplicated(keep=False)], we can isolate the rows that have duplicate values based on the combination of 'Order ID', 'Product ID', and 'Customer ID'.
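On a toy frame with one deliberately repeated (order, product, customer) triple, the two lines behave as described: keep=False flags both copies of the duplicate row.

```python
import pandas as pd

# Toy frame: rows 0 and 1 share the same Order/Product/Customer triple.
df = pd.DataFrame({
    "Order ID":    ["O1", "O1", "O2"],
    "Product ID":  ["P1", "P1", "P1"],
    "Customer ID": ["C1", "C1", "C2"],
})

combined_columns = (df["Order ID"].astype(str)
                    + df["Product ID"].astype(str)
                    + df["Customer ID"].astype(str))

# keep=False marks every member of a duplicate group, not just the repeats.
duplicate_records = df[combined_columns.duplicated(keep=False)]
print(len(duplicate_records))   # 2: both copies of the O1/P1/C1 row
```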

### REMOVING DUPLICATE DATA


The first line of code defines a list called relevant_columns that specifies the columns to be considered when identifying duplicate records. In this case, the combination of 'Order ID', 'Product ID', and 'Customer ID' is used to determine uniqueness.

The second line of code drops the duplicate records from the DataFrame df based on the combination of columns specified in relevant_columns. The drop_duplicates() function is used to remove duplicate rows, keeping only the first occurrence of each duplicate group. The subset parameter is set to relevant_columns to indicate the columns to be considered for identifying duplicates. The keep='first' parameter ensures that only the first occurrence of each duplicate group is kept, and the subsequent duplicates are removed.
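A minimal sketch of the de-duplication on the same kind of toy frame: of the two identical triples, only the first row survives.

```python
import pandas as pd

df = pd.DataFrame({
    "Order ID":    ["O1", "O1", "O2"],
    "Product ID":  ["P1", "P1", "P1"],
    "Customer ID": ["C1", "C1", "C2"],
    "Sales":       [100.0, 100.0, 55.0],
})

relevant_columns = ["Order ID", "Product ID", "Customer ID"]

# keep='first' retains the first row of each duplicate group, drops the rest.
df = df.drop_duplicates(subset=relevant_columns, keep="first")
print(len(df))   # 2 rows remain
```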


Calculating descriptive statistics is a fundamental step in exploratory data analysis (EDA) and provides a quick overview of the distribution and central tendencies of numerical variables in a dataset (Grolemund and Wickham, 2014).

The code defines a list called numerical_columns, which includes 'Sales', 'Quantity', and 'Discount' columns for calculating descriptive statistics. The describe() function calculates summary statistics for each numerical column, including count, mean, standard deviation, minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum, using the df[numerical_columns] syntax.
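Illustrated on invented numbers:

```python
import pandas as pd

# Invented stand-in values for the three numerical columns.
df = pd.DataFrame({
    "Sales":    [100.0, 200.0, 300.0, 400.0],
    "Quantity": [1, 2, 3, 4],
    "Discount": [0.0, 0.1, 0.2, 0.3],
})

numerical_columns = ["Sales", "Quantity", "Discount"]
stats = df[numerical_columns].describe()  # count, mean, std, min, quartiles, max
print(stats.loc["mean", "Sales"])         # 250.0 for this toy frame
```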


This line of code selects rows from the DataFrame df where the value in the 'Quantity' column is greater than 100. It uses boolean indexing to filter the DataFrame based on the specified condition.

### OUTLIER THRESHOLD


The code snippet applies a simple outlier detection rule: a value is flagged if it lies more than three standard deviations from the mean. For each numerical column, it computes the mean and standard deviation, defines lower and upper bounds, filters the dataset for rows outside these bounds, and prints the outlier values for each column. Identifying and addressing such irregularities is crucial before statistical analysis or machine learning (Hawkins, 2013).
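A sketch of the 3-standard-deviation rule on an invented column. Note that with very small samples a single extreme value can inflate the standard deviation enough to hide itself, so the toy column is made long enough for the rule to fire.

```python
import pandas as pd

# Toy column: nineteen ordinary values plus one extreme sale (invented data).
df = pd.DataFrame({"Sales": [10.0] * 19 + [500.0]})

outliers = {}
for col in ["Sales"]:
    mean, std = df[col].mean(), df[col].std()
    lower, upper = mean - 3 * std, mean + 3 * std    # 3-sigma bounds
    mask = (df[col] < lower) | (df[col] > upper)
    outliers[col] = df.loc[mask, col].tolist()

print(outliers)   # the 500.0 row is flagged
```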

### CORRELATION AMONGST VARIABLES


Correlation matrices and heatmap visualizations are commonly used in exploratory data analysis (EDA) to identify and visualize the relationships between variables (Friendly & Kwan, 2003). The heatmap provides a quick and intuitive way to assess the strength and direction of correlations between variables (Wilkinson & Friendly, 2009).

Analysts can gain insights into the relationships between variables by examining the correlation matrix and heatmap. Strong positive or negative correlations may indicate potential associations or dependencies between variables, which can be further investigated or considered in subsequent analyses (Salkind, 2010).
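The matrix itself is a one-liner in pandas; the sketch below computes it on invented columns where profit falls exactly as discount rises, and leaves the heatmap rendering (done with seaborn in the notebook) as a comment.

```python
import pandas as pd

# Invented numeric columns; Profit moves inversely with Discount by design.
df = pd.DataFrame({
    "Sales":    [100.0, 200.0, 300.0, 400.0],
    "Discount": [0.0, 0.1, 0.2, 0.3],
    "Profit":   [40.0, 30.0, 20.0, 10.0],
})

corr = df.corr()                          # pairwise Pearson correlations
print(corr.loc["Discount", "Profit"])     # ~ -1.0: perfectly negative here
# In the notebook this matrix is rendered with sns.heatmap(corr, annot=True).
```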


The code imports the Superstore dataset from a CSV file using pandas' read_csv() function. It checks for missing values using isnull().sum() and fills missing values in the 'Discount' and 'Returned' columns, ensuring data completeness for further analysis.

The code uses boxplots to identify outliers in the 'Sales', 'Quantity', 'Discount', and 'Profit' columns. It then removes extreme values using boolean indexing, eliminating outliers that may skew the analysis and lead to misleading results (Aggarwal & Yu, 2001).

The code conducts exploratory data analysis on the Superstore dataset: it calculates summary statistics, creates a categorical plot to visualize average profit by customer segment, and creates a continuous plot to explore the relationship between 'Discount' and 'Profit', identifying patterns, trends, and potential relationships (Unwin, 2015).

The code creates a linear regression model using scikit-learn (Pedregosa et al., 2011) to identify key profit drivers, using 'Discount' and 'Quantity' as independent variables and 'Profit' as the dependent variable. The model is trained using model.fit(), and the R-squared value indicates the proportion of variance explained by the model. This analysis quantifies the impact of 'Discount' and 'Quantity' on 'Profit', providing insights into profitability factors (Montgomery et al., 2012).

### LINEAR REGRESSION


The linear regression cell for the Superstore dataset proceeds as follows.

Data preparation:

- Reads 'dataset_Superstore.csv' into a pandas DataFrame.
- Fills missing values in the 'Discount' and 'Returned' columns with 0 and False respectively.
- Removes extreme values in the 'Sales', 'Quantity', and 'Profit' columns using boolean indexing (Van der Loo, 2014).

Feature selection and target variable:

- Selects 'Discount' and 'Quantity' as independent variables and 'Profit' as the dependent variable (Ziegler, 2019).

Data splitting and scaling:

- Splits the data into training and testing sets using train_test_split() from scikit-learn.
- Scales the features using StandardScaler().

Model training and prediction:

- Creates a linear regression model using LinearRegression() from scikit-learn.
- Trains the model on the scaled training data and makes predictions on the scaled testing data (James et al., 2013).

Model evaluation:

- Evaluates the model's performance using R-squared, mean squared error, and root mean squared error.
- Prints the coefficients and intercept of the linear regression model.

Visualization:

- Creates a scatter plot to visualize the relationship between actual and predicted profit values (Hunter, 2007).
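The steps above can be sketched end to end. Synthetic data replaces the real file, and the coefficients used to generate the synthetic profit (70, -250, 2) are arbitrary choices for illustration, not values from the report.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned Superstore data.
rng = np.random.default_rng(0)
n = 200
discount = rng.uniform(0, 0.5, n)
quantity = rng.integers(1, 10, n)
# Profit hurt by discounts, helped slightly by quantity, plus noise.
profit = 70 - 250 * discount + 2 * quantity + rng.normal(0, 5, n)
df = pd.DataFrame({"Discount": discount, "Quantity": quantity, "Profit": profit})

X, y = df[["Discount", "Quantity"]], df["Profit"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # fit on train only: no leakage
X_test_s = scaler.transform(X_test)

model = LinearRegression().fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(round(r2, 3), round(rmse, 2))
print(model.coef_, model.intercept_)        # standardized coefficients
```

Because the synthetic signal is strong, the R-squared here is high; on the real data the report finds a much lower value.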

### CROSS-VALIDATION


Based on the cross-validation results, the linear regression model seems to have a relatively poor performance. The mean cross-validated R-squared of approximately 0.021 is quite low, indicating that the model explains only a small portion of the variance in the target variable. The inconsistency in the R-squared scores across the folds, with some scores being very low or even negative, suggests that the model may not generalize well to unseen data.

These results imply that the linear regression model, with the current set of features, may not be the best fit for the given problem. It might be necessary to explore other feature engineering techniques, consider non-linear relationships, or try alternative modeling approaches to improve the model's performance (Hastie et al., 2009).
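The cross-validation call itself can be sketched on synthetic data; the signal strength below is arbitrary, so the fold scores will not match the report's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the two (scaled) features and the target.
rng = np.random.default_rng(1)
n = 200
X = rng.uniform(0, 1, (n, 2))
y = 5 - 3 * X[:, 0] + rng.normal(0, 2, n)   # weak signal plus noise

# Five-fold cross-validated R-squared: one score per held-out fold.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.round(3))
print(round(scores.mean(), 3))   # the mean the report quotes as ~0.021
```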

### Visualizing Relationships between Variables using Scatter Plots


1) Scatter Plot: Sales vs. Profit: The scatter plot is a tool used to visualize the relationship between 'Sales' and 'Profit' variables (Waskom, 2021). It uses the sns.scatterplot() function from the Seaborn library, with the data parameter specifying the DataFrame (df) and the x and y parameters specifying the variables to be plotted. The plot helps identify patterns or correlations between sales and profit, allowing for better understanding of data distribution.

2) Scatter Plot: Discount vs. Profit (Color-coded by Category): The second scatter plot shows the relationship between 'Discount' and 'Profit' variables, color-coded by the 'Category' variable. The sns.scatterplot() function is used to create the plot, with the hue parameter set to 'Category' and the palette parameter set to 'Set2'. The plot displays the distribution of data points, allowing for identification of patterns or differences in the relationship between discount and profit across different product categories (Wickham, 2010).

3) Plot Customization: The code sets the figure size, sets title and labels for x-axis and y-axis using plt.title(), plt.xlabel(), and plt.ylabel() functions, and displays the plots using plt.show().
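The two plots can be sketched with plain matplotlib as below; the notebook uses seaborn's sns.scatterplot, whose hue= parameter does the per-category coloring in one call. The data is invented.

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows standing in for the Superstore data.
df = pd.DataFrame({
    "Sales":    [100, 200, 300, 400],
    "Profit":   [40, -10, 60, 90],
    "Discount": [0.0, 0.4, 0.1, 0.0],
    "Category": ["Furniture", "Technology", "Furniture", "Technology"],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Plot 1: Sales vs. Profit.
axes[0].scatter(df["Sales"], df["Profit"])
axes[0].set(title="Sales vs. Profit", xlabel="Sales", ylabel="Profit")

# Plot 2: Discount vs. Profit, color-coded by Category.
for cat, grp in df.groupby("Category"):
    axes[1].scatter(grp["Discount"], grp["Profit"], label=cat)
axes[1].set(title="Discount vs. Profit", xlabel="Discount", ylabel="Profit")
axes[1].legend()

fig.savefig("scatter_plots.png")
```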

### Customer Segmentation


INTERPRETATION:

Cluster 1 center: [148.89, -102.37, 3.84, 0.67]. This cluster represents customers with relatively low sales (148.89), negative profit (-102.37), an average quantity of 3.84, and a high discount rate of 0.67. These customers are likely to be unprofitable and receive high discounts.

Cluster 2 center: [126.47, 21.33, 2.53, 0.098]. This cluster represents customers with moderate sales (126.47), positive profit (21.33), an average quantity of 2.53, and a low discount rate of 0.098. These customers are profitable and receive minimal discounts.

Cluster 3 center: [408.59, 70.10, 6.45, 0.094]. This cluster represents customers with relatively high sales (408.59), positive profit (70.10), an average quantity of 6.45, and a low discount rate of 0.094. These customers are highly profitable and receive minimal discounts.

Cluster 4 center: [7685.18, 2610.22, 5.19, 0.070]. This cluster represents customers with extremely high sales (7685.18), extremely high profit (2610.22), an average quantity of 5.19, and a very low discount rate of 0.070. These customers are the most profitable and receive the lowest discounts.
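A minimal K-means sketch on invented customer rows [Sales, Profit, Quantity, Discount] that form two obvious groups, loosely echoing the loss-making and high-profit clusters above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented customer rows: [Sales, Profit, Quantity, Discount], two clear groups.
X = np.array([
    [150, -100, 4, 0.60], [140, -110, 4, 0.70], [160, -95, 3, 0.65],
    [7500, 2600, 5, 0.07], [7700, 2650, 5, 0.08], [7600, 2580, 6, 0.06],
], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.round(2))  # one loss-making center, one high-profit center
print(km.labels_)                    # cluster assignment per customer row
```

Because the notebook clusters the raw columns, the Sales axis dominates the reported centers; standardizing the features first would weight all four equally.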

### Predicting Profit


The R-squared value of 0.082 indicates that the linear regression model explains 8.2% of the variance in the profit data. While this value is relatively low, it suggests that the selected features (Discount and Quantity) have some influence on profitability.

The discount coefficient is -254.26, which indicates that for every unit increase in discount, the profit decreases by $254.26, holding quantity constant. This suggests that higher discounts have a negative impact on profitability.

The quantity coefficient of 0.01 indicates that for every unit increase in quantity, the profit increases by $0.01, holding the discount constant. The impact of quantity on profit is therefore small compared to the discount.

The intercept of 68.98 represents the expected profit when both discount and quantity are zero.

### Evaluating Linear Regression Model


The residual plot is created to assess the linearity and homoscedasticity assumptions of linear regression. The plot shows the residuals (differences between actual and predicted values) against the predicted values. Ideally, the residuals should be randomly scattered around the zero line without any clear patterns (Maddala et al., 2010).
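A sketch of the diagnostic on synthetic data: with an intercept term, OLS residuals average to zero by construction, and a healthy plot shows no pattern around that line.

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic two-feature regression problem (arbitrary coefficients).
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

plt.scatter(model.predict(X), residuals)
plt.axhline(0, linestyle="--")   # residuals should straddle this at random
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual plot")
plt.savefig("residuals.png")

print(round(residuals.mean(), 10))   # ~0: a property of OLS with intercept
```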

### Visualizing Feature Importances


The plot allows us to compare the relative importance of the features. The feature with the larger absolute coefficient value is considered more important in predicting the target variable. In this case, the plot shows that the 'Discount' feature has a stronger impact on profit compared to the 'Quantity' feature.

By visualizing the feature importance, we can gain insights into which features have the most significant influence on the target variable and make informed decisions based on these findings.
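A sketch of the comparison on synthetic data: scaling the features first makes the coefficient magnitudes comparable across features, and taking absolute values gives a simple importance bar chart. The generating coefficients are arbitrary, chosen so that discount dominates as in the report.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic Discount and Quantity features; profit dominated by the discount.
rng = np.random.default_rng(3)
X = np.column_stack([rng.uniform(0, 0.5, 150),     # Discount
                     rng.integers(1, 10, 150)])    # Quantity
y = 70 - 250 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 5, 150)

# Scale first so the coefficient magnitudes are directly comparable.
Xs = StandardScaler().fit_transform(X)
model = LinearRegression().fit(Xs, y)

features = ["Discount", "Quantity"]
importance = np.abs(model.coef_)
plt.bar(features, importance)
plt.ylabel("|standardized coefficient|")
plt.title("Feature importance (linear model)")
plt.savefig("importance.png")

print(features[importance.argmax()])   # Discount dominates in this setup
```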

### Profitability Trend


The cross-validated R-squared scores show the model's performance on different subsets of the data, with an average score of 0.021. This low mean cross-validated R-squared indicates that the model explains only 2.1% of the variance in the out-of-sample data, suggesting poor performance and limited generalization ability.

These results suggest that the chosen features and models may not be sufficient to accurately predict the target variable ('Profit') in the Superstore dataset. Further feature engineering, model selection, and hyperparameter tuning may be necessary to improve the model's performance.

### CONCLUSION:

Big data analytics can significantly aid the company in increasing profits through various data-driven strategies. Customer segmentation analysis enables targeted marketing, tailored offerings, and optimized pricing to better serve distinct customer groups based on their behaviors and profitability. Evaluating product category performance highlights opportunities to prioritize high-profit product lines and streamline underperforming ones. Findings on the negative impact of excessive discounting can guide more judicious discount policies. Analyzing yearly profitability trends provides insights into seasonality and external factors affecting performance over time, informing operational adjustments. Furthermore, assessing the relative importance of features like discounts and quantity sold can direct strategic efforts toward the most influential profit drivers. While modeling limitations exist, the comprehensive application of big data analytics techniques generates valuable, actionable insights to enhance decision-making, operational efficiencies, and ultimately, increase overall profitability for the company.

### REFERENCES:

Aggarwal, C. C., & Yu, P. S. (2001). Outlier detection for high dimensional data. Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, 37-46.

Davenport, Thomas. “Competing on Analytics: The New Science of Winning.” Choice Reviews Online, vol. 44, no. 11, 1 July 2007, p. 44-6322, https://doi.org/10.5860/choice.44-6322.

Few, Stephen. Now You See It: Simple Visualization Techniques for Quantitative Analysis. Oakland, Calif., Analytics Press, 2009.

Friendly, M., & Kwan, E. (2003). Effect ordering for data displays. Computational Statistics & Data Analysis, 43(4), 509-539.

Grolemund, Garrett, and Hadley Wickham. “A Cognitive Interpretation of Data Analysis.” International Statistical Review, vol. 82, no. 2, 11 Apr. 2014, pp. 184–204, https://doi.org/10.1111/insr.12028.

Hair, Joseph F. Multivariate Data Analysis. Pearson, 2010.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer Science & Business Media.

Hawkins, D. Identification of Outliers. Springer Science & Business Media, 17 Apr. 2013.

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90-95.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. Springer.

Maddala, G. S., et al. Introduction to Econometrics. Chichester, England, Wiley, 2010.

Mann, P. S. (2007). Introductory statistics. John Wiley & Sons.

McKinney, W. (2010). Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51-56)

Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to linear regression analysis. John Wiley & Sons.

Nauman, Felix, and Melanie Herschel. An Introduction to Duplicate Detection. Springer Nature, 1 June 2022.

Osborne, Jason W. “Data Cleaning Basics: Best Practices in Dealing with Extreme Scores.” Newborn and Infant Nursing Reviews, vol. 10, no. 1, Mar. 2010, pp. 37–43, https://doi.org/10.1053/j.nainr.2009.12.009. Accessed 29 May 2020.

Ott, Lyman, and Michael Longnecker. An Introduction to Statistical Methods & Data Analysis. Pacific Grove, Brooks/Cole, 2020.

Allison, Paul David. Missing Data. Thousand Oaks, Calif., Sage Publications, 2002.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

Salkind, N. J. (2010). Encyclopedia of research design. SAGE Publications.

Shanks, Graeme, et al. The Impact of Strategy and Maturity on Business Analytics and Firm Performance: A Review and Research Agenda. 1 Dec. 2010, p. 1. Accessed 15 Mar. 2024.

Unwin, A. (2015). Graphical data analysis with R. CRC Press.

Van den Broeck, Jan, et al. “Data Cleaning: Detecting, Diagnosing, and Editing Data Abnormalities.” PLoS Medicine, vol. 2, no. 10, 6 Sept. 2005, p. e267, https://doi.org/10.1371/journal.pmed.0020267.

Van der Loo, M. P. (2014). The stringdist package for approximate string matching. The R Journal, 6(1), 111-122

Waskom, M. L. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021.

Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3-28.

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23.

Wilkinson, L., & Friendly, M. (2009). The history of the cluster heat map. The American Statistician, 63(2), 179-184.

Ziegler, Andreas. “An Introduction to Statistical Learning with Applications in R. G. James, D. Witten, T. Hastie, and R. Tibshirani (2013). Berlin: Springer.”