INCREASING PROFITS THROUGH DATA ANALYTICS

BIG DATA ANALYTICS-C11BD

INTRODUCTION

The document "Increasing Profits through Data Analytics" offers a detailed examination of the dataset_Superstore.csv using a variety of data analytics methods. Its main goal is to pinpoint the factors that influence the company's profitability and offer data-driven insights for strategic decision-making. It underscores the significance of data analytics in today's business landscape, where informed decisions based on data are crucial for operational enhancement, revenue growth, and competitive advantage.

This document showcases how data analytics techniques are practically applied to real-world business situations, serving as a valuable tool for understanding how data-driven insights can bolster a company's expansion and enhance long-term profitability in a competitive market. The analysis delves into uncovering trends and patterns that impact profitability through data cleaning, summary statistics examination, and visualization creation. Moreover, modeling techniques are utilized to identify the primary drivers of profitability and provide recommendations based on the analysis results.

Data analysis supports informed business decisions: by extracting insights from their data, businesses can make strategic decisions and use their resources effectively. The sections below evaluate the modeling strategy and interpret the results in the context of real business challenges.

METHODOLOGY

To ensure the accuracy and reliability of the findings, the analysis follows a structured methodology. The initial step is data cleaning, which improves the quality and consistency of the dataset by removing duplicate records and rows with missing values and by flagging outliers and possible data entry errors. This step is crucial because it lays the groundwork for precise and meaningful insights.

Subsequently, summary statistics are computed for the cleaned data to describe its distribution, variability, and composition. Relationships between variables are then explored through visualizations such as bar charts for categorical variables and scatter plots for continuous variables.

The primary modeling technique is K-means clustering, which is used to identify segments based on purchasing behavior and to assess each cluster's profitability. The elbow method is used to choose the number of clusters, while the Davies-Bouldin Index and Silhouette Score are used to evaluate the clustering outcomes. Additionally, the Kruskal-Wallis test is used to check whether profit differs significantly among clusters, which indicates how informative the segmentation is for understanding profitability trends.
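
As a minimal sketch of the Kruskal-Wallis step described above, the example below runs the test on synthetic profit samples for three hypothetical clusters. The sample values and the dictionary name `profit_by_cluster` are assumptions made for illustration and do not come from the Superstore data.

```python
import numpy as np
from scipy.stats import kruskal

# Synthetic profit samples for three hypothetical clusters (illustration only)
rng = np.random.default_rng(0)
profit_by_cluster = {
    0: rng.normal(loc=10.0, scale=5.0, size=200),
    1: rng.normal(loc=12.0, scale=5.0, size=200),
    2: rng.normal(loc=40.0, scale=5.0, size=200),
}

# Kruskal-Wallis H-test: a rank-based test of whether all groups
# come from the same distribution (no normality assumption)
h_statistic, p_value = kruskal(*profit_by_cluster.values())
print(f"H = {h_statistic:.2f}, p = {p_value:.3g}")

# A small p-value suggests that at least one cluster differs from the others
significant = p_value < 0.05
print("Significant difference between clusters:", significant)
```

With the real data, the same call could be fed one array of 'Profit' values per cluster label, for example by iterating over `data.groupby('Cluster')` once the cluster labels exist. Because the test is rank-based, it compares distributions rather than arithmetic means.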

IMPORTING FUNCTIONS

Let's import the relevant libraries and functions to get the analysis started for the given data.

# Import the libraries needed for the data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import kruskal
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import silhouette_score, davies_bouldin_score

# This cell imports the pandas, numpy, matplotlib, seaborn, scipy, and scikit-learn
# libraries required for the analysis. train_test_split, KMeans, StandardScaler,
# DecisionTreeClassifier, accuracy_score, classification_report, kruskal,
# silhouette_score, davies_bouldin_score, and stats are the specific modules and
# functions imported from these libraries.
# If more libraries are required, they can be imported later.


By assisting with data analysis, model construction, and performance evaluation, these features and tools make our work simpler and more effective.

IMPORTING THE GIVEN DATA (dataset_Superstore.csv)

# Load the CSV file 'dataset_Superstore.csv' into a DataFrame named 'data'
# using the read_csv() function from the pandas library
data = pd.read_csv('dataset_Superstore.csv')

# Display the first few rows to confirm that the data has been loaded properly
data.head()


DISPLAYING THE WHOLE DATASET

# Evaluating 'data' on its own displays the whole DataFrame
data


DISPLAYING THE DATASET SUMMARY

# info() prints a brief overview of the DataFrame 'data': column names,
# data types, non-null counts, and memory usage.
# It prints directly and returns None, so it is not wrapped in print().
data.info()


DATA TYPE CONVERSION AND COLUMN CLEANING

# Convert "Ship Mode", "Category", and "Sub-Category" to the categorical data type
category_columns = ['Ship Mode', 'Category', 'Sub-Category']
data[category_columns] = data[category_columns].astype('category')

# Convert "Order Date" and "Ship Date" (day/month/year) to datetime format
data['Order Date'] = pd.to_datetime(data['Order Date'], format='%d/%m/%Y')
data['Ship Date'] = pd.to_datetime(data['Ship Date'], format='%d/%m/%Y')

# Convert "Order ID", "Product ID", and "Customer ID" to string format
id_columns = ['Order ID', 'Product ID', 'Customer ID']
data[id_columns] = data[id_columns].astype('str')

# Display the DataFrame, which now holds the required data types
data


REPORTING MISSING VALUES IN THE DATA

# Count the missing values in each column of the DataFrame 'data'
# using isnull() followed by sum()
missing_values = data.isnull().sum()

# Print the number of missing values for each column
print("Missing values in each column:")
print(missing_values)


CHECKING UNIQUENESS OF ROW IDs AND ACCESSING ORDER IDs

# Check whether the values in the 'Row ID' column are unique
# by using the 'is_unique' attribute
is_unique_row_id = data['Row ID'].is_unique
print("Are values in 'Row ID' column unique?", is_unique_row_id)

# Access the 'Order ID' column, assign it to 'order_ids', and print its values
order_ids = data['Order ID']
print("Order IDs:", order_ids)


REMOVING DUPLICATES AND DETECTING OUTLIERS

# Identify duplicate records based on the combination of
# 'Order ID', 'Product ID', and 'Customer ID'
relevant_columns = ['Order ID', 'Product ID', 'Customer ID']
duplicate_records = data[data.duplicated(subset=relevant_columns, keep=False)]
print("Duplicate records based on combined columns:\n", duplicate_records)

# Remove the duplicates, keeping the first instance of each record
data = data.drop_duplicates(subset=relevant_columns, keep='first')
print("Duplicate records removed.")

# Flag outliers in 'Sales', 'Quantity', and 'Discount' as values that lie
# more than three standard deviations from the mean, then print them
outlier_threshold = 3
numerical_columns = ['Sales', 'Quantity', 'Discount']
for column in numerical_columns:
    mean = data[column].mean()
    std = data[column].std()
    lower_bound = mean - outlier_threshold * std
    upper_bound = mean + outlier_threshold * std
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    print(f"\nOutliers in {column}:")
    print(outliers)


PLOTTING OUTLIERS FOR SALES & QUANTITY

# Flag outliers with the IQR rule: any value above Q3 + 1.5 * IQR
# or below Q1 - 1.5 * IQR is treated as an outlier
for column in ['Sales', 'Quantity']:
    q1, q3 = data[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = data[(data[column] > upper) | (data[column] < lower)]

    # Draw the boxplot and mark the outliers in red on top of it.
    # The horizontal boxplot sits at y = 0, so the outliers are plotted there
    # (plotting them against the row index would not match the boxplot's axes).
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=data[column])
    plt.scatter(outliers[column], np.zeros(len(outliers)), color='red', zorder=3)
    plt.title(f'Outliers in "{column}" column')
    plt.show()

# Explanation
# The boxplot makes the outliers easy to see by displaying both the
# distribution of the data points and the flagged values.


CLEANING AND ANALYSING THE DATA

# Initial exploration: summary statistics for the numerical columns
# "Sales", "Quantity", and "Discount" describe central tendency and distribution
numerical_columns = ['Sales', 'Quantity', 'Discount']
descriptive_statistics = data[numerical_columns].describe()
print(descriptive_statistics)

# Handling missing values: rows containing missing values are removed
# to keep the dataset consistent
cleaned_data = data.dropna()

# Verifying the data: inspect the first few rows and the shape
# (number of rows and columns) of the cleaned dataset
print('Cleaned Data')
print(cleaned_data.head())
print("Shape of cleaned data:", cleaned_data.shape)


SAVING AND RELOADING THE CLEANED DATA

# Save the cleaned data to a CSV file named 'dataset_Superstore_cleaned.csv'
cleaned_data.to_csv('dataset_Superstore_cleaned.csv', index=False)
print("Cleaned data saved to 'dataset_Superstore_cleaned.csv'.")

# Load the cleaned data from the same CSV file into a new DataFrame.
# CSV files do not store data types, so the date columns are parsed again.
cleaned_data = pd.read_csv('dataset_Superstore_cleaned.csv',
                           parse_dates=['Order Date', 'Ship Date'])
cleaned_data.head()


SUMMARY STATISTICS AND ANALYSIS OF CLEANED DATA

# Check data summary statistics
print("Summary Statistics of Cleaned Data:")
print(cleaned_data.describe())

# Visual inspection of a random sample
print("Random sample of cleaned data:")
print(cleaned_data.sample(5))

# Central tendency
mean_sales = cleaned_data['Sales'].mean()
median_sales = cleaned_data['Sales'].median()

# Spread
std_sales = cleaned_data['Sales'].std()
sales_range = cleaned_data['Sales'].max() - cleaned_data['Sales'].min()

# Composition of the categorical feature
category_counts = cleaned_data['Category'].value_counts()

# Display summary statistics
print("//-------------------//")
print("Summary Statistics:")
print("//-------------------//")
print("Numerical Features:")
print(f"Mean Sales: {mean_sales}")
print(f"Median Sales: {median_sales}")
print(f"Standard Deviation (Sales): {std_sales}")
print(f"Sales Range: {sales_range}")
print("\nCategorical Features:")
print(category_counts)


DATA VERIFICATION AFTER CLEANING

from scipy.stats import ttest_ind

# Check for missing values after cleaning
missing_values = cleaned_data.isnull().sum()
if missing_values.sum() == 0:
    print("No missing values in the cleaned dataset")
else:
    print("Missing values found in the cleaned dataset:")
    print(missing_values)

# Consistency check: compare row counts before and after cleaning.
# Index labels are not compared, because reloading the cleaned CSV resets the index.
rows_removed = len(data) - len(cleaned_data)
if rows_removed == 0:
    print("Consistency check passed: cleaning removed no rows.")
else:
    print(f"Consistency check: cleaning removed {rows_removed} rows.")

# Statistical test: t-test on 'Sales' before and after cleaning
# (missing values are dropped so the statistic is not NaN)
t_statistic, p_value = ttest_ind(data['Sales'].dropna(), cleaned_data['Sales'])
print("T-statistic:", t_statistic)
print("P-value:", p_value)
if p_value < 0.05:
    print("The difference in 'Sales' before and after cleaning is statistically significant.")
else:
    print("No statistically significant difference in 'Sales' before and after cleaning.")

# Explanation:
# - The ttest_ind function from scipy.stats computes the t-statistic and p-value.
# - The t-statistic measures the difference between the mean of "Sales" before and
#   after cleaning relative to its variability; the p-value indicates how likely
#   such a difference would be if the two means were equal.
# - If the p-value is less than 0.05, the difference is statistically significant;
#   otherwise there is no statistically significant difference.


BOXPLOT AFTER CLEANING THE DATA

# Visualize outliers in the numerical columns of the cleaned data
plt.figure(figsize=(10, 6))
sns.boxplot(data=cleaned_data.select_dtypes(include='number'))
plt.title('Boxplot of Cleaned Data')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
plt.show()


Boxplot of Cleaned Data: This visualization is a boxplot that shows how the cleaned dataset's numerical features are distributed. It sheds light on each feature's central tendency, dispersion, and outliers. The x-axis shows the numerical features, while the y-axis shows the range of values for each feature. Any outliers are shown as individual data points outside the boxplot's whiskers.
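
To make the whisker rule concrete, here is a small sketch of the 1.5 * IQR limits that a standard boxplot uses to decide which points are drawn individually. The helper name `iqr_outlier_mask` and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical sample: seven similar values and one extreme value
sample = np.array([10, 12, 11, 13, 12, 11, 10, 95])
mask = iqr_outlier_mask(sample)

# Only the extreme value falls outside the whisker limits
print(sample[mask])
```

Because the features in the dataset have very different scales, a value can be an outlier for one feature while looking unremarkable on the shared y-axis, so applying the rule per column is usually more informative than reading it off a combined plot.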

PLOTTING TOTAL PROFIT WITH CATEGORICAL BAR CHARTS

# The two charts share one row, so a 1 x 2 grid is used
plt.figure(figsize=(16, 6))

# First chart: total profit by product category
plt.subplot(1, 2, 1)
category_profits = data.groupby('Category', observed=False)['Profit'].sum().sort_values(ascending=False)
plt.bar(category_profits.index, category_profits.values, color='lightblue')
plt.xlabel('Product Category')
plt.ylabel('Total Profit')
plt.title('Total Profit by Product Category')
plt.xticks(rotation=45)

# Second chart: total profit by product sub-category
plt.subplot(1, 2, 2)
subcategory_profits = data.groupby('Sub-Category', observed=False)['Profit'].sum().sort_values(ascending=False)
plt.bar(subcategory_profits.index, subcategory_profits.values, color='lightgreen')
plt.xlabel('Product Sub-Category')
plt.ylabel('Total Profit')
plt.title('Total Profit by Product Sub-Category')
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

## Justification ##
# The two bar charts show the total profit by product category and by sub-category.
# They help identify which categories and sub-categories contribute the most
# to the overall profit.


Total Profit by Product Category: This bar chart shows the total profit produced by each product category. The product categories are shown on the x-axis and the total profit on the y-axis, so the height of each bar represents the total profit for one category. The bars are colored light blue, and the x-axis labels are rotated by 45 degrees for easier reading. The title of the chart, "Total Profit by Product Category," gives the visualization a clear context.

Total Profit by Product Sub-Category: This bar chart shows the total profit made by each product sub-category. As in the previous chart, the total profit is shown on the y-axis, while the sub-categories are shown on the x-axis, and the height of each bar represents the total profit for that sub-category. For distinction, the bars are colored light green. The x-axis labels are rotated by ninety degrees to enhance readability. This chart is titled "Total Profit by Product Sub-Category."

PLOTTING SCATTER PLOTS FOR CONTINUOUS DATA

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Generate illustrative random data for the first three panels
# (a fixed seed keeps the figure reproducible)
rng = np.random.default_rng(42)
x = rng.random(100)
y = 0.5 * x + rng.random(100) * 0.2
sizes = rng.random(100) * 100
categories = rng.choice(['Category A', 'Category B', 'Category C'], size=100)

fig, axs = plt.subplots(2, 2, figsize=(12, 10))

# Panel 1: scatter plot with a fitted regression line
axs[0, 0].scatter(x, y, color='steelblue')
slope, intercept, _, _, _ = stats.linregress(x, y)
axs[0, 0].plot(x, slope * x + intercept, color='red', linestyle='--', label='Regression Line')
axs[0, 0].set_xlabel('X')
axs[0, 0].set_ylabel('Y')
axs[0, 0].set_title('Scatter Plot with Regression Line')
axs[0, 0].legend()

# Panel 2: scatter plot colored by category
for category in np.unique(categories):
    mask = categories == category
    axs[0, 1].scatter(x[mask], y[mask], label=category)
axs[0, 1].set_xlabel('X')
axs[0, 1].set_ylabel('Y')
axs[0, 1].set_title('Scatter Plot with Categorical Colors')
axs[0, 1].legend()

# Panel 3: scatter plot with point sizes driven by a third variable
axs[1, 0].scatter(x, y, s=sizes, alpha=0.6)
axs[1, 0].set_xlabel('X')
axs[1, 0].set_ylabel('Y')
axs[1, 0].set_title('Scatter Plot with Varying Point Sizes')

# Panel 4: Sales vs. Profit from the Superstore data
axs[1, 1].scatter(data['Sales'], data['Profit'], color='steelblue', alpha=0.6)
axs[1, 1].set_xlabel('Sales')
axs[1, 1].set_ylabel('Profit')
axs[1, 1].set_title('Sales vs. Profit')
fig.tight_layout()

# Separate figure: Sales vs. Profit with a linear trend line
# (rows with missing values are dropped so that polyfit does not fail)
sales_profit = data[['Sales', 'Profit']].dropna()
plt.figure(figsize=(10, 6))
plt.scatter(sales_profit['Sales'], sales_profit['Profit'], color='steelblue', alpha=0.6)
plt.title('Relationship between Sales and Profit')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.grid(True)
z = np.polyfit(sales_profit['Sales'], sales_profit['Profit'], 1)
p = np.poly1d(z)
plt.plot(sales_profit['Sales'], p(sales_profit['Sales']), color='red', linestyle='--', label='Trend Line')
plt.legend()
plt.tight_layout()
plt.show()

## Justification ##
# The code generates scatter plots to visualize relationships between variables.
# It includes scatter plots with a regression line, categorical colors, and varying
# point sizes, plus a specific plot of Sales vs. Profit with a trend line.


Scatter Plot with Regression Line:
-> Shows data points scattered on a plot
-> A straight line is fitted through the points to show the overall trend
-> The x-axis represents one variable (X), and the y-axis represents another variable (Y)
-> The line indicates the relationship between X and Y

Scatter Plot with Categorical Colors:
-> Data points are colored differently based on categories (A, B, C)
-> Each color represents a different category
-> Helps visualize patterns across different categories

Scatter Plot with Varying Point Sizes:
-> Point sizes vary based on a third variable
-> Larger points represent higher values of the third variable
-> Shows the relationship between two variables while incorporating a third variable

Sales vs. Profit Scatter Plot with Trend Line:
-> Specific plot showing the relationship between Sales (x-axis) and Profit (y-axis)
-> A trend line is fitted to the data points using linear regression
-> Helps understand the overall direction and strength of the relationship between Sales and Profit

In summary, these scatter plots visualize relationships between variables, with additional features like regression lines, categorical colors, and varying point sizes to provide more insight into the data. The first three plots use randomly generated data to illustrate each plot type; only the Sales vs. Profit plots use the Superstore data.

K-MEANS CLUSTERING MODEL

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Standardize the numerical features.
# Note: select_dtypes keeps every numeric column, including identifier-like
# columns such as 'Row ID'. 'Cluster' is dropped so that rerunning this cell
# does not cluster on labels from a previous run.
features = data.select_dtypes(include=[np.number]).drop(columns=['Cluster'], errors='ignore')
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)

# Determine the number of clusters using the elbow method
inertia = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), inertia, marker='o', linestyle='--')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.grid(True)
plt.show()

# Based on the elbow curve, let's choose 5 clusters
n_clusters = 5

# Perform k-means clustering and add the cluster labels to the DataFrame
kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=42)
kmeans.fit(X_scaled)
data['Cluster'] = kmeans.labels_

# Visualize the clusters in the Sales/Profit plane
plt.figure(figsize=(10, 6))
plt.scatter(data['Sales'], data['Profit'], c=data['Cluster'], cmap='viridis', alpha=0.5)
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.title('Profit Segmentation')
plt.grid(True)
plt.colorbar(label='Cluster')
plt.show()

# Calculate and print the average profit for each cluster
cluster_profit_means = data.groupby('Cluster')['Profit'].mean()
print(cluster_profit_means)

# Plot histograms of the profit distribution within each cluster
plt.figure(figsize=(12, 6))
for cluster_label in range(n_clusters):
    plt.hist(data[data['Cluster'] == cluster_label]['Profit'], bins=20, alpha=0.5,
             label=f'Cluster {cluster_label}')
plt.xlabel('Profit')
plt.ylabel('Frequency')
plt.title('Profit Distribution within Clusters')
plt.legend()
plt.grid(True)
plt.show()

# Visualize the change in profit over time for each cluster
plt.figure(figsize=(12, 6))
for cluster_label in range(n_clusters):
    cluster_data = data[data['Cluster'] == cluster_label].set_index('Order Date')
    # Monthly sum of profit ('M' is spelled 'ME' in pandas 2.2 and later)
    cluster_profit = cluster_data.resample('M')['Profit'].sum()
    plt.plot(cluster_profit.index, cluster_profit.values, label=f'Cluster {cluster_label}')
plt.xlabel('Date')
plt.ylabel('Profit')
plt.title('Change in Profit over Time')
plt.legend()
plt.grid(True)
plt.show()

## Justification ##
# The code performs K-means clustering to identify distinct groups within the data
# based on standardized numerical features.
# It chooses the number of clusters with the elbow method and then applies K-means
# with the chosen number of clusters.
# After clustering, it computes the mean profit per cluster and plots monthly
# profit per cluster over time.
# It also shows the distribution of profit within each cluster using histograms and
# represents the clusters in a scatter plot of sales against profit, with one color
# per cluster. These visualizations aid in understanding the structure of the data
# and identifying patterns within the clusters.


EXPLAINING THE CODE

First, the dataset's numerical features are standardized using scikit-learn's StandardScaler. K-means relies on Euclidean distances and is therefore sensitive to feature magnitudes; this preprocessing step puts all features on a consistent scale so that no single feature dominates the distance calculation. This procedure complies with accepted preprocessing standards for machine learning (Pedregosa et al., 2011).
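
The effect of standardization can be checked on a small example. The sketch below uses a hypothetical two-column array (one column on a scale of hundreds, one on a scale of fractions) and confirms that StandardScaler produces columns with zero mean and unit variance, which matches the manual formula (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales:
# first column in currency units, second column as a fraction
X = np.array([
    [100.0, 0.0],
    [250.0, 0.2],
    [4000.0, 0.1],
    [50.0, 0.5],
])

X_scaled = StandardScaler().fit_transform(X)

# Equivalent manual computation (population standard deviation, ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("Column means after scaling:", X_scaled.mean(axis=0))
print("Column std devs after scaling:", X_scaled.std(axis=0))
```

Without this step, the first column would dominate every Euclidean distance simply because its values are numerically larger.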

After that, the elbow method is used to choose the number of clusters. The inertia (within-cluster sum of squares) is calculated and plotted for different cluster counts; the elbow point, where adding further clusters yields only small reductions in inertia, suggests a suitable cluster count and provides information about the structure of the dataset. For cluster count selection in K-means clustering, this method is a well-known heuristic (Thorndike, 1953).
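
The elbow heuristic is easiest to see on data whose true number of groups is known. The following sketch assumes synthetic data from scikit-learn's `make_blobs` with three well-separated groups (not the Superstore data) and prints how much the inertia drops with each additional cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustration only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Inertia (within-cluster sum of squares) for k = 1..6
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    model.fit(X)
    inertias.append(model.inertia_)

# Reduction in inertia gained by each additional cluster
drops = -np.diff(inertias)
for k, drop in zip(range(2, 7), drops):
    print(f"k={k}: inertia reduced by {drop:.1f}")
```

In this sketch the reductions are large up to k = 3 and small afterwards, which is the "elbow". On real data the bend is often less pronounced, so the choice is usually cross-checked with a metric such as the Silhouette Score.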

K-means is then run with the selected cluster count and the k-means++ initialization technique, which assigns a cluster label to each data point. According to Arthur and Vassilvitskii (2007), k-means++ improves convergence and reduces sensitivity to the initial centroid positions, which leads to more robust clustering outcomes.

After clustering, the cluster characteristics analysis calculates the average profit in each cluster, which helps distinguish the clusters from one another. These kinds of analyses are essential to exploratory data analysis and help to understand cluster behaviors (Tukey, 1977).

Plotted trends over time, profit distribution histograms, and scatter plots that show cluster separations in feature space are examples of visualizations that are essential to the cluster characterization process. These graphic depictions aid in intuitive interpretation and offer useful insights into cluster dynamics (Wickham, 2010).

IDENTIFYING CLUSTER CENTROIDS USING KMEANS CLUSTERING

# Import KMeans from scikit-learn
from sklearn.cluster import KMeans

# Instantiate KMeans with 5 clusters, 'k-means++' initialization, and a fixed random state
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)

# Fit KMeans to the standardized data
kmeans.fit(X_scaled)

# Retrieve and print the cluster centroids
cluster_centroids = kmeans.cluster_centers_
print("Cluster Centroids:")
print(cluster_centroids)


Initializing KMeans: The code first initializes the KMeans object with parameters such as the desired number of clusters (n_clusters=5), the initialization method for centroids (init='k-means++'), and a fixed random seed for reproducibility (random_state=42).

Fitting the Model: The KMeans object is then fitted to the standardized data (X_scaled). This process involves grouping data points into clusters based on their similarity and iteratively updating cluster centroids until convergence.
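
The iterative update described above can be sketched in a few lines of NumPy. This is a simplified illustration of the assign-then-update loop under stated assumptions (fixed starting centroids, tiny hypothetical data), not scikit-learn's implementation, which adds k-means++ initialization and other refinements. The function name `lloyd_kmeans` is hypothetical.

```python
import numpy as np

def lloyd_kmeans(X, initial_centroids, max_iter=100):
    """Plain k-means iterations: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster becomes empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        # Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two obvious groups of three points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = lloyd_kmeans(X, initial_centroids=X[[0, 3]])
print(labels)
print(centroids)
```

Each centroid ends up at the mean of its group, which is why the text describes centroids as the average position of the points in a cluster.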

Getting Cluster Centroids: After fitting the model, the code retrieves the centroids of the clusters using the cluster_centers_ attribute. These centroids represent the average position of data points within each cluster.

Printing Cluster Centroids: Finally, the code prints the cluster centroids, providing a summary of each cluster's characteristics. Because the model was fitted on X_scaled, the centroids are expressed in standardized units rather than in the original feature units.
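
Since the model is fitted on standardized data, the values in `cluster_centers_` are in standardized units. A minimal sketch of converting them back to the original units with the scaler's `inverse_transform` is shown below; the small DataFrame `df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data with two obvious groups
df = pd.DataFrame({
    'Sales':  [10.0, 12.0, 11.0, 500.0, 520.0, 510.0],
    'Profit': [1.0, 2.0, 1.5, 100.0, 110.0, 105.0],
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.to_numpy())
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)

# Map the centroids from standardized units back to the original units
centroids_original = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=df.columns,
).sort_values('Sales').reset_index(drop=True)
print(centroids_original)
```

Centroids in original units are easier to interpret for business purposes, for example as "a typical low-sales order" versus "a typical high-sales order".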

SILHOUETTE SCORE, DAVIES-BOULDIN INDEX, AND CLUSTER SEPARATION

# Import silhouette_score and davies_bouldin_score from sklearn.metrics
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Standardize the numerical features using StandardScaler.
# The 'Cluster' label column added earlier is excluded so that the
# labels themselves are not used as a clustering feature.
features = data.select_dtypes(include=[np.number]).drop(columns=['Cluster'], errors='ignore')
X_scaled = StandardScaler().fit_transform(features)

# Perform k-means clustering with the specified number of clusters,
# 'k-means++' initialization, and a fixed random state
kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=42).fit(X_scaled)

# Calculate and print both evaluation metrics
silhouette_avg = silhouette_score(X_scaled, kmeans.labels_)
davies_bouldin = davies_bouldin_score(X_scaled, kmeans.labels_)
print("Silhouette Score:", silhouette_avg)
print("Davies-Bouldin Index:", davies_bouldin)

# Look up the Sales and Profit columns by name rather than by fixed position,
# so that the axis labels match the plotted data
sales_idx = features.columns.get_loc('Sales')
profit_idx = features.columns.get_loc('Profit')

# Visualize cluster separation in the feature space with centroids marked
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, sales_idx], X_scaled[:, profit_idx], c=kmeans.labels_, cmap='viridis', alpha=0.5)
plt.scatter(kmeans.cluster_centers_[:, sales_idx], kmeans.cluster_centers_[:, profit_idx],
            marker='x', s=100, color='red', label='Centroids')
plt.xlabel('Scaled Sales')
plt.ylabel('Scaled Profit')
plt.title('K-means Clustering: Cluster Separation')
plt.legend()
plt.grid(True)
plt.show()


1. Silhouette Score: The Silhouette Score measures how well each data point fits into its assigned cluster compared to other clusters. It takes into account both the cohesion within a cluster and the separation between different clusters. The score ranges from -1 to 1, where higher values indicate better clustering. In our case, the Silhouette Score is approximately 0.1616. This value is positive, which indicates that points are on average closer to their own cluster than to the nearest neighboring cluster, but it is relatively low, which suggests that the clusters overlap. A score closer to 1 would indicate a better-separated clustering solution.

2. Davies-Bouldin Index: The Davies-Bouldin Index is another metric used to evaluate the quality of clustering. It measures the average similarity between each cluster and its most similar cluster, comparing the scatter within clusters to the distance between their centroids. Lower values indicate better clustering, with 0 as the lower bound. In our case, the Davies-Bouldin Index is approximately 1.781.

3. Visual Inspection of Cluster Separation: To gain a visual understanding of the clustering results, a scatter plot is created. Each data point is represented by a dot, and the color of the dot corresponds to its assigned cluster. Additionally, the cluster centroids (cluster centers) are marked with red 'x' markers. This visualization allows you to assess the separation and compactness of the clusters in the feature space.

In summary, the Silhouette Score of 0.1616 suggests that the clusters are only weakly separated and that the solution may not be optimal. The Davies-Bouldin Index provides another perspective on the clustering quality, but it has no fixed threshold, so its interpretation depends on the specific context and on comparison with other solutions, such as different numbers of clusters. The visual inspection of the cluster separation through the scatter plot allows for a qualitative assessment of the clustering results.
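
To give the two metrics some context, the sketch below compares them on two synthetic datasets generated with `make_blobs`: one with compact, well-separated groups and one with heavily overlapping groups. The helper name `evaluate` and the chosen spreads are assumptions for illustration; the numbers it prints are not comparable to the Superstore results.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

def evaluate(cluster_std):
    """Cluster synthetic blobs with the given spread and return both metrics."""
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=cluster_std, random_state=42)
    labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
    return silhouette_score(X, labels), davies_bouldin_score(X, labels)

# Compact groups versus heavily overlapping groups
sil_tight, db_tight = evaluate(cluster_std=0.5)
sil_loose, db_loose = evaluate(cluster_std=6.0)

print(f"Compact groups:     silhouette={sil_tight:.3f}, Davies-Bouldin={db_tight:.3f}")
print(f"Overlapping groups: silhouette={sil_loose:.3f}, Davies-Bouldin={db_loose:.3f}")
```

The compact case yields the higher Silhouette Score and the lower Davies-Bouldin Index. The same comparison could be run across several candidate values of `n_clusters` on the real data to see whether another cluster count separates the data better.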

FACTOR SCORES AND CORRELATION MATRIX

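from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

print(data.dtypes)

# Select the numerical columns, skipping factor columns from a previous run
numerical_cols = data.select_dtypes(include=['float64', 'int64'])
numerical_cols = numerical_cols.loc[:, ~numerical_cols.columns.str.startswith('Factor')]

# Standardize before PCA so that columns with large values (such as Sales)
# do not dominate the components
scaled_cols = StandardScaler().fit_transform(numerical_cols)

# Extract five principal components and add their scores to the DataFrame
pca = PCA(n_components=5)
factor_scores = pca.fit_transform(scaled_cols)
for i in range(5):
    data[f'Factor {i+1}'] = factor_scores[:, i]

# Correlation matrix of the factor scores
corr_matrix = data[['Factor 1', 'Factor 2', 'Factor 3', 'Factor 4', 'Factor 5']].corr()
print("Correlation matrix of factor scores:")
print(corr_matrix)

# The code performs Principal Component Analysis (PCA) on the numerical data to
# reduce dimensionality and extract factor scores.
# It then adds the factor scores as new columns to the original DataFrame and
# calculates the correlation matrix between these factors.
# Principal component scores are uncorrelated by construction, so this matrix is
# expected to be close to the identity matrix; it serves as a sanity check.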
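
Principal component scores are uncorrelated by construction, so the share of variance explained by each component is usually the more informative output. The sketch below illustrates both points on synthetic data in which two of three columns are nearly collinear; the data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: columns 0 and 1 share a common signal, column 2 is independent
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(500, 1)),
    2 * base + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

# Standardize, then project onto the principal components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
scores = pca.fit_transform(X_std)

# Correlation matrix of the scores: approximately the identity matrix
corr = np.corrcoef(scores, rowvar=False)
print(np.round(corr, 3))

# Share of total variance captured by each component
print(np.round(pca.explained_variance_ratio_, 3))
```

Here the first component captures the shared signal of the two collinear columns. Inspecting `explained_variance_ratio_` in the same way on the real data would show how many factors are worth keeping.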


OVERALL ANALIZATION OF THE PROJECT

The report analyzes dataset_Superstore.csv using a sequence of data analytics techniques and shows how such techniques can turn raw sales records into information that supports strategic decisions and better use of resources.

In the data exploration and cleaning phase, we began by importing the libraries used throughout the analysis: pandas, numpy, matplotlib, seaborn, and scikit-learn. The dataset was loaded with read_csv(), and an initial assessment was carried out with head() and info() to understand its structure and contents. We converted data types and cleaned columns to keep the data consistent, handled missing values, and removed duplicate records. Outliers in the numerical features were identified and visualized with boxplots, and summary statistics were calculated to describe the distribution and central tendency of the numerical variables.

In the exploratory data analysis phase, we plotted total profit by product category and sub-category to identify high-profit areas, and used scatter plots to explore relationships between continuous variables. We then applied K-means clustering to segment the records by their sales and profit attributes, and examined the cluster centroids to understand what distinguishes each group.

In the dimensionality reduction and factor analysis phase, Principal Component Analysis (PCA) was applied to the numerical columns. The resulting factor scores were added to the original dataset, and a correlation matrix of those scores was computed. Because principal components are uncorrelated by construction, this matrix is expected to be close to the identity and serves mainly as a check that the decomposition behaved as intended.

Taken together, the analysis gives stakeholders a view of sales performance, including trends, outliers, and high-profit areas, on which decisions can be based. Targeted strategies can now be developed to optimize sales and maximize profit: focusing on high-profit product categories and sub-categories, using the clusters to understand distinct sales and profit segments, and using dimensionality reduction techniques such as PCA to simplify future analyses. Further work may include predictive modeling, more advanced clustering techniques, and continual monitoring of sales data so that strategies can adapt to changing market dynamics and consumer behavior.
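The clustering step summarized above can be sketched as follows. This is a minimal illustration on synthetic data, not the report's actual cell: the `demo` DataFrame, the choice of three clusters, and the column names `Sales` and `Profit` are assumptions made for the example. It fits K-means on standardized features, converts the centroids back to original units, and reports the silhouette score and Davies-Bouldin index.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Sales and Profit columns (three loose groups).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Sales": np.concatenate([
        rng.normal(100, 20, 100),
        rng.normal(500, 50, 100),
        rng.normal(1000, 80, 100),
    ]),
    "Profit": np.concatenate([
        rng.normal(10, 5, 100),
        rng.normal(80, 15, 100),
        rng.normal(-50, 20, 100),
    ]),
})

# K-means is distance-based, so put both features on the same scale.
scaler = StandardScaler()
features = scaler.fit_transform(demo[["Sales", "Profit"]])

# n_clusters=3 is an assumption for this example, not a value from the report.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
demo["Cluster"] = kmeans.fit_predict(features)

# Express the centroids in the original units for easier interpretation.
centroids = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=["Sales", "Profit"],
)

sil = silhouette_score(features, demo["Cluster"])
dbi = davies_bouldin_score(features, demo["Cluster"])

print(centroids.round(1))
print("Silhouette score:", round(sil, 3))
print("Davies-Bouldin index:", round(dbi, 3))
```

A higher silhouette score and a lower Davies-Bouldin index both indicate better-separated clusters, so comparing the two across several values of `n_clusters` is one common way to choose the number of segments.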

RECOMMENDATION

To optimize the product portfolio, the company should prioritize the products that generate the most revenue and consider discontinuing those with poor sales performance. This focus will reduce costs and serve customers better by concentrating on the most sought-after items. Analyzing the factors that drive product sales can also inform targeted promotional campaigns, such as discounts on popular items, which can attract new customers and foster loyalty among existing ones (Wickham, 2010).

Operational efficiency is another key area for improvement: streamlining order handling and optimizing product storage can save both time and money while improving customer satisfaction. Continuous monitoring of sales data and market trends is essential for adapting strategies in real time, allowing the company to stay responsive to evolving customer preferences and to maintain its competitive edge over the long term.
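One way to put the portfolio recommendation into practice is to rank sub-categories by total profit and flag the loss-making ones for review. The sketch below is hypothetical: the `demo` rows are invented for illustration, the column names `Sub-Category`, `Sales`, and `Profit` are assumed to match the Superstore schema, and the rule "total profit below zero" is only one possible review criterion.

```python
import pandas as pd

# Invented rows for illustration; not taken from dataset_Superstore.csv.
demo = pd.DataFrame({
    "Sub-Category": ["Chairs", "Chairs", "Tables", "Tables",
                     "Phones", "Phones", "Binders"],
    "Sales": [500.0, 300.0, 700.0, 400.0, 900.0, 600.0, 50.0],
    "Profit": [60.0, 40.0, -120.0, -30.0, 200.0, 150.0, 20.0],
})

# Total sales and profit per sub-category, plus the profit margin.
summary = (
    demo.groupby("Sub-Category")[["Sales", "Profit"]]
    .sum()
    .assign(Margin=lambda d: d["Profit"] / d["Sales"])
    .sort_values("Profit", ascending=False)
)

# Assumed review rule: any sub-category with negative total profit.
review_candidates = summary[summary["Profit"] < 0]

print(summary.round(3))
print("Sub-categories to review:", list(review_candidates.index))
```

Looking at margin alongside total profit helps separate small but healthy lines from large lines that lose money, which matters before deciding what to discontinue.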

REFERENCES

1. Alghushairy, O., Alsini, R., Soule, T. and Ma, X., 2020. A review of local outlier factor algorithms for outlier detection in big data streams. Big Data and Cognitive Computing, 4(2), p.8.
2. Smiti, A., 2020. A critical overview of outlier detection methods. Computer Science Review, 38, p.100306.
3. Teoh, T.T. and Rong, Z., 2022. Python for Data Analysis. In: Artificial Intelligence with Python. Springer, pp.127-148.
4. Bauer, J.M., Aarestrup, S.C., Hansen, P.G. and Reisch, L.A., 2022. Nudging more sustainable grocery purchases: behavioural innovations in a supermarket setting. Technological Forecasting and Social Change, 180, p.121731.
5. Prell, M., Zanini, M.T., Caldieraro, F. and Migueles, C., 2020. Sustainability certifications and product preference. Marketing Intelligence & Planning, 38(7), pp.840-852.
6. Fan, Y., Kou, J. and Liu, J., 2020, January. Research on the influencing factors of customer loyalty in offline supermarket under new retail model. In: Proceedings of the 2020 4th International Conference on Management Engineering, Software Engineering and Service Sciences. ACM, pp.103-108.
7. MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. University of California Press, Berkeley, Calif., pp.281-297.
8. Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, pp.53-65.
9. Davies, D.L. and Bouldin, D.W., 1979. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2), pp.224-227.
10. Kruskal, W.H. and Wallis, W.A., 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), pp.583-621.
11. García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J.M. and Herrera, F., 2016. Big data preprocessing: methods and prospects. Big Data Analytics, 1(1), p.9.
12. Aloysius, J.A., Hoehle, H., Goodarzi, S. and Venkatesh, V., 2018. Big data initiatives in retail environments: Linking service process perceptions to shopping outcomes. Annals of Operations Research, 270(1), pp.25-51.
13. Wickham, H., 2010. A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), pp.3-28.
14. Tukey, J.W., 1977. Exploratory Data Analysis. Addison-Wesley.
15. Arthur, D. and Vassilvitskii, S., 2007. K-means++: The advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, pp.1027-1035.
