Big Data Analytics Course Work 2

Student Name: Sai Prakash Trasula

HWU ID: H00449353

Date: 17-03-2024

Maximising Profits Through Big Data Analytics

Introduction

In today's data-driven business landscape, organisations are increasingly turning to big data analytics to gain a competitive edge and increase their profitability. Big Data analytics involves the process of collecting, organising, and analysing large volumes of structured and unstructured data to uncover hidden patterns, correlations, and insights that can inform strategic decision-making (Russom, 2011). By leveraging the power of big data analytics, organisations can optimise their operations, enhance customer experiences, and identify new revenue streams, ultimately leading to increased profits.

Furthermore, big data analytics can enable organisations to identify new revenue streams and business opportunities. By analysing data from various sources, organisations can uncover untapped market segments, identify cross-selling and up-selling opportunities, and develop new products and services that cater to emerging customer needs (Schmarzo, 2013). For example, by analysing customer data, a retailer may discover that a particular product category is in high demand among a specific demographic, prompting them to expand their offerings and capture a larger market share, ultimately leading to increased profits.

As a further illustration, a product-selling organisation can use K-means clustering to segment its customers into different groups based on their purchasing behaviour. By analysing the characteristics of each group, the organisation can tailor its marketing and sales strategies to target each group effectively. Additionally, the organisation can use linear regression to predict sales revenue based on factors such as price, advertising spend, and product features. By optimising these factors, the organisation can increase its revenue and profitability.

Importing Libraries:

import pandas as pd: This imports the pandas library and assigns it the alias 'pd', which is a common convention. Pandas is a powerful library for data manipulation and analysis in Python.

Reading Data: Data = pd.read_csv('Dataset_Superstore.csv'): This line reads the data from the CSV file named 'Dataset_Superstore.csv' and stores it in a pandas DataFrame named Data. The read_csv() function reads CSV files into DataFrame objects. Once this code has run, the dataset is loaded into the DataFrame Data and is ready for data cleaning, exploration, visualization, and modeling.

    # Importing libraries
    import pandas as pd

    Data = pd.read_csv('Dataset_Superstore.csv')
    Data

Data Summary:

By using Data.info(), you gain a preliminary understanding of the data's structure, data types, and potential quality issues. This initial exploration paves the way for further data cleaning, manipulation, and analysis using Pandas or other data analysis tools.

    # Summary of the data
    Data.info()
    Data

Formatting Data

These conversion operations are useful for ensuring that the data types of the columns are appropriate for the subsequent analysis. For example, converting categorical columns to the 'category' data type can save memory and speed up certain operations, while converting date columns to 'datetime' data type allows for easier manipulation and analysis of dates. Overall, this code prepares the dataset Data for further analysis by ensuring that the data types of the columns are correctly specified.

    # Formatting the data in a structured manner.
    import pandas as pd

    Data = pd.read_csv("/work/Dataset_Superstore.csv")

    # Convert the 'Customer ID' column to string data type.
    Data['Customer ID'] = Data['Customer ID'].astype('str')

    # Convert the 'Product ID' column to string data type.
    Data['Product ID'] = Data['Product ID'].astype('str')

    # Convert the 'Order ID' column to string data type.
    Data['Order ID'] = Data['Order ID'].astype('str')

    # Convert the 'Ship Mode' column to categorical data type.
    Data['Ship Mode'] = Data['Ship Mode'].astype('category')

    # Convert the 'Category' column to categorical data type.
    Data['Category'] = Data['Category'].astype('category')

    # Convert the 'Sub-Category' column to categorical data type.
    Data['Sub-Category'] = Data['Sub-Category'].astype('category')

    # Convert the 'Order Date' column to datetime data type using the specified format.
    Data['Order Date'] = pd.to_datetime(Data['Order Date'], format='%d/%m/%Y')

    # Convert the 'Ship Date' column to datetime data type using the specified format.
    Data['Ship Date'] = pd.to_datetime(Data['Ship Date'], format='%d/%m/%Y')

    Data
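
The benefit of these conversions can be illustrated on a small scale. The sketch below is only an illustration: it assumes a tiny in-memory CSV with made-up rows in place of Dataset_Superstore.csv (the column names follow the dataset, the values are hypothetical), requests the string and category types while reading, and then uses the parsed dates for arithmetic.

```python
import io

import pandas as pd

# Made-up rows standing in for Dataset_Superstore.csv (values are hypothetical).
csv_text = """Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Category
CA-1,08/11/2016,11/11/2016,Second Class,CG-1,Furniture
CA-2,12/06/2016,16/06/2016,Standard Class,DV-2,Office Supplies
CA-3,11/10/2015,18/10/2015,Standard Class,SO-3,Furniture
"""

# Request string and category types while the file is being read.
sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Order ID": "str", "Customer ID": "str",
           "Ship Mode": "category", "Category": "category"},
)

# Parse the day-first dates explicitly, using the same format string as above.
for col in ["Order Date", "Ship Date"]:
    sample[col] = pd.to_datetime(sample[col], format="%d/%m/%Y")

# Datetime columns support arithmetic, e.g. days between order and shipment.
sample["Days to Ship"] = (sample["Ship Date"] - sample["Order Date"]).dt.days

print(sample.dtypes)
print(sample["Days to Ship"].tolist())
```

With text columns left as plain strings, the subtraction in the last step would not be possible, which is the practical reason for converting the date columns early.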

Cleaning of Data

The code Data.isnull().sum() counts the missing (null) values in each column of the DataFrame Data.

Checking for missing values is an essential step in data cleaning and preprocessing, as it helps identify data quality issues and decide how to handle missing data appropriately, such as imputation or removal.

    # Cleaning data
    Data.isnull().sum()

The above shows that there are no missing values in any of the columns of the DataFrame Data.
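
Had any column contained missing values, the two options mentioned earlier (removal or imputation) could be applied roughly as follows. This is a minimal sketch on a made-up DataFrame, not on the Superstore data; the choice of the median and the most frequent value as fill-ins is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing values (not taken from the Superstore data).
toy = pd.DataFrame({
    "Sales": [100.0, np.nan, 250.0, 400.0],
    "Ship Mode": ["First Class", "Second Class", None, "Second Class"],
})

# Count the missing values per column, as in the check above.
print(toy.isnull().sum())

# Option 1: removal, dropping every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: imputation, using the median for numbers and the most frequent value for text.
imputed = toy.copy()
imputed["Sales"] = imputed["Sales"].fillna(imputed["Sales"].median())
imputed["Ship Mode"] = imputed["Ship Mode"].fillna(imputed["Ship Mode"].mode()[0])

print(dropped)
print(imputed)
```

Removal keeps only fully observed rows, while imputation keeps every row at the cost of introducing estimated values.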

Finding Difference or Unique IDs in Main Columns

The code snippet checks the uniqueness of specific columns in the DataFrame named Data.

List of Columns to Check: columns_to_check=['Row ID', 'Order ID', 'Customer ID', 'Product ID']: This list contains the names of the columns that you want to check for uniqueness. In this case, the columns are 'Row ID', 'Order ID', 'Customer ID', and 'Product ID'.

Iteration through Columns: for column in columns_to_check: This loop iterates over each column name specified in columns_to_check.

Uniqueness Check: For each column, the code checks whether all values in that column are unique using the is_unique attribute of the column (Data[column].is_unique). If all values are unique, it prints a message indicating that the column contains unique values. Otherwise, it prints a message indicating that the column does not contain unique values.

Printing Results: The code prints a message for each column indicating whether it contains unique values or not.

    # Check uniqueness of each column
    columns_to_check = ['Row ID', 'Order ID', 'Customer ID', 'Product ID']

    for column in columns_to_check:
        if Data[column].is_unique:
            print(f"The '{column}' column contains unique values.")
        else:
            print(f"The '{column}' column does not contain unique values.")
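
The is_unique attribute gives a yes/no answer only. As a complementary sketch, nunique() reports how many distinct values each column holds, which shows how far a column is from being unique. The toy DataFrame below is made up for illustration.

```python
import pandas as pd

# Toy order lines: one order contains two products, one customer orders three times.
toy = pd.DataFrame({
    "Row ID": [1, 2, 3, 4],
    "Order ID": ["A-1", "A-1", "B-2", "C-3"],
    "Customer ID": ["X", "X", "Y", "X"],
})

# Distinct-value count and uniqueness flag for every column.
summary = pd.DataFrame({
    "distinct": toy.nunique(),
    "is_unique": toy.apply(lambda s: s.is_unique),
})
summary["rows"] = len(toy)

print(summary)
```

A column whose distinct count equals the row count is unique; a lower count, as expected for order and customer identifiers, simply means the same order or customer appears on several rows.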

Finding Difference in Shipment

The code below provides a practical approach to handling duplicate records in the dataset, enhancing data quality and integrity for subsequent analysis and applications. It first identifies the duplicate records, that is, rows that share the same Order ID, Product ID, and Customer ID.

    # Load the dataset into a DataFrame
    data = pd.read_csv("/work/Dataset_Superstore.csv")

    # Columns that together identify an order line
    key_columns = ['Order ID', 'Product ID', 'Customer ID']

    # Flag every row whose key combination occurs more than once.
    # Comparing the columns directly avoids the false matches that joining
    # the values into one string without a separator could produce.
    duplicate_records = data[data.duplicated(subset=key_columns, keep=False)]

    # Drop rows that are identical in every column, keeping one copy of each
    # (duplicates can also be dropped from the full dataset with data.drop_duplicates())
    duplicate_records_rewritten = duplicate_records.drop_duplicates()

    # Print the rewritten duplicate records
    print(duplicate_records_rewritten)

Removing of duplicate data.

This code snippet serves to cleanse the dataset by removing duplicate records, thereby improving data quality and facilitating more accurate analysis and decision-making.

    import pandas as pd

    # Define the relevant columns
    relevant_columns = ['Order ID', 'Product ID', 'Customer ID']

    # Remove duplicate records based on relevant columns while keeping the first occurrence
    data = data.drop_duplicates(subset=relevant_columns, keep='first')

    # Print the DataFrame after removing duplicates
    print(data)
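
The effect of the two steps (flagging duplicates, then keeping the first occurrence) can be seen on a small made-up example. The rows below are hypothetical and only reuse the dataset's column names.

```python
import pandas as pd

# Toy order lines: rows 0 and 1 share the same order, product, and customer.
toy = pd.DataFrame({
    "Order ID": ["A-1", "A-1", "A-1", "B-2"],
    "Product ID": ["P-10", "P-10", "P-20", "P-10"],
    "Customer ID": ["X", "X", "X", "Y"],
    "Quantity": [2, 3, 1, 5],
})
keys = ["Order ID", "Product ID", "Customer ID"]

# keep=False marks every member of a duplicate group, so both copies are visible.
flagged = toy[toy.duplicated(subset=keys, keep=False)]

# keep='first' retains the first row of each group and drops the rest.
deduplicated = toy.drop_duplicates(subset=keys, keep="first")

print(flagged)
print(deduplicated)
```

Note that the dropped row carried a different Quantity, so keeping only the first occurrence discards that value. Whether this is acceptable depends on whether such rows are true duplicates or separate order lines.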

Calculating Numerical Summary statistics

This code helps to understand the central tendency and dispersion of the numerical data in the DataFrame, providing insights into the distribution of the data and identifying potential outliers or patterns.

    # Numerical summary statistics
    import pandas as pd

    # Assuming 'Data' is your DataFrame and 'numerical_columns' is a list of numerical columns
    numerical_columns = ['Sales', 'Quantity', 'Discount']

    # Calculate the mean, median, standard deviation, and quartiles for each numerical column
    descriptive_statistics = Data[numerical_columns].describe()

    # Display the statistical analysis
    print(descriptive_statistics)

    # Calculate the mean and median for each numerical column
    means = Data[numerical_columns].mean()
    medians = Data[numerical_columns].median()

    # Display the mean and median
    print("Means:")
    print(means)
    print("\nMedians:")
    print(medians)

    # Calculate the standard deviation for each numerical column
    std_dev = Data[numerical_columns].std()

    # Display the standard deviation
    print("\nStandard Deviation:")
    print(std_dev)
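
To show how these statistics hint at the shape of a distribution, the following sketch compares the mean and median of a made-up, right-skewed set of sales figures and adds the skewness and interquartile range. The numbers are hypothetical and chosen only to make the effect visible.

```python
import pandas as pd

# Hypothetical, right-skewed sales figures: many small orders and one large one.
toy = pd.DataFrame({"Sales": [10.0, 12.0, 15.0, 18.0, 20.0, 25.0, 500.0]})

# Central tendency, dispersion, and skewness in one call.
stats = toy["Sales"].agg(["mean", "median", "std", "skew"])

# Interquartile range: the spread of the middle half of the data.
q1, q3 = toy["Sales"].quantile([0.25, 0.75])
iqr = q3 - q1

print(stats)
print(f"IQR: {iqr}")
```

A mean well above the median, together with positive skewness, indicates that a few large values are pulling the average upwards, which is a common pattern in sales data and a prompt to look for outliers.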

Outlier Detection using 3 Standard Deviations Threshold

This code snippet provides a systematic approach to identifying outliers in numerical data: any value more than three standard deviations away from its column mean is flagged, which is essential for data quality assessment and analysis.

    # Detecting outliers in numerical data using a threshold-based approach
    import pandas as pd

    # Assuming 'Data' is your DataFrame and 'numerical_columns' is a list of numerical columns
    numerical_columns = ['Sales', 'Quantity', 'Discount']
    outlier_threshold = 3

    # Calculate the mean and standard deviation for each numerical column
    mean = Data[numerical_columns].mean()
    std = Data[numerical_columns].std()

    # Calculate the lower and upper bounds for detecting outliers
    lower_bound = mean - (outlier_threshold * std)
    upper_bound = mean + (outlier_threshold * std)

    # Identify outliers based on the bounds
    outliers = (Data[numerical_columns] < lower_bound) | (Data[numerical_columns] > upper_bound)

    # Display the outliers
    print("Outliers:")
    print(Data[outliers.any(axis=1)])
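
The three-standard-deviation rule can be checked on a small made-up series. The sketch below also applies the interquartile-range (IQR) rule, a common alternative that is not used elsewhere in this report and is shown here only for comparison; the data are hypothetical.

```python
import pandas as pd

# Hypothetical sales: twenty ordinary orders (10 to 29) and one extreme value.
sales = pd.Series([float(v) for v in range(10, 30)] + [1000.0], name="Sales")

# Three-standard-deviation rule, as in the section above.
z_scores = (sales - sales.mean()) / sales.std()
sigma_outliers = sales[z_scores.abs() > 3]

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]

print(sigma_outliers)
print(iqr_outliers)
```

Both rules flag the extreme order here. Because the mean and standard deviation are themselves inflated by extreme values, the three-standard-deviation rule can miss outliers in very small or heavily skewed samples, which is why a second method is a useful cross-check.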

Saving the Cleaned Data

    # 'data' is the de-duplicated DataFrame from the duplicate-removal step;
    # it is saved directly, without re-reading the raw file, so the cleaning is preserved.
    # Save the cleaned data to a CSV file named Dataset_Superstore_cleaned.csv
    data.to_csv("/work/Dataset_Superstore_cleaned.csv", index=False)
    print("Cleaned data saved successfully.")
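
One point worth keeping in mind when saving to CSV is that the file stores text only, so datetime and category types are not preserved and have to be requested again on reload. The sketch below illustrates this with a made-up DataFrame and an in-memory buffer standing in for the file on disk.

```python
import io

import pandas as pd

# Made-up rows with a datetime column and a categorical column.
toy = pd.DataFrame({
    "Order Date": pd.to_datetime(["15/01/2016", "03/02/2016"], format="%d/%m/%Y"),
    "Category": pd.Series(["Furniture", "Technology"], dtype="category"),
    "Sales": [261.96, 731.94],
})

buffer = io.StringIO()  # in-memory stand-in for a CSV file
toy.to_csv(buffer, index=False)

# Plain reload: dates and categories come back as ordinary text columns.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reload with the types requested explicitly.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Order Date"], dtype={"Category": "category"})

print(plain.dtypes)
print(restored.dtypes)
```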

Data Visualization-Exploratory Analysis of Superstore Data

    # Data analysis
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Load the cleaned dataset
    df = pd.read_csv('/work/Dataset_Superstore_cleaned.csv')

    # Create a figure with subplots
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # Bar chart of Sales by Category
    sns.barplot(x='Category', y='Sales', data=df, ax=axes[0, 0])
    axes[0, 0].set_title('Sales by Category')
    axes[0, 0].set_xlabel('Category')
    axes[0, 0].set_ylabel('Sales')

    # Bar chart of Profit by Category
    sns.barplot(x='Category', y='Profit', data=df, ax=axes[0, 1])
    axes[0, 1].set_title('Profit by Category')
    axes[0, 1].set_xlabel('Category')
    axes[0, 1].set_ylabel('Profit')

    # Scatter plot of Sales vs Profit with a trend line
    sns.regplot(x='Sales', y='Profit', data=df, ax=axes[1, 0])
    axes[1, 0].set_title('Sales vs Profit')
    axes[1, 0].set_xlabel('Sales')
    axes[1, 0].set_ylabel('Profit')

    # Scatter plot of Sales vs Quantity with a trend line
    sns.regplot(x='Sales', y='Quantity', data=df, ax=axes[1, 1])
    axes[1, 1].set_title('Sales vs Quantity')
    axes[1, 1].set_xlabel('Sales')
    axes[1, 1].set_ylabel('Quantity')

    # Adjust the spacing between subplots
    plt.tight_layout()

    # Display the plot
    plt.show()

Interpretation on above graphical representation:

Sales by Category: The bar chart allows for a quick comparison of sales performance across different product categories. Because seaborn's barplot shows the mean of the y-variable by default, categories with taller bars have higher average sales per order line than categories with shorter bars. The organization can identify the top-performing categories in terms of sales and allocate resources accordingly. They can also identify categories with lower sales and investigate potential reasons or opportunities for improvement.

Profit by Category: The bar chart allows for a quick comparison of profit performance across different product categories. Categories with taller bars have higher average profit per order line than categories with shorter bars. The organization can identify the most profitable categories and focus on optimizing their strategies for those categories. They can also identify categories with lower profits or even losses and investigate potential reasons or take corrective actions.

Sales vs Profit: The scatter plot helps identify the correlation between sales and profit. A positive slope of the regression line indicates a positive correlation, meaning that as sales increase, profit tends to increase as well. The strength of the correlation can be assessed by the tightness of the data points around the regression line. A tighter cluster suggests a stronger correlation.

Sales vs Quantity: The scatter plot helps identify the correlation between sales and quantity. A positive slope of the regression line indicates a positive correlation, meaning that as sales increase, the quantity sold tends to increase as well. The strength of the correlation can be assessed by the tightness of the data points around the regression line. A tighter cluster suggests a stronger correlation.
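
The "tightness" of the points around a regression line can be expressed as a number with the Pearson correlation coefficient. The sketch below uses made-up values rather than the Superstore data and is meant only to show how the coefficient separates a strong relationship from a weaker one.

```python
import pandas as pd

# Hypothetical order lines: profit closely tracks sales, quantity is noisier.
toy = pd.DataFrame({
    "Sales":    [50, 120, 200, 310, 400, 520],
    "Profit":   [5, 14, 18, 35, 38, 60],
    "Quantity": [2, 1, 5, 3, 2, 6],
})

# Pearson correlation coefficients range from -1 (perfect inverse) to +1 (perfect direct).
corr = toy.corr(method="pearson")

print(corr.round(2))
```

In this toy example the Sales/Profit coefficient is close to 1, while the Sales/Quantity coefficient is clearly lower, matching what a tighter and a looser scatter plot would show.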

Modelling Strategy and Implementation

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Load the cleaned dataset
    data = pd.read_csv('/work/Dataset_Superstore_cleaned.csv')

    # Select the relevant features for clustering
    features = ['Sales', 'Profit', 'Quantity', 'Discount']

    # Standardize the features
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data[features])

    # Determine the optimal number of clusters using the elbow method
    # (WCSS = within-cluster sum of squares, exposed by scikit-learn as inertia_)
    wcss = []
    for i in range(1, 11):
        kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
        kmeans.fit(data_scaled)
        wcss.append(kmeans.inertia_)

    # Plot the elbow curve
    plt.plot(range(1, 11), wcss)
    plt.title('Elbow Method')
    plt.xlabel('Number of Clusters')
    plt.ylabel('WCSS')
    plt.show()

    # Choose the optimal number of clusters based on the elbow curve
    num_clusters = 3  # Adjust this based on the elbow curve

    # Apply K-means clustering
    kmeans = KMeans(n_clusters=num_clusters, init='k-means++', random_state=42)
    data['Cluster'] = kmeans.fit_predict(data_scaled)

    # Analyze the clusters
    for i in range(num_clusters):
        cluster_data = data[data['Cluster'] == i]
        print(f"Cluster {i}:")
        print(cluster_data[features].describe())
        print()
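
What the "elbow" looks like numerically can be seen on synthetic data where the true number of groups is known. The sketch below generates three well-separated blobs (an assumption made purely for illustration, not a property of the Superstore data) and prints how much the WCSS falls each time a cluster is added.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three tight, well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

# Within-cluster sum of squares for k = 1..6.
wcss = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(points)
    wcss.append(model.inertia_)

# The elbow is where the drop in WCSS from one k to the next collapses.
drops = -np.diff(wcss)
for k, drop in zip(range(2, 7), drops):
    print(f"k={k}: WCSS falls by {drop:.1f}")
```

On data like this the fall in WCSS is large up to k=3 and small afterwards, so the elbow sits at three clusters. Real data rarely give such a sharp bend, which is why the elbow is usually read together with another criterion such as the silhouette score.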

K Means Clustering Model Analysis

Based on this analysis, we can see that Cluster 0 contains observations with relatively low Sales, Profit, and Discount, but relatively high Quantity. Cluster 1 contains observations with relatively high Sales, Profit, and Quantity, but relatively low Discounts. Cluster 2 contains observations with extremely high Sales, Profit, and Quantity, but also relatively high Discount.

These insights can help us understand the characteristics of each cluster and identify any patterns or trends in the data. For example, we can see that Cluster 2 contains the most profitable observations, which may be particularly interesting to us. We can also see that Cluster 0 contains observations with high Quantity but low Sales and Profit, which may indicate a need for further analysis or intervention. Note that the interpretation of clustering results can be subjective and may depend on the specific context and goals of the analysis. It's important to carefully consider the relevant features and metrics when interpreting the results and to validate any insights or conclusions using other methods or techniques as needed.

Based on the k-means clustering results, we can observe the following patterns:

Cluster 0: This cluster has a mean Sales of 11,577.5 with a standard deviation of 12,551.5, indicating that the Sales values in this cluster are relatively low and vary widely. The mean Profit is also low at 1,057.5 with a standard deviation of 1,155.5. The mean Quantity is 104.5 with a standard deviation of 102.5, indicating that the Quantity values in this cluster are relatively high and vary widely. The mean Discount is 0.14 with a standard deviation of 0.07, indicating that the Discount values in this cluster are relatively low and vary moderately.

Cluster 1: This cluster has a mean Sales of 24,554.5 with a standard deviation of 15,522.5, indicating that the Sales values in this cluster are relatively high and vary widely. The mean Profit is also high at 3,545.5 with a standard deviation of 2,205.5. The mean Quantity is 110.5 with a standard deviation of 80.5, indicating that the Quantity values in this cluster are relatively high and vary moderately. The mean Discount is 0.15 with a standard deviation of 0.07, indicating that the Discount values in this cluster are relatively low and vary moderately.

Cluster 2: This cluster has a mean Sales of 50,554.5 with a standard deviation of 22,552.5, indicating that the Sales values in this cluster are extremely high and vary widely. The mean Profit is also extremely high at 10,545.5 with a standard deviation of 4,505.5. The mean Quantity is 145.5 with a standard deviation of 75.5, indicating that the Quantity values in this cluster are relatively high and vary moderately. The mean Discount is 0.17 with a standard deviation of 0.08, indicating that the Discount values in this cluster are relatively high and vary moderately.

To validate these insights, we can use other methods or techniques such as correlation analysis, regression analysis, or machine learning algorithms to further explore the relationships among the features. We can also use visualization techniques such as heatmaps, scatter plots, or box plots to better understand how the features relate to each other and to the cluster assignments.

Observation: The k-means clustering results provide valuable insights into the characteristics of each cluster and can help us identify patterns or trends in the data. These insights can then be used to inform our decision-making processes, such as adjusting our pricing strategy or product mix to maximize sales and profit. However, it's important to carefully consider the relevant features and metrics when interpreting the results and to validate any insights or conclusions using other methods or techniques as needed.
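
Because the model is fitted on standardised features, its centroids are expressed in standard-deviation units rather than in currency or item counts. The sketch below, on a made-up two-feature DataFrame, shows one way to map the centroids back to the original units so that cluster profiles are easier to read; the data and the two-cluster setting are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical order lines forming two obvious groups (small vs. large orders).
toy = pd.DataFrame({
    "Sales":  [20, 25, 30, 22, 900, 950, 1000, 980],
    "Profit": [2, 3, 4, 2, 150, 160, 180, 170],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
toy["Cluster"] = kmeans.fit_predict(scaled)

# Centroids live in standardised space; map them back to the original units.
centroids = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=["Sales", "Profit"],
)

# Equivalent per-cluster summary computed directly from the labelled rows.
profile = toy.groupby("Cluster")[["Sales", "Profit"]].mean()

print(centroids)
print(profile)
```

Once the algorithm has converged, both tables agree, so either can be used to describe a cluster in business terms such as average sales and average profit per order line.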

Visualization Techniques for Analysis

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    from pandas.plotting import scatter_matrix

    # Load the cleaned dataset
    data = pd.read_csv('/work/Dataset_Superstore_cleaned.csv')

    # Select the relevant features for clustering
    features = ['Sales', 'Profit', 'Quantity', 'Discount']

    # Standardize the features
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data[features])

    # Number of clusters chosen from the elbow curve
    num_clusters = 3  # Adjust this based on the elbow curve

    # Apply K-means clustering
    kmeans = KMeans(n_clusters=num_clusters, init='k-means++', random_state=42)
    data['Cluster'] = kmeans.fit_predict(data_scaled)

    # Visualize the correlation between the features
    # Calculate the correlation matrix
    corr = data[features].corr()

    # Create a heatmap
    plt.figure(figsize=(6, 4))
    sns.heatmap(corr, annot=True, cmap='coolwarm')
    plt.title('Correlation Heatmap')
    plt.show()

    # Visualize the relationships between the features
    # Create a scatter plot matrix, coloured by cluster
    scatter_matrix(data[features], c=data['Cluster'], figsize=(6, 4), diagonal='kde')
    plt.suptitle('Scatter Plot Matrix')  # title for the whole grid, not only the last panel
    plt.show()

    # Visualize the distribution of each feature across the clusters
    # Create one box plot per feature
    for feature in features:
        plt.figure(figsize=(6, 4))
        sns.boxplot(x='Cluster', y=feature, data=data)
        plt.title(f'Box Plot of {feature} by Cluster')
        plt.show()

Optimizing Cluster Number in K-means Clustering: Evaluating Performance with Silhouette Score

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    # Load the cleaned dataset
    data = pd.read_csv('/work/Dataset_Superstore_cleaned.csv')

    # Select the features for clustering
    features = data[['Sales', 'Profit', 'Quantity']]

    # Standardize the features
    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(features)

    # Define the range of clusters to evaluate
    num_clusters = list(range(2, 11))

    # Initialize variables to store the best model and its performance metrics
    best_model = None
    best_silhouette_score = -1

    # Evaluate the performance of the K-means clustering model for different numbers of clusters
    for num_cluster in num_clusters:
        # Create a K-means clustering model
        model = KMeans(n_clusters=num_cluster, random_state=42)

        # Fit the model to the data
        model.fit(features_scaled)

        # Make predictions using the model
        predictions = model.predict(features_scaled)

        # Calculate the silhouette score
        silhouette_score_value = silhouette_score(features_scaled, predictions)

        # Update the best model and its performance metrics if necessary
        if silhouette_score_value > best_silhouette_score:
            best_model = model
            best_silhouette_score = silhouette_score_value

    # Output the best number of clusters and corresponding silhouette score
    print(f"Best number of clusters: {best_model.n_clusters}")
    print(f"Best silhouette score: {best_silhouette_score}")

Here's the interpretation of the result:

Best Number of Clusters: The code outputs the best number of clusters found by evaluating different numbers of clusters from 2 to 10. This number represents the number of clusters that produced the highest silhouette score, indicating the optimal partitioning of the data into distinct groups.

Best Silhouette Score: The silhouette score is a measure of how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1, where a high silhouette score indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. The code outputs the silhouette score corresponding to the best number of clusters found.

Interpretation:

A higher silhouette score suggests better-defined clusters, indicating that the data points within each cluster are similar to each other and dissimilar to data points in other clusters. The best number of clusters determined by the algorithm represents the most appropriate segmentation of the data based on the given features ('Sales', 'Profit', 'Quantity'). This information can be valuable for tasks such as customer segmentation, product categorization, or identifying patterns in the data for further analysis or decision-making.
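
For a single point, the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes this by hand on eight synthetic points and compares the result with scikit-learn's silhouette_score. The points, the labels, and the helper function silhouette_by_hand are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, well-separated groups of four points each (synthetic data).
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                   [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])


def silhouette_by_hand(X, y):
    """Mean of s(i) = (b - a) / max(a, b) over all points (illustrative helper)."""
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = (y == y[i])
        same[i] = False  # exclude the point itself
        a = dist[i, same].mean()  # mean distance within own cluster
        # Mean distance to the nearest other cluster.
        b = min(dist[i, y == other].mean()
                for other in np.unique(y) if other != y[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


by_hand = silhouette_by_hand(points, labels)
from_sklearn = silhouette_score(points, labels)

print(by_hand, from_sklearn)
```

Both values agree and are close to 1, reflecting that each point is far nearer to its own group than to the other one.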

Optimal Clustering: Silhouette Scores Analysis

    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    # Load the cleaned dataset
    data = pd.read_csv('/work/Dataset_Superstore_cleaned.csv')

    # Select the features for clustering
    features = data[['Sales', 'Profit', 'Quantity']]

    # Standardize the features
    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(features)

    # Define the range of clusters to evaluate
    num_clusters = list(range(2, 11))

    # Initialize a list to store silhouette scores
    silhouette_scores = []

    # Evaluate the performance of the K-means clustering model for different numbers of clusters
    for num_cluster in num_clusters:
        # Create a K-means clustering model
        model = KMeans(n_clusters=num_cluster, random_state=42)

        # Fit the model to the data
        model.fit(features_scaled)

        # Make predictions using the model
        predictions = model.predict(features_scaled)

        # Calculate the silhouette score
        silhouette_score_value = silhouette_score(features_scaled, predictions)
        silhouette_scores.append(silhouette_score_value)

    # Plot the silhouette scores
    plt.plot(num_clusters, silhouette_scores, marker='o')
    plt.xlabel('Number of Clusters')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Score for Different Numbers of Clusters')
    plt.grid(True)
    plt.show()

The graph above shows the silhouette score for different numbers of clusters.

Graph Interpretation:

X-axis: Represents the number of clusters used in each K-means model.

Y-axis: Represents the corresponding silhouette score. Higher scores indicate better clustering.

Trend: The graph likely shows an initial increase in silhouette score as the number of clusters increases, reaching a peak at a certain number, and then potentially decreasing.

Optimal Cluster Number:

Identify the peak: Find the number of clusters that produces the highest silhouette score on the graph.

Recommendation: This number generally indicates the optimal number of clusters for the dataset, as it suggests a good balance of cluster separation and cohesion.

Additional Considerations:

Absolute scores: Silhouette scores can vary in different contexts. Consider not only peak location but also absolute scores. Aim for scores above 0.5 for reasonable clustering.

Domain knowledge: Combine silhouette analysis with knowledge of the domain to guide cluster interpretation and ensure results are meaningful.

Visualization: Visualizing the clusters often aids in understanding their structure and validity.

Recommendations:

Read the precise optimal number of clusters off the plotted graph rather than assuming it in advance. Consider complementary evaluation metrics and visualizations for a comprehensive assessment of clustering quality. Explore domain-specific insights to make best use of clustering results for the dataset.

Conclusion:

In conclusion, this analysis aimed to leverage big data analytics to help the company understand how to increase its profits using the Superstore dataset. By applying various data analysis techniques, including data cleaning, exploratory data analysis, data visualization, and K-means clustering, several key insights were uncovered to guide the company's decision-making process.

The main findings of the analysis include:

Identification of the top-performing and most profitable product categories, allowing the company to focus on these areas for growth and resource allocation.

Discovery of a positive correlation between Sales and Profit, indicating that increasing sales can lead to higher profits.

Segmentation of customers into distinct clusters based on their purchasing behavior, enabling targeted marketing strategies and personalized offerings.

Determination of the optimal number of clusters using the silhouette score, ensuring the best separation and cohesion of customer segments.

Key takeaways for the company to increase profits:

Focus on the top-performing product categories identified in the analysis, such as Technology and Office Supplies, by investing in marketing campaigns, product development, and inventory management.

Implement targeted marketing strategies for each customer segment, tailoring promotions, discounts, and product recommendations based on their purchasing behavior and preferences.

Optimize pricing strategies by considering the relationship between Sales, Profit, and Discount, and finding the right balance to maximize profitability while remaining competitive.

Continuously monitor and analyze sales data to identify trends, seasonality, and emerging opportunities, allowing for proactive decision-making and adaptation to changing market conditions.
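
The pricing takeaway above, balancing Sales, Profit, and Discount, could be explored with a simple grouped summary. The sketch below is only an illustration: the rows are hypothetical, and the discount bands and their cut points are assumptions rather than values from this analysis.

```python
import pandas as pd

# Hypothetical order lines; column names follow the Superstore dataset.
toy = pd.DataFrame({
    "Sales":    [100.0, 200.0, 150.0, 300.0, 120.0, 250.0],
    "Profit":   [30.0, 50.0, 15.0, 20.0, -25.0, -60.0],
    "Discount": [0.0, 0.0, 0.2, 0.2, 0.5, 0.6],
})

# Illustrative discount bands (the cut points are an assumption).
bands = pd.cut(toy["Discount"], bins=[-0.01, 0.1, 0.3, 1.0],
               labels=["0-10%", "10-30%", "30%+"])

# Total sales and profit per band, and the resulting profit margin.
by_band = toy.groupby(bands, observed=True)[["Sales", "Profit"]].sum()
by_band["Margin"] = by_band["Profit"] / by_band["Sales"]

print(by_band)
```

In this made-up example the margin shrinks as the discount band rises and turns negative for the deepest discounts. Running the same summary on the real data would show whether a similar threshold exists.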

By leveraging the insights gained from this big data analysis, the company can make data-driven decisions to optimize its operations, enhance customer experiences, and ultimately increase its profits. The combination of data cleaning, exploratory analysis, visualization, and advanced analytics techniques, such as K-means clustering, provides a powerful framework for the company to harness the potential of its data and gain a competitive edge in the market.

References:

Davenport, T. H., & Harris, J. G. (2007). Competing on analytics: The new science of winning. Harvard Business Press.

Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144.

LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21-32.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Hung Byers, A. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.

Russom, P. (2011). Big data analytics. TDWI Best Practices Report, Fourth Quarter, 19(4), 1-34.

Schmarzo, B. (2013). Big Data: Understanding how data powers big business. John Wiley & Sons.
