Big Data Analytics Course Work 2
Student Name: Sai Prakash Trasula
HWU ID: H00449353
Date: 17-03-2024
Maximising Profits Through Big Data Analytics
Introduction
In today's data-driven business landscape, organisations are increasingly turning to big data analytics to gain a competitive edge and increase their profitability. Big data analytics is the process of collecting, organising, and analysing large volumes of structured and unstructured data to uncover hidden patterns, correlations, and insights that can inform strategic decision-making (Russom, 2011). By leveraging big data analytics, organisations can optimise their operations, enhance customer experiences, and identify new revenue streams, ultimately leading to increased profits.
Furthermore, big data analytics can enable organisations to identify new revenue streams and business opportunities. By analysing data from various sources, organisations can uncover untapped market segments, identify cross-selling and up-selling opportunities, and develop new products and services that cater to emerging customer needs (Schmarzo, 2013). For example, by analysing customer data, a retailer may discover that a particular product category is in high demand among a specific demographic, prompting them to expand their offerings and capture a larger market share, ultimately leading to increased profits.
For example, a product-selling organisation can use K-means clustering to segment its customers into different groups based on their purchasing behaviour. By analysing the characteristics of each group, the organisation can tailor its marketing and sales strategies to target each group effectively. Additionally, the organisation can use linear regression to predict sales revenue based on factors such as price, advertising spend, and product features. By optimising these factors, the organisation can increase its revenue and profitability.
Importing Libraries:
import pandas as pd: This imports the pandas library and assigns it the alias 'pd', which is a common convention. Pandas is a powerful library for data manipulation and analysis in Python.
Reading Data: Data = pd.read_csv('Dataset_Superstore.csv'): This line reads the CSV file named 'Dataset_Superstore.csv' and stores it in a pandas DataFrame named Data. The read_csv() function reads CSV files into DataFrame objects. Once this code has run, the dataset is loaded into the DataFrame Data and is ready for data cleaning, exploration, visualization, and modeling.
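As a minimal sketch of this loading step, the example below reads CSV text from an in-memory buffer so that it runs without the original file. The two rows and the column names are hypothetical stand-ins for 'Dataset_Superstore.csv', not values taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for 'Dataset_Superstore.csv';
# the column names are assumptions made for illustration.
csv_text = io.StringIO(
    "Order ID,Order Date,Sales,Profit\n"
    "CA-001,2017-01-03,100.50,20.10\n"
    "CA-002,2017-01-04,250.00,-5.25\n"
)

# read_csv accepts a file path or any file-like object.
Data = pd.read_csv(csv_text)
print(Data.head())
```

With the real file, the buffer would simply be replaced by the file name passed to read_csv().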
Data Summary:
By using Data.info(), you gain a preliminary understanding of the data's structure, data types, and potential quality issues. This initial exploration paves the way for further data cleaning, manipulation, and analysis using Pandas or other data analysis tools.
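The sketch below shows one way to capture the output of info() on a small hypothetical frame. The column names and values are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical sample rows; not taken from the Superstore dataset.
Data = pd.DataFrame({
    "Order ID": ["CA-001", "CA-002", "CA-003"],
    "Sales": [100.5, 250.0, 75.25],
    "Quantity": [2, 5, 1],
})

# info() prints to stdout by default; passing a buffer captures the text.
buffer = io.StringIO()
Data.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

The printed summary lists each column with its non-null count and data type, which is usually enough to spot columns that need conversion.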
Formatting Data
These conversion operations are useful for ensuring that the data types of the columns are appropriate for the subsequent analysis. For example, converting categorical columns to the 'category' data type can save memory and speed up certain operations, while converting date columns to 'datetime' data type allows for easier manipulation and analysis of dates. Overall, this code prepares the dataset Data for further analysis by ensuring that the data types of the columns are correctly specified.
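A minimal sketch of such conversions is given below. The columns "Order Date" and "Ship Mode" are assumed names for a date column and a categorical column; the columns actually converted in the original code are not shown in this report.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumptions for illustration.
Data = pd.DataFrame({
    "Order Date": ["2017-01-03", "2017-02-14"],
    "Ship Mode": ["Second Class", "Standard Class"],
    "Sales": [100.5, 250.0],
})

# Parse text dates into datetime64 values.
Data["Order Date"] = pd.to_datetime(Data["Order Date"], format="%Y-%m-%d")

# Store a repeated label column as a memory-efficient categorical.
Data["Ship Mode"] = Data["Ship Mode"].astype("category")

print(Data.dtypes)
```

Once the date column is a datetime type, accessors such as Data["Order Date"].dt.month become available for time-based analysis.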
Cleaning of Data
The code Data.isnull().sum() checks for missing (null) values in each column of the DataFrame Data.
Checking for missing values is an essential step in data cleaning and preprocessing, as it helps identify data quality issues and decide how to handle missing data appropriately, such as imputation or removal.
The above shows that there are no missing values in any of the columns of the DataFrame Data.
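For reference, the same check can be sketched on a small hypothetical frame that, unlike the Superstore data, deliberately contains one missing value so the effect is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Sales value, for illustration only.
Data = pd.DataFrame({
    "Sales": [100.5, np.nan, 75.25],
    "Profit": [20.1, -5.25, 3.0],
})

# isnull() marks missing cells as True; sum() counts them per column.
missing_per_column = Data.isnull().sum()
print(missing_per_column)
```

A column with a non-zero count would then need a decision between imputation and removal.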
Finding Difference or Unique IDs in Main Columns
The code snippet checks the uniqueness of specific columns in the DataFrame named Data.
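A uniqueness check of this kind might look like the sketch below. "Row ID" and "Order ID" are assumed column names; the columns inspected in the original code are not listed in this report.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumptions for illustration.
Data = pd.DataFrame({
    "Row ID": [1, 2, 3],
    "Order ID": ["CA-001", "CA-001", "CA-002"],
})

columns_to_check = ["Row ID", "Order ID"]

# nunique() counts distinct values per column.
unique_counts = Data[columns_to_check].nunique()

for column in columns_to_check:
    # is_unique is True only when no value repeats.
    print(column, unique_counts[column], Data[column].is_unique)
```

A column whose distinct count equals the number of rows can serve as a row identifier, while a lower count indicates repeated values.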
Finding Difference in Shipment
The below code provides a practical approach to handling duplicate records in the dataset, enhancing data quality and integrity for subsequent analysis and applications.
This syntax helps identify duplicate records in the dataset.
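One common way to flag duplicates in pandas is duplicated(); the sketch below applies it to hypothetical rows. Whether the original code compared whole rows or only selected shipment columns is not shown, so comparing whole rows is an assumption here.

```python
import pandas as pd

# Hypothetical sample rows; the first two are exact duplicates.
Data = pd.DataFrame({
    "Order ID": ["CA-001", "CA-001", "CA-002"],
    "Ship Mode": ["First Class", "First Class", "Standard Class"],
    "Sales": [100.5, 100.5, 250.0],
})

# keep=False marks every member of a duplicate group, not only the repeats.
duplicate_mask = Data.duplicated(keep=False)
duplicate_rows = Data[duplicate_mask]
print(duplicate_rows)
```

Passing subset=[...] to duplicated() would restrict the comparison to chosen columns.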
Removing Duplicate Data
This code snippet serves to cleanse the dataset by removing duplicate records, thereby improving data quality and facilitating more accurate analysis and decision-making.
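A minimal sketch of duplicate removal, assuming whole-row duplicates and hypothetical sample rows:

```python
import pandas as pd

# Hypothetical sample rows; the first two are exact duplicates.
Data = pd.DataFrame({
    "Order ID": ["CA-001", "CA-001", "CA-002"],
    "Sales": [100.5, 100.5, 250.0],
})

rows_before = len(Data)

# Keep the first occurrence of each duplicated row and renumber the index.
Data = Data.drop_duplicates().reset_index(drop=True)

print(f"Removed {rows_before - len(Data)} duplicate row(s)")
```

Comparing the row count before and after gives a quick record of how much data the step discarded.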
Calculating Numerical Summary statistics
This code helps to understand the central tendency and dispersion of the numerical data in the DataFrame, providing insights into the distribution of the data and identifying potential outliers or patterns.
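Summary statistics of this kind are typically produced with describe(); the sketch below assumes that method was used and runs it on hypothetical values.

```python
import pandas as pd

# Hypothetical sample rows; values are for illustration only.
Data = pd.DataFrame({
    "Sales": [10.0, 20.0, 30.0, 40.0],
    "Quantity": [1, 2, 3, 4],
    "Segment": ["Consumer", "Corporate", "Consumer", "Home Office"],
})

# By default describe() summarises numeric columns only:
# count, mean, std, min, quartiles, and max.
summary = Data.describe()
print(summary)
```

A large gap between the mean and the median (the 50% row), or a maximum far above the 75% quartile, is an early hint of skew or outliers.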
Outlier Detection using 3 Standard Deviations Threshold
This code snippet provides a systematic approach to identifying outliers in numerical data, which is essential for data quality assessment and analysis.
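The three-standard-deviation rule can be sketched as follows. The Sales values are hypothetical: nineteen ordinary values and one extreme value chosen so that exactly one row exceeds the threshold.

```python
import pandas as pd

# Hypothetical Sales values: 90..108 plus one extreme observation.
Data = pd.DataFrame({"Sales": [float(v) for v in range(90, 109)] + [5000.0]})

mean = Data["Sales"].mean()
std = Data["Sales"].std()  # sample standard deviation (ddof=1)

# z-score: distance from the mean in units of standard deviation.
z_scores = (Data["Sales"] - mean) / std

# Flag rows lying more than 3 standard deviations from the mean.
outliers = Data[z_scores.abs() > 3]
print(outliers)
```

Because extreme values inflate the mean and standard deviation themselves, this rule can miss outliers in very small samples; it is most useful on datasets with many rows.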
The syntax below is used for saving the sheet.
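A minimal sketch of the save step, assuming the sheet is written as a CSV file. The file name 'Superstore_Cleaned.csv' and the temporary directory are hypothetical; the original output name and format are not shown in this report.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned rows; values are for illustration only.
Data = pd.DataFrame({
    "Order ID": ["CA-001", "CA-002"],
    "Sales": [100.5, 250.0],
})

# Hypothetical output location inside a temporary directory.
output_path = os.path.join(tempfile.mkdtemp(), "Superstore_Cleaned.csv")

# index=False avoids writing the row index as an extra column.
Data.to_csv(output_path, index=False)

# Read the file back to confirm that the save succeeded.
reloaded = pd.read_csv(output_path)
print(reloaded)
```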
Data Visualization-Exploratory Analysis of Superstore Data
Interpretation of the above graphical representation:
Modelling Strategy and Implementation
K Means Clustering Model Analysis
Based on this analysis, we can see that Cluster 0 contains observations with relatively low Sales, Profit, and Discount, but relatively high Quantity. Cluster 1 contains observations with relatively high Sales, Profit, and Quantity, but relatively low Discounts. Cluster 2 contains observations with extremely high Sales, Profit, and Quantity, but also relatively high Discount.
These insights can help us understand the characteristics of each cluster and identify any patterns or trends in the data. For example, we can see that Cluster 2 contains the most profitable observations, which may be particularly interesting to us. We can also see that Cluster 0 contains observations with high Quantity but low Sales and Profit, which may indicate a need for further analysis or intervention. Note that the interpretation of clustering results can be subjective and may depend on the specific context and goals of the analysis. It's important to carefully consider the relevant features and metrics when interpreting the results and to validate any insights or conclusions using other methods or techniques as needed.
Based on the k-means clustering results, we can observe the following patterns:
Cluster 0: This cluster has a mean Sales of 11,577.5 with a standard deviation of 12,551.5, indicating that the Sales values in this cluster are relatively low and vary widely. The mean Profit is also low at 1,057.5 with a standard deviation of 1,155.5. The mean Quantity is 104.5 with a standard deviation of 102.5, indicating that the Quantity values in this cluster are relatively high and vary widely. The mean Discount is 0.14 with a standard deviation of 0.07, indicating that the Discount values in this cluster are relatively low and vary moderately.
Cluster 1: This cluster has a mean Sales of 24,554.5 with a standard deviation of 15,522.5, indicating that the Sales values in this cluster are relatively high and vary widely. The mean Profit is also high at 3,545.5 with a standard deviation of 2,205.5. The mean Quantity is 110.5 with a standard deviation of 80.5, indicating that the Quantity values in this cluster are relatively high and vary moderately. The mean Discount is 0.15 with a standard deviation of 0.07, indicating that the Discount values in this cluster are relatively low and vary moderately.
Cluster 2: This cluster has a mean Sales of 50,554.5 with a standard deviation of 22,552.5, indicating that the Sales values in this cluster are extremely high and vary widely. The mean Profit is also extremely high at 10,545.5 with a standard deviation of 4,505.5. The mean Quantity is 145.5 with a standard deviation of 75.5, indicating that the Quantity values in this cluster are relatively high and vary moderately. The mean Discount is 0.17 with a standard deviation of 0.08, indicating that the Discount values in this cluster are relatively high and vary moderately.
To validate these insights, we can use other methods or techniques such as correlation analysis, regression analysis, or machine learning algorithms to further explore the relationships between the features and the target variable. We can also use visualization techniques such as heatmaps, scatter plots, or box plots to better understand the relationships between the features and the target variable.
Observation: The k-means clustering results provide valuable insights into the characteristics of each cluster and can help us identify patterns or trends in the data. These insights can then be used to inform our decision-making processes, such as adjusting our pricing strategy or product mix to maximize sales and profit. However, it's important to carefully consider the relevant features and metrics when interpreting the results and to validate any insights or conclusions using other methods or techniques as needed.
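The per-cluster means and standard deviations discussed above can be produced along the lines of the sketch below. It assumes scikit-learn's KMeans with three clusters on standardised Sales, Profit, Quantity, and Discount; the data are synthetic, and the scaling step is an assumption, since the original modelling code is not shown in this report.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ["Sales", "Profit", "Quantity", "Discount"]

# Synthetic data: three well-separated groups of 30 rows each.
rng = np.random.default_rng(0)
group_centres = [(100, 10, 2, 0.10), (1000, 150, 5, 0.15), (5000, 900, 10, 0.20)]
frames = []
for sales, profit, quantity, discount in group_centres:
    frames.append(pd.DataFrame({
        "Sales": rng.normal(sales, sales * 0.05, 30),
        "Profit": rng.normal(profit, profit * 0.05, 30),
        "Quantity": rng.normal(quantity, 0.2, 30),
        "Discount": rng.normal(discount, 0.005, 30),
    }))
Data = pd.concat(frames, ignore_index=True)

# Standardise so that large-valued columns such as Sales do not dominate.
scaled = StandardScaler().fit_transform(Data[features])

# Fit K-means with three clusters and attach the labels to the data.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
Data["Cluster"] = model.fit_predict(scaled)

# Mean and standard deviation of each feature within each cluster.
profile = Data.groupby("Cluster")[features].agg(["mean", "std"])
print(profile)
```

Cluster labels are arbitrary, so the cluster numbered 0 in one run need not correspond to the same group in another run unless the random seed is fixed.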
Visualization Techniques for Analysis
Optimizing Cluster Number in K-means Clustering: Evaluating Performance with Silhouette Score
Optimal Clustering: Silhouette Scores Analysis
Graph Interpretation:
X-axis: Represents the number of clusters used in each K-means model.
Y-axis: Represents the corresponding silhouette score. Higher scores indicate better clustering.
Trend: The graph likely shows an initial increase in silhouette score as the number of clusters increases, reaching a peak at a certain number, and then potentially decreasing.
Optimal Cluster Number:
Identify the peak: Find the number of clusters that produces the highest silhouette score on the graph.
Recommendation: This number generally indicates the optimal number of clusters for the dataset, as it suggests a good balance of cluster separation and cohesion.
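The peak-finding procedure can be sketched as follows, assuming scikit-learn's silhouette_score and a synthetic dataset with three well-separated groups. The data, the candidate range of 2 to 6 clusters, and the variable names are all assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: 150 points around three well-separated centres.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit one K-means model per candidate cluster count and score it.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest silhouette score is the peak.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 3))
print("Best number of clusters:", best_k)
```

Plotting scores against the number of clusters would reproduce the kind of graph described above, with the peak marking the suggested cluster count.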
Additional Considerations:
Absolute scores: Silhouette scores can vary in different contexts. Consider not only peak location but also absolute scores. Aim for scores above 0.5 for reasonable clustering.
Domain knowledge: Combine silhouette analysis with knowledge of the domain to guide cluster interpretation and ensure results are meaningful.
Visualization: Visualizing the clusters often aids in understanding their structure and validity.
Recommendations:
Share the specific graph for visual analysis and a precise number of optimal clusters. Consider complementary evaluation metrics and visualizations for a comprehensive assessment of clustering quality. Explore domain-specific insights to make best use of clustering results for the dataset.
Conclusion:
The main findings of the analysis include:
Key takeaways for the company to increase profits:
By leveraging the insights gained from this big data analysis, the company can make data-driven decisions to optimize its operations, enhance customer experiences, and ultimately increase its profits. The combination of data cleaning, exploratory analysis, visualization, and advanced analytics techniques, such as K-means clustering, provides a powerful framework for the company to harness the potential of its data and gain a competitive edge in the market.
References:
Davenport, T. H., & Harris, J. G. (2007). Competing on analytics: The new science of winning. Harvard Business Press.
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21-32.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Hung Byers, A. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
Russom, P. (2011). Big data analytics. TDWI Best Practices Report, Fourth Quarter, 19(4), 1-34.
Schmarzo, B. (2013). Big Data: Understanding how data powers big business. John Wiley & Sons.