INCREASING PROFITS THROUGH DATA ANALYTICS
BIG DATA ANALYTICS-C11BD
INTRODUCTION
The document "Increasing Profits through Data Analytics" offers a detailed examination of the dataset_Superstore.csv using a variety of data analytics methods. Its main goal is to pinpoint the factors that influence the company's profitability and offer data-driven insights for strategic decision-making. It underscores the significance of data analytics in today's business landscape, where informed decisions based on data are crucial for operational enhancement, revenue growth, and competitive advantage.
This document showcases how data analytics techniques are practically applied to real-world business situations, serving as a valuable tool for understanding how data-driven insights can bolster a company's expansion and enhance long-term profitability in a competitive market. The analysis delves into uncovering trends and patterns that impact profitability through data cleaning, summary statistics examination, and visualization creation. Moreover, modeling techniques are utilized to identify the primary drivers of profitability and provide recommendations based on the analysis results.
Data analytics lets businesses extract insights from their data, make strategic decisions, and use resources effectively. This report evaluates the modeling strategies carefully and interprets the results in detail, applying them to real business challenges.
METHODOLOGY
To ensure the accuracy and reliability of the findings, the analysis follows a structured methodology. The initial step involves data cleaning to ensure the quality and consistency of the dataset by identifying and eliminating outliers and data entry errors. This step is crucial as it lays the groundwork for precise and meaningful insights.
Subsequently, summary statistics are computed for the cleaned data to describe its distribution, variability, and composition. Relationships between variables are then explored through visualizations: bar charts for categorical variables and scatter plots for continuous variables.
The primary modeling technique employed in the analysis is K-means clustering, which is used to identify customer segments based on their purchasing behavior and to assess each cluster's profitability. The elbow method is used to determine a suitable number of clusters, while the Davies-Bouldin Index and Silhouette Score are used to evaluate the clustering outcomes. Additionally, the Kruskal-Wallis test, a rank-based non-parametric test, is performed to check whether profit differs significantly among clusters, which indicates how much customer segmentation matters for understanding profitability trends.
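The Kruskal-Wallis step can be illustrated with a minimal sketch. The profit values below are synthetic and the three clusters are hypothetical; they stand in for the per-cluster profit columns of the real dataset, which are not reproduced here.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Synthetic profit samples for three hypothetical clusters (assumption:
# in the real analysis these would be the Profit values grouped by label).
profit_by_cluster = {
    0: rng.normal(loc=10.0, scale=5.0, size=200),
    1: rng.normal(loc=30.0, scale=5.0, size=200),
    2: rng.normal(loc=60.0, scale=5.0, size=200),
}

# The test compares the rank distributions of the groups, so it does not
# assume normally distributed profit.
h_stat, p_value = kruskal(*profit_by_cluster.values())
print(f"H = {h_stat:.2f}, p = {p_value:.3g}")
```

A small p-value suggests that at least one cluster's profit distribution differs from the others; it does not say which one, so pairwise follow-up tests would be needed for that.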
IMPORTING FUNCTIONS
We begin by importing the libraries and functions needed to analyze the given data.
These libraries and tools support data analysis, model construction, and performance evaluation, making the work simpler and more effective.
IMPORTING THE GIVEN DATA (dataset_Superstore.csv)
DISPLAYING THE FULL DATASET
DISPLAYING THE DATASET SUMMARY
DATA TYPE CONVERSION AND COLUMN CLEANING
PLOTTING AND REPORTING MISSING VALUES IN THE DATA
CHECKING UNIQUENESS OF ROW IDs AND ACCESSING ORDER IDs
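A minimal sketch of this check, using a tiny hand-made table. The column names "Row ID" and "Order ID" are assumptions based on the heading above; the real file may label them differently.

```python
import pandas as pd

# Tiny illustrative table (not the real data).
df = pd.DataFrame({
    "Row ID": [1, 2, 3, 4],
    "Order ID": ["CA-001", "CA-001", "US-002", "CA-003"],
})

# Row IDs should identify each record uniquely.
row_ids_unique = df["Row ID"].is_unique

# Order IDs may repeat, because one order can contain several line items.
n_orders = df["Order ID"].nunique()
items_per_order = df.groupby("Order ID").size()

print(row_ids_unique, n_orders)
print(items_per_order)
```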
HANDLING MISSING VALUES AND DETECTING OUTLIERS
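The sketch below shows one common way to carry out this step: median imputation followed by the 1.5 x IQR rule. Both choices are assumptions, since the report does not state which imputation or outlier rule was used, and the values and the helper name `iqr_outlier_mask` are illustrative only.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one missing value and one extreme value.
df = pd.DataFrame({
    "Sales": [10.0, 12.0, 11.0, np.nan, 13.0, 12.5, 500.0],
    "Quantity": [2, 3, 2, 4, 3, 2, 3],
})

# Assumption: fill missing Sales with the column median.
df["Sales"] = df["Sales"].fillna(df["Sales"].median())


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


sales_outliers = iqr_outlier_mask(df["Sales"])
print(df[sales_outliers])
```

Whether flagged rows should be dropped, capped, or kept depends on whether they are data entry errors or genuine large orders.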
PLOTTING OUTLIERS FOR SALES & QUANTITY
CLEANING AND ANALYSING THE DATA
SAVING AND LOADING THE CLEANED DATA FILE
SUMMARY STATISTICS AND ANALYSIS OF CLEANED DATA
DATA VERIFICATION AFTER CLEANING
BOXPLOT AFTER CLEANING THE DATA
Boxplot of Cleaned Data: The first visualization is a boxplot showing how the cleaned dataset's numerical features are distributed. It sheds light on each feature's central tendency, dispersion, and outliers. The x-axis shows the numerical features, while the y-axis shows the range of values for each feature. Any outliers appear as individual data points beyond the boxplot's whiskers.
PLOTTING TOTAL PROFIT AND CATEGORICAL BAR CHARTS
Total Profit by Product Category: This bar chart shows the total profit produced by each product category. The product categories are shown on the x-axis and the total profit on the y-axis, so the height of each bar represents the total profit for that category. The bars are colored light blue, and the x-axis labels are rotated by 45 degrees for easier reading. The chart is titled "Total Profit by Product Category."
Total Profit by Product Sub-Category: This bar chart shows the total profit made by each product sub-category. As in the previous chart, the total profit is shown on the y-axis, while the sub-categories are shown on the x-axis, and the height of each bar represents the total profit for that sub-category. For distinction, the bars are colored light green, and the x-axis labels are rotated by 90 degrees to keep them readable. The chart is titled "Total Profit by Product Sub-Category."
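The aggregation behind such a chart can be sketched as follows. The rows are invented for illustration, and the column names "Category" and "Profit" are assumed to match the dataset; the styling follows the description above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows standing in for the cleaned dataset.
df = pd.DataFrame({
    "Category": ["Furniture", "Technology", "Furniture",
                 "Office Supplies", "Technology"],
    "Profit": [20.0, 150.0, -5.0, 30.0, 50.0],
})

# Sum profit within each category, largest first.
profit_by_category = (
    df.groupby("Category")["Profit"].sum().sort_values(ascending=False)
)

fig, ax = plt.subplots()
ax.bar(profit_by_category.index, profit_by_category.values, color="lightblue")
ax.set_title("Total Profit by Product Category")
ax.set_xlabel("Category")
ax.set_ylabel("Total Profit")
ax.tick_params(axis="x", labelrotation=45)
fig.tight_layout()
```

The sub-category chart would follow the same pattern with the grouping column swapped and a 90-degree label rotation.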
PLOTTING SCATTER PLOTS FOR CONTINUOUS DATA
Scatter Plot with Regression Line:
-> Shows data points scattered on a plot
-> A straight line is fitted through the points to show the overall trend
-> The x-axis represents one variable (X), and the y-axis represents another variable (Y)
-> The line indicates the relationship between X and Y
Scatter Plot with Categorical Colors:
-> Data points are colored differently based on categories (A, B, C)
-> Each color represents a different category
-> Helps visualize patterns across different categories
Scatter Plot with Varying Point Sizes:
-> Point sizes vary based on a third variable
-> Larger points represent higher values of the third variable
-> Shows the relationship between two variables while incorporating a third variable
Sales vs. Profit Scatter Plot with Trend Line:
-> Specific plot showing the relationship between Sales (x-axis) and Profit (y-axis)
-> A trend line is fitted to the data points using regression
-> Helps understand the overall direction and strength of the relationship between Sales and Profit
In summary, these scatter plots visualize relationships between variables, with additional features like regression lines, categorical colors, and varying point sizes to provide more insights into the data.
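As a rough sketch of how a Sales vs. Profit trend line can be fitted, the example below uses an ordinary least-squares line from NumPy on synthetic data. The 0.2 slope and the noise level are arbitrary assumptions, not properties of the Superstore data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic Sales and Profit values (assumption: roughly linear with noise).
sales = rng.uniform(10, 1000, size=300)
profit = 0.2 * sales + rng.normal(0, 20, size=300)

# Degree-1 polynomial fit gives the slope and intercept of the trend line.
slope, intercept = np.polyfit(sales, profit, deg=1)

# Pearson correlation summarizes the strength of the linear relationship.
r = np.corrcoef(sales, profit)[0, 1]

# Points on the fitted line, ready to be drawn over the scatter plot.
x_line = np.linspace(sales.min(), sales.max(), 100)
y_line = slope * x_line + intercept

print(f"slope = {slope:.3f}, intercept = {intercept:.2f}, r = {r:.3f}")
```

The slope gives the direction of the relationship and the correlation its strength, which are the two things the trend line is meant to convey.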
K-MEANS CLUSTERING MODEL
First, the dataset's numerical features are standardized using scikit-learn's StandardScaler. K-means assigns points to clusters by Euclidean distance, so it is sensitive to feature magnitudes: a feature with a large range would otherwise dominate the result. Standardization puts all features on a comparable scale (zero mean, unit variance). This procedure follows accepted preprocessing practice for machine learning (Pedregosa et al., 2011).
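A minimal sketch of the standardization step on a toy matrix (the numbers are illustrative; the real feature matrix would hold the dataset's numerical columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales.
X = np.array([
    [100.0, 1.0],
    [200.0, 2.0],
    [300.0, 3.0],
    [400.0, 4.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean 0 and standard deviation 1.
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```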
After that, the elbow method is used to choose the number of clusters. The inertia (within-cluster sum of squares) is calculated and plotted for a range of cluster counts. The "elbow" is the point beyond which adding more clusters yields only small reductions in inertia; it suggests a suitable cluster count and gives a sense of the dataset's intrinsic structure. This method is a well-known heuristic for cluster count selection in K-means clustering (Thorndike, 1953).
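The elbow computation can be sketched as below. The data are synthetic blobs with three well-separated centers chosen for illustration, so the elbow at k = 3 is built in; real data rarely shows such a clean bend.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three clearly separated groups (assumption).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit K-means for several cluster counts and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    model.fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against k and looking for the bend is the usual next step; the choice remains a visual judgment rather than a formal test.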
K-means is then run with the selected cluster count and the k-means++ initialization technique, assigning a cluster label to each data point. According to Arthur and Vassilvitskii (2007), this seeding method improves convergence and reduces the effect of poor initial centroids, which tends to give more robust clustering outcomes.
After clustering, the cluster characteristics are analyzed by calculating the average value of each numerical feature within each cluster. These averages summarize the feature distribution per cluster and help distinguish the clusters' traits. Such analyses are central to exploratory data analysis and help explain cluster behaviors (Tukey, 1977).
Plotted trends over time, profit distribution histograms, and scatter plots that show cluster separations in feature space are examples of visualizations that are essential to the cluster characterization process. These graphic depictions aid in intuitive interpretation and offer useful insights into cluster dynamics (Wickham, 2010).
IDENTIFYING CLUSTER CENTROIDS USING KMEANS CLUSTERING
Initializing KMeans: The code first initializes the KMeans object with the desired number of clusters (n_clusters=5), the initialization method for centroids (init='k-means++'), and a fixed random seed for reproducibility (random_state=42).
Fitting the Model: The KMeans object is then fitted to the standardized data (X_scaled). This process groups data points into clusters based on their similarity and iteratively updates the cluster centroids until convergence.
Getting Cluster Centroids: After fitting the model, the code retrieves the centroids of the clusters using the cluster_centers_ attribute. Each centroid is the mean position of the data points within its cluster.
Printing Cluster Centroids: Finally, the code prints the cluster centroids, giving a summary of each cluster's characteristics. Because the model is fitted on X_scaled, these centroids are expressed in the standardized feature space; they can be mapped back to the original units with the scaler's inverse transform.
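The four steps above can be sketched as follows, using the parameters named in the text (n_clusters=5, init='k-means++', random_state=42). The data are synthetic blobs, and the explicit n_init=10 and the inverse-transform step are additions for illustration rather than details confirmed by the report.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical features (assumption).
X, _ = make_blobs(n_samples=500, centers=5, n_features=2, random_state=0)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Initialize and fit K-means on the standardized data.
kmeans = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Centroids as stored by the model: standardized feature space.
centroids_scaled = kmeans.cluster_centers_

# Optional: map the centroids back to the original units for reporting.
centroids_original = scaler.inverse_transform(centroids_scaled)

print(centroids_scaled)
print(centroids_original)
```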
SILHOUETTE SCORE, DAVIES-BOULDIN INDEX, AND CLUSTER SEPARATION
1. Silhouette Score: The Silhouette Score measures how well each data point fits into its assigned cluster compared to other clusters. It takes into account both the cohesion within a cluster and the separation between different clusters. The score ranges from -1 to 1, where higher values indicate better clustering. In our case, the Silhouette Score is approximately 0.1616. While this value is positive, indicating that points are on average closer to their own cluster than to neighboring ones, it is relatively low. A score closer to 1 would suggest a more clearly separated clustering solution.
2. Davies-Bouldin Index: The Davies-Bouldin Index is another metric used to evaluate the quality of clustering. It measures the average similarity between each cluster and its most similar cluster, where similarity compares the scatter within clusters to the distance between their centroids. Lower values indicate better clustering. In our case, the Davies-Bouldin Index is approximately 1.78.
3. Visual Inspection of Cluster Separation: To gain a visual understanding of the clustering results, a scatter plot is created. Each data point is represented by a dot, and the color of the dot corresponds to its assigned cluster. Additionally, the cluster centroids are marked with red 'x' markers. This visualization allows you to assess the separation and compactness of the clusters in the feature space.
In summary, the Silhouette Score of 0.1616 suggests that the clustering captures some structure but that the clusters overlap considerably. The Davies-Bouldin Index provides another perspective on the clustering quality, but it has no fixed threshold: its interpretation depends on comparison with other solutions, such as different cluster counts. The scatter plot complements both metrics with a qualitative assessment of the clustering results.
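Both metrics are available in scikit-learn, and a minimal sketch of their use is shown below. The data are well-separated synthetic blobs, so the scores here will look much better than the values reported above; the example only illustrates the calls, not the report's results.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic, clearly separated clusters (assumption for illustration).
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 10], [-10, 10], [10, -10]],
    cluster_std=1.0,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(
    n_clusters=4, init="k-means++", n_init=10, random_state=42
).fit_predict(X_scaled)

# Higher is better, range [-1, 1].
sil = silhouette_score(X_scaled, labels)

# Lower is better, minimum 0.
dbi = davies_bouldin_score(X_scaled, labels)

print(f"Silhouette = {sil:.3f}, Davies-Bouldin = {dbi:.3f}")
```

Computing both scores for several candidate cluster counts gives the comparison that makes either number interpretable.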
FACTOR SCORES AND CORRELATION MATRIX
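The project summary describes this step as applying Principal Component Analysis (PCA), adding the resulting scores to the dataset, and computing their correlation matrix. A minimal sketch under those assumptions follows; the data are synthetic, and the column names, the choice of two components, and the "Factor1"/"Factor2" labels are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
sales = rng.uniform(10, 1000, size=200)

# Synthetic stand-in for the numerical columns (assumption).
df = pd.DataFrame({
    "Sales": sales,
    "Quantity": rng.integers(1, 10, size=200),
    "Discount": rng.uniform(0, 0.5, size=200),
    "Profit": 0.2 * sales + rng.normal(0, 20, size=200),
})

X_scaled = StandardScaler().fit_transform(df)

# Project the standardized features onto the first two principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

# Add the component scores back to the dataset.
df["Factor1"] = scores[:, 0]
df["Factor2"] = scores[:, 1]

# Correlation matrix of the scores.
factor_corr = df[["Factor1", "Factor2"]].corr()

print(pca.explained_variance_ratio_)
print(factor_corr)
```

Principal component scores are uncorrelated by construction, so the off-diagonal entries of this matrix are expected to be zero up to rounding. Correlating the scores with the original features is usually more informative for interpreting what each component represents.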
OVERALL ANALYSIS OF THE PROJECT
The report analyzes the dataset_Superstore.csv dataset using data analytics techniques. The analysis shows how data analytics can extract useful knowledge from data, enabling businesses to make strategic decisions and optimize their resources.
In the data exploration and cleaning phase, we began by importing essential libraries such as pandas, numpy, matplotlib, seaborn, and scikit-learn. The dataset was loaded using the read_csv() function, and an initial assessment was conducted using the head() and info() methods to understand its structure and contents. We ensured data consistency and accuracy by converting data types and cleaning columns. Missing values were handled, and duplicate records were removed to maintain data integrity. Outliers in numerical features were identified and visualized using boxplots, while summary statistics were calculated to understand the distribution and central tendency of numerical variables.
Moving on to exploratory data analysis, we visualized total profits by product category and sub-category to identify high-profit areas and explored relationships between variables through scatter plots. Using K-means clustering, we segmented the data into distinct groups based on sales and profit attributes, analyzing cluster characteristics and identifying cluster centroids to understand each group's distinguishing features.
In the dimensionality reduction and factor analysis phase, Principal Component Analysis (PCA) was applied to extract latent factors and reduce dimensionality. The factor scores were calculated and added to the original dataset to capture underlying patterns and relationships, and a correlation matrix of the factor scores was computed to assess relationships between the different dimensions.
This comprehensive analysis provided valuable insights into the Superstore sales dataset, enabling stakeholders to understand various aspects of sales performance, identify trends, outliers, and high-profit areas, and make informed decisions. Moving forward, targeted strategies can be developed to optimize sales and maximize profits, focusing on high-profit product categories and sub-categories, understanding customer segments through clustering, and leveraging dimensionality reduction techniques like PCA for future decision-making processes. Further exploration may include predictive modeling, advanced clustering techniques, and continual monitoring and analysis of sales data to adapt strategies in response to changing market dynamics and consumer behavior.
RECOMMENDATION
To optimize the product portfolio, the company should prioritize products that generate the most revenue while considering discontinuing those with poor sales performance. This strategic focus will not only save costs but also better cater to customer needs by offering the most sought-after items. Additionally, analyzing the factors driving product sales can inform targeted promotional campaigns, such as offering discounts on popular items, which can attract new customers and foster loyalty among existing ones. Operational efficiency is another key area for improvement: streamlining order handling processes and optimizing product storage can save both time and money while enhancing customer satisfaction. Continuous monitoring of sales data and market trends is crucial for adapting strategies in real time, allowing the company to remain agile and responsive to evolving customer preferences. By staying flexible and proactive, the company can maintain its competitive edge and sustain long-term success in the market.
REFERENCES
1. Alghushairy, O., Alsini, R., Soule, T. and Ma, X., 2020. A review of local outlier factor algorithms for outlier detection in big data streams. Big Data and Cognitive Computing, 4(2), p.8.
2. Smiti, A., 2020. A critical overview of outlier detection methods. Computer Science Review, 38, p.100306.
3. Teoh, T.T. and Rong, Z., 2022. Python for Data Analysis. In: Artificial Intelligence with Python. Springer, pp.127-148.
4. Bauer, J.M., Aarestrup, S.C., Hansen, P.G. and Reisch, L.A., 2022. Nudging more sustainable grocery purchases: behavioural innovations in a supermarket setting. Technological Forecasting and Social Change, 180, p.121731.
5. Prell, M., Zanini, M.T., Caldieraro, F. and Migueles, C., 2020. Sustainability certifications and product preference. Marketing Intelligence & Planning, 38(7), pp.840-852.
6. Fan, Y., Kou, J. and Liu, J., 2020, January. Research on the influencing factors of customer loyalty in offline supermarket under new retail model. In: Proceedings of the 2020 4th International Conference on Management Engineering, Software Engineering and Service Sciences. ACM, pp.103-108.
7. MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. University of California Press, Berkeley, Calif., pp.281-297.
8. Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, pp.53-65.
9. Davies, D.L. and Bouldin, D.W., 1979. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2), pp.224-227.
10. Kruskal, W.H. and Wallis, W.A., 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), pp.583-621.
11. García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J.M. and Herrera, F., 2016. Big data preprocessing: methods and prospects. Big Data Analytics, 1(1), p.9.
12. Aloysius, J.A., Hoehle, H., Goodarzi, S. and Venkatesh, V., 2018. Big data initiatives in retail environments: Linking service process perceptions to shopping outcomes. Annals of Operations Research, 270(1), pp.25-51.
13. Wickham, H., 2010. A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), pp.3-28.
14. Tukey, J.W., 1977. Exploratory Data Analysis. Addison-Wesley.
15. Arthur, D. and Vassilvitskii, S., 2007. K-means++: The advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, pp.1027-1035.
16. Pedregosa, F., Varoquaux, G., Gramfort, A. et al., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, pp.2825-2830.
17. Thorndike, R.L., 1953. Who belongs in the family? Psychometrika, 18(4), pp.267-276.