Importing the data
Here the Pandas and Matplotlib libraries are imported for data analysis and visualisation. The data is then loaded using the read_excel() function, which can read Excel workbooks in several formats (such as .xls and .xlsx).
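A minimal sketch of this step is shown below. The workbook filename is hypothetical, so a small synthetic frame stands in for the real data here:

```python
import pandas as pd
import matplotlib.pyplot as plt  # used for the plots later in the analysis

# In the notebook the data is loaded from the workbook, e.g.:
# df = pd.read_excel("superstore_sales.xlsx", sheet_name=0)  # hypothetical filename
# (reading .xlsx files also requires the openpyxl package)

# Small synthetic frame standing in for the real data in this sketch
df = pd.DataFrame({
    "Category": ["Office Supplies", "Furniture", "Technology"],
    "Sales": [120.5, 310.0, 99.9],
})
print(df.head())
```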
Cleaning the data
Checking with the isnull() function shows that the dataset contains no missing values, so no imputation or row removal is required.
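The check can be sketched as follows, using a small hypothetical frame in place of the real data:

```python
import pandas as pd

# Synthetic stand-in for the dataset (hypothetical values)
df = pd.DataFrame({
    "Sales": [120.5, 310.0, 99.9],
    "Quantity": [2, 5, 1],
})

# isnull() marks missing cells; summing gives the missing count per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

All-zero counts confirm that no column contains missing values.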
Here are the summary statistics of the data for the categorical variables. Since describe() omits categorical columns by default, 'object' is passed to its include argument to show them.
Here are the summary statistics for the numerical columns. Note that the mean and standard deviation are only meaningful for genuinely quantitative variables; columns such as dates do not carry such values.
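Both summaries can be produced with describe(), sketched here on a hypothetical frame:

```python
import pandas as pd

# Hypothetical stand-in data with one categorical and one numerical column
df = pd.DataFrame({
    "Category": ["Office Supplies", "Furniture", "Office Supplies"],
    "Sales": [120.5, 310.0, 99.9],
})

# Categorical summary: count, unique, top, freq
cat_summary = df.describe(include="object")
# Numerical summary (the default): count, mean, std, quartiles
num_summary = df.describe()
print(cat_summary)
print(num_summary)
```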
Removing the outliers with the help of the standard deviation. Outliers can distort the analysis, so they are removed here by constructing boundaries from the mean and standard deviation of each column; rows whose Quantity, Profit or Sales values fall outside these boundaries are dropped, as these columns contain outliers (Berger and Kiefer, 2021).
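A minimal sketch of this filter, shown for a single column with hypothetical values (the same idea is applied to Quantity and Profit):

```python
import pandas as pd

# Hypothetical sales figures; the last value is an extreme outlier
df = pd.DataFrame({"Sales": [100.0] * 10 + [10000.0]})

# Keep only rows within mean +/- 3 standard deviations
mean, std = df["Sales"].mean(), df["Sales"].std()
lower, upper = mean - 3 * std, mean + 3 * std
df = df[df["Sales"].between(lower, upper)]
print(len(df))
```

Note that the mean and standard deviation are themselves inflated by the outlier, so the boundaries should be computed before filtering, as above.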
Plotting the data
A bar plot of product sales by category is provided, as it is useful to understand what sells the most for the company. The most sold product category turns out to be Office Supplies, followed by Furniture and then Technology.
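The plot can be sketched as follows, using hypothetical stand-in figures:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

# Hypothetical stand-in data
df = pd.DataFrame({
    "Category": ["Office Supplies", "Furniture", "Technology", "Office Supplies"],
    "Sales": [200.0, 150.0, 120.0, 180.0],
})

# Total sales per category, highest first, drawn as a bar chart
totals = df.groupby("Category")["Sales"].sum().sort_values(ascending=False)
totals.plot(kind="bar", ylabel="Total sales")
plt.tight_layout()
```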
Here the trend of sales over the period covered by the data is plotted as a line chart. The plot shows a clear seasonal pattern: sales grow towards the end of each year, spike at year end and then drop at the beginning of the next year. Another observable pattern is that the company's sales have improved over the years, which can be identified from the increasingly large spikes towards the end of the plot.
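A sketch of this chart, aggregating hypothetical order dates to monthly totals before plotting:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

# Hypothetical order dates and sales standing in for the real data
df = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2016-01-15", "2016-06-10", "2016-12-20", "2017-12-22"]
    ),
    "Sales": [100.0, 150.0, 400.0, 600.0],
})

# Aggregate sales to monthly totals and plot the trend over time
monthly = df.groupby(df["Order Date"].dt.to_period("M"))["Sales"].sum()
monthly.plot(kind="line", ylabel="Sales")
plt.tight_layout()
```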
Modelling an ML algorithm
The K-Means clustering algorithm will be used to model the data, as it can find clusters of data points that are close in proximity and hence infer hidden patterns in the data that are otherwise not apparent. K-Means is a simple yet effective algorithm for discovering groupings in data and can help identify important features that will be valuable for the company's sales optimisation (Ahmed et al., 2020).
Here the 'Ship Mode', 'Segment', 'Region', 'Category', 'Sub-Category', 'City', 'State' and 'Returned' features are encoded, while the 'Sales', 'Quantity', 'Discount' and 'Profit' features are scaled using the StandardScaler. This is a necessary step for the K-Means algorithm: because it measures Euclidean distances, columns with large ranges would otherwise dominate the clustering, and proper encoding and scaling improve the quality of the model and the insights it provides for the company.
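This preprocessing can be sketched on a small hypothetical frame; label encoding and StandardScaler are one reasonable reading of the step described:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical stand-in frame with two categorical and two numerical columns
df = pd.DataFrame({
    "Ship Mode": ["First Class", "Standard Class", "Second Class"],
    "Region": ["West", "East", "West"],
    "Sales": [120.5, 310.0, 99.9],
    "Profit": [10.0, 45.0, 5.0],
})

# Encode each categorical column as integer labels
for col in ["Ship Mode", "Region"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Standardise the numerical columns to zero mean and unit variance
num_cols = ["Sales", "Profit"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```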
Here the K-Means clustering is performed by first importing the necessary libraries and then optimising the number of clusters for the algorithm. The optimum is found using the silhouette score, which is one of the best ways to determine the optimal number of clusters (Oktarina et al., 2020).
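The cluster-size search can be sketched as follows, on synthetic two-blob data rather than the real dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs (hypothetical data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(5, 0.2, (50, 2))])

# Fit K-Means for a range of k and keep the k with the best silhouette score
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k)
```

On this toy data the silhouette score correctly picks two clusters; on the real dataset the same loop yields the optimum reported below.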
The optimal number of clusters found is 2. The features contributing the most variance in the data frame are 'City', 'State' and 'Sub-Category', and these are likely the most influential features contributing to the 'Profit' of the company. For optimisation, the company should therefore focus on its profitable product sub-categories and on tuning discounts across the different locations in which it operates, as improving these is the clearest route to improving profit.
References
Ahmed, M., Seraj, R. and Islam, S.M.S. (2020). The k-means algorithm: A comprehensive survey and performance evaluation. Electronics, 9(8), p.1295.
Berger, A. and Kiefer, M. (2021). Comparison of different response time outlier exclusion methods: A simulation study. Frontiers in Psychology, 12, p.675558.
Oktarina, C., Notodiputro, K.A. and Indahwati, I. (2020). Comparison of k-means clustering method and k-medoids on twitter data. Indonesian Journal of Statistics and Its Applications, 4(1), pp.189-202.