Big Data analytics coursework 2
Greig Dighton
H00296387
Introduction
This report will aim to gain a further knowledge of the Superstore dataset with the goal of understanding how to increase their profits. This task will be completed by combing through the data to understand what is working well and what is not performing to the same standards. By doing so we will be able to identify areas that are in need of improvement when it comes to profitability. Once the issues have been identified we will discuss ways in which adjustments can be made, so that the company is able to become more profitable.
The findings of this report will also be shown in graphical form to allow for an easier viewing of our findings. This will allow us to visually show where the provided company has been losing out on profits and allow us to make suggestions as to what areas of the business they should be aiming to improve in.
Importing data
First and foremost we must upload the dataset to view any form of data. I used the head function to show off the first 5 rows of the dataset to ensure that this had worked properly and that I could see the data was there. Once this had been done I created a summary of the full data set in order to check whether all the features that are expected are correct in accordance with the companies specifications that were provided. I was soon able to check this off through the use of the describe function that listed the features and the numbers within each feature that lay within the dataset. This then allowed me to look at individual features to check what lay within them, and whether this was accurate with what the company provided with us. I first checked the categories feature and soon discovered that the company has 3 separate categories, Office Supplies, furniture, and technology, this fits in line with the specification as we were expecting to find 3 separate categories. I then proceeded to check the sub categories section to understand what lay within it and to check that there was the correct amount as per the specification. This turned out to be correct. I also made sure to look out for any outlier sub categories that may not belong, but they were all satisfactory. Once this was done I then proceeded to check the state category, checking there were no invalid states, none were misspelled, or there was a wrong amount. This all checked out well with 49 correct states, which sits in line with the specifications.
After I was happy that my dataset categories were up to standard and correct in accordance to the specification I began to look for outliers that lay within the data itself. This is where I was able to pick up on a few pieces of data that had appeared after using the describe function on the dataset. What I had found is that there were a few numbers that lay within categories that would skew the data should I choose to leave them in when conducting my analysis. This is why I made the decision to remove these few specific outliers. The first outlier I removed was for any sales number that was below 1. From viewing my describe of the dataset I picked up on this when I looked at the minimum value for sales. Since it is not possible to sell less than 1 of an item I decided to remove any sales from the dataset that were below 1, keeping everything else that was above 1. Furthermore, the second outlier I removed was in the discount column, as I had become aware that in the max row for discount there was a value of 1.3. I made the decision that this would be best suited for analysing the data and it may cause skewed results. I did however take note that this is something the company should be looking at. To remove this outlier I set out to keep only discount amount between 0-1, this then removed the outlying data from the dataset. This then allowed me to create a new cleaned dataset to use for the rest of my analysis.
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Identifying weak data
When identifying what data we want to look at from the data set we had to identify whether there were any unusual values that we do not want to include within the data analysis. To do this we had to look at whether the gaps between the 3rd quartile and the max value is much larger than the gap between the median and the 3rd quartile. If this was the case then the max is unusual and we may look to remove the values from the analysis. This method was also used to see whether the gaps between the 1st quartile and the min value is much larger than the gap between the 1st quartile and the median then the minimum value is unusual. In the case of the data set this method was used to investigate whether any of Sales, Discount, Quantity, and Profit.
When looking at what the data provided there were obvious values that appeared within all aforementioned categories. When looking at the number of values that were labelled unusual, the number came out to be larger than expected. For this reason I made the decision to leave this data in, as in context to this project I believe that the data that appears 'unusual' may be a cause for the lack of profitability. For this reason I decided to investigate these values more closely.
Firstly I investigated all order that were making a negative profits in the hope that there would be some form of pattern that would become apparent. I thought it would be best to work through the lowest performing orders first, and when I did so I stumbled upon some insights that were leading to the bad performance of these orders. The superstore has been giving out some extremely large discounts on some of their orders, with the worst negative profit order being given a 70% discount. This has led to this order making a -6599 dollar loss, and this pattern has been repeated with the majority of the remaining negative profit orders being given very large discounts. These discounts are also not justified through the customers buying in bulk, as the largest purchase quantity for any of the negative profit orders I could find was 13. It has essentially become apparent that the superstore has been giving out discounts far above what they can afford to be handing out. I investigated this issue further by creating a scatter graph of all the superstores orders for profit and the discount given on the order. The graph very clearly shows that all orders that are given 0 to very little discount are profitable for the superstore, and that when the discount amount is increased the profits drop significantly. There is an area on the scatter graph just after the 40% discount where it becomes apparent that orders are no longer profitable and they are now begining to make losses. At this time I decided to run a linear regression model to fully understand the impact of what is happening between the relationship between profit and discount. Once the linear regression was run there were very clear results showing that the expected value of the intercept was 67.7 when all independent variables are zero, essentially meaning that when there is no discount, the superstore is expected to make around 67.7 dollars in profit. Although the coefficient (in this case the discount) shows a value of -250.36, meaning that with every increase in the units of discount the superstore are expected to lose 250.36 dollars. This has a very negative impact for the business due to the size of the loss with each unit of increase of discount. This ultimately reiterates how the superstore needs to revise their discounts that they are giving out on their products as it is having quite a severe impact on the profitability of their business.
Through a further investigation into the data I was also able to find an anomaly when it came to the largest orders by quantity. The data showed that there were 3 orders which have sold a quantity of 10000 each, this is extremely far away from the next closest large quantity sale of 14. I decided it would be beneficial to look at these 3 individual orders more closely, and I was able to identify the items as: Hunt BOSTON Model 1606 High-Volume Electric Pencil Sharpener, Beige, Laser & Ink Jet Business Envelopes, and Ultra door pull handle. Considering the large amount of quantity of each of these products being sold I expected to see a large amount generated by sales. This although is not the case, which makes me believe that there has been an error from the superstore with their pricing. For 30,000 items sold combined between the 3 orders to only make a revenue of just 1202 it is evident that these items are anomalies, and need fixed asap. The error in pricing may have resulted in customers taking advantage of the mistake, leading to the high quantity sold within the order.
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Data analysis
To further understand the superstores problems and figure out ways to increase their profits we need to look at some more in depth data. For this reason I have produced a number of graphs that will allow for further data analysis to be conducted. Firstly I created a graph showing the total number of profitabale sales by category to understand which category is performing the best by profitability. To do this I compared two different graphs, one showing the profit by category, and the other showing negative profit (loss) by category in order to understand where work needs to be done. After analysing both graphs it is apparent that the furniture category is struggling to produce as much profit for the superstore as the other two main categories. From interpreting the graphs it is apparent that furniture is only making around 15,000 in profit, in comparison to technology which is making around 140,000 in profit, and office supplies which is making around 120,000 in profit. So now it is clear for the superstore to see that their furniture products are their least profitable. In order to understand why furniture is performing as bad as it is I created another graph to show the average discount on each category, to which I found that furniture had the highest average discount on products. So for the superstore to increase their profits, they need to look at reducing the discounts that they offer on furniture items, as it is costing themselves by handing out higher discounts.
Furthermore, I wanted to gain an understanding of what regions were performing best. For this I created two graphs again, one of profitable sales by region, the second of non-profitable sales by region. This way I will be able to understand if there is a region that needs work done for increasing their profits. By comparing the graphs it is evident that the central region will need to most work done for increasing profits as they have the second lowest number of profitable sales, and the most non-profitable sales by far. To further understand why this issue is as bad as it is I created another average discount graph, and the results showed that the average discount being offered in the central region for each of the 3 categories is significantly higher. So for the central region, and the superstore as a whole to increase profits, the amount of discounts offered need to be reduced significantly otherwise profits will not increase
The final graph produced shows the amount the superstore is losing due to returned items. This graph shows the amount of profit that is being taken away due to the item being returned back to the store. Items within the technology category seem to be the biggest issue with the superstore losing nearly 14,000 dollars in their profits due to their returned items. For this reason the superstore needs to try harder to ensure that all their products are up to the high standard promised, as this should hopefully reduce the number of returned items. With a lower amount of returned items the superstore should be looking at increased profits, as they are able to keep more of the profits, without returning the money to customers
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Run to view results
Summary
In summary superstore will be able to increase their profits by sticking to a few very simple objectives. First and foremost they must look at their rate of discounts allowed and come up with some form of solution that stops stores giving out discount on orders that end up creating losses for the company. There needs to be a cap on the discount, and all stores must follow this rule, as there are certain stores within regions that are significantly letting the company down in their efforts to maximise their profits. Secondly the superstore need to work on not having items returned to them, as this causes them to lose out on profit they have already made, due to this having to be returned to the customer. It would be beneficial for the superstore to run more extensive quality checks on their products, especially their technology products, in order for them to receive less returns. This will enable them to keep more of their profits, thus increasing their profit. Lastly, the superstore must look out for out of the ordinary orders, such as the orders of 10,000 quantity, as they may have made errors with their pricing. Mistakes like these can cost the business extensively in the long run if they continue to happen, so it is important that checks are undertaken regularly so that these types of orders don't persist and cost the company in terms of their profitability.