Introduction.
This is another early project of mine that I completed in my first year of data analysis training, set to "public view" because it was pivotal in helping me learn most of the concepts I use now and I sometimes like to think, as data analysis projects go, it still holds a wee bit of weight.
Problem statement: "A leading superstore chain would like to understand which products, regions, categories and customer segments they should target or avoid."
The data.
Importing the data.
Dataframe head.
Unique values.
I will drop the Row ID and Country columns, keeping Customer ID for a bit more preprocessing. I can use the Order ID column for SQL queries and I will drop the other seemingly unnecessary columns ("State", for which there are postcodes / vice-versa) before modeling.
Statistical description.
Sales has a mean of 229, a max of 22638 and a standard deviation of 623, so that will be interesting to check out.
The Profit column has a minimum of -6599, a low mean figure and quite a high standard deviation which will definitely be worth a closer look.
And there are no null / missing features.
Data cleaning.
Lowercase and snake_case treatment of column labels.
Creating a column called "wait_time" which will be the difference (in days) between the shipping date and the date of delivery.
Setting day, month & year columns.
Splitting the customer id column, keeping the integer values so I can then use KMeans clustering to group customers into Platinum, Gold, Silver and Bronze based on sales.
And a total sales column.
Now a 'gross' column.
And a 'net' column.
Parsing the model name from each product name.
KMeans clustering for customer segmentation.
Assigning grades based on their cluster value.
Merging clustered data with the original dataframe on customer id.
Analysis.
Density plots for each numeric column.
• Sales: Almost every sale is below 5000.
• Quantity: Quantities under 10 are the most popular.
• Discount: Discounts are between 0 and 0.1, rising again at 0.2, with slight peaks at 0.4 and 0.7 / 0.8.
• Profit: Profit seems to be balanced around 0, possibly even at a loss.
Profit.
There seems to be an excessive amount of negative profit causing an ever-so-slight negative trendline, so i will begin with this negative data to find out why.
Profit, loss, and overall profit.
• The total profit is $286,397.
• The losses come in at a sum of minus $156,131, around 55% of the total profit figure.
This store could be making over $360K in revenue if it halved those losses (and ran at a realistic 25% - 30% profit margin). Time to take a look into where that revenue is going.
Negative profit data.
A query to select only negative profit features.
Data distribution by category for the negative profit data.
• 1: Office supplies
• 2: Furniture.
• 3: Technology.
Overall losses by brand name.
• Overall losses mostly belong to the Cubify brand. Their loss of -3,419 dollars per item is almost 40% greater than those of the Cromcraft products which sit at a loss of -1346 dollars per item, second to Cubify.
• Bevis is the brand causing the third-largest overall loss figure of -816 dollars per item.
Let's look into one or three of the biggest losers here.
Cubify losses.
Searching through the Cubify items in the negative profit dataframe shows only two items, so dropping these Cubify items from the itinerary in future would be the right idea considering they have a mean loss figure of -4360 dollars. If an enterprise buys three of each in a week, the store is looking at an average weekly loss of almost $40,000, solely on these products.
Cubify item statistical description.
Cisco losses.
There are 11 Cisco items in the negative profit dataframe causing mild to severe issues. The mean losses float around the -264 dollar mark which is still a 94% drop on the Cubify products' mean loss, but at a loss of $24 per item, these products are quite detrimental to the bottom-line.
Cisco item statistical description.
GBC losses.
There are 43 GBC products spread over 141 separate features in this negative profit dataset equating to an overall mean loss of -157 dollars. That's only a $3.6 loss per item but they're still problematic. The major loss comes from a few popular products that do the same job as one another (binding machines), so, dropping and swapping these for more reliable or more cost-effective items makes this an easy fix.
GBC item statistical description.
Correlation matrix for the negative profit dataframe.
The most important figure here looks like the discounts in the Office Supplies category. Considering several of the products negatively affecting profit are in the office supplies segment, could the discounts be solely responsible for the losses or is there something else happening?
Let's look at the sum neg profit for each segment...
Sum of negative profit and discount by category and segment.
Combined loss and discount for each Consumer, Corporate and Home Office segments:
Office Supplies:
• Office supplies are -56,614 dollars in the red, with Consumer discounts totalling 288.20, Corporate discounts totalling 166.60 and Home Office discounts totalling 92.20.
Technology:
• Technology is -38,578 dollars in the red, with Consumer discounts totalling 46.80, Corporate discounts totalling 27.60 and Home Office discounts totalling 16.00.
Furniture:
• Furniture is -60,935 dollars in the red, with Consumer discounts totalling 141.60, Corporate discounts totalling 77.60 and Home Office discounts totalling 42.78.
The statistics.
So the discount size doesn't necessarily give me any reason to think it is affecting the profit, unless we're only analysing the Office Supplies category.
Product name by sum of quantity and segment.
Here is a comparison between the top 40 biggest losers in product name with quantity and segment.
• The Consumer items are easy to see once again across all 40 products. The main culprits are the Chromcraft Round Conference Tables, the KI Adjustable-Height Table and the Global High-Back Leather Tilter in Burgundy.
• For the Corporate segment, it's the Bretford Rectangular Conference Table Tops, the Chromcraft Round Conference Tables and the Global Wood Trimmed Manager's Task Chair in Khaki.
• The Home Office segment's major losses stem from the Bretford Rectangular Conference Table Tops, the KI Adjustable Height Table and the Global Wood Trimmed Manager's Task Chair in Khaki.
As well as being in the Home Office and Corporate segments' three most common products negatively affecting profit, it is also the Consumer segment's fourth most common loser.
Negative profit by customer rating.
It seems the Platinum customers have the largest sum of negative profit. As per some similar BI projects I have worked on, this stands to reason as the customers generating the most profit will also be responsible for the most cancellations or returns.
Back to the normal data.
And these Platinum customers are by far the most profitable of all customers. As mentioned in a prior cell, these product woes shouldn't be too difficult to fix and once rectified, the monetary returns will be highly noticeable given the distribution of Platinum sales here. A highly rewarding outcome for the company for what will likely be an easy bit of investigation work.
Category.
Category by net price and sum profit.
• In the 'Sum of negative profit by category and segment' chart earlier, we saw Office supplies being responsible for the second largest sum of loss. Here we see Office Supplies responsible for the second-highest sum of profit, as well as showing the lowest sum of gross price.
• The opposite can (almost) be said for the category responsible for the largest sum of loss, Furniture. We see the lightest hue representing the lowest sum of profit and the highest peak for the highest gross price.
• Technology products, the lowest of all loss sums, straddle Furniture and Office supplies in the sum of gross price column, while holding the largest sum profit of the three categories.
Category by average profit per item and sum of profit per item.
• A clearer look into the above comparison shows Technology sitting way ahead of the other categories in both the average profit per item and the sum profit per item.
• Technology sees a sum of £38,775 and an average of £20.99 per item.
• Office Supplies, with a sum of £35,065 and an average of £5.81 per item, are making only 27% of the Technology item average, but respectable 90% of the sum profit per item.
• Furniture, with a sum of $4,105 and an average of £1.93 per item, is making 33% of the Office Supplies item average and 11% of the Office Supplies sum profit per item. This figure is 9% of the Technology average profit and 10% of the Technology sum profit per item.
Category by customer rating and average profit.
• The majority spread of sales are clear to see here, with Platinum customers' sales primarily existing in the Technology and Office Supplies categories.
• The Gold customers are seeing good profit figures in Office Supplies and Furniture, but are falling behind Bronze customers in the Technology category.
• The Silver customers are leading in the Office Supplies and Furniture categories, showing up last in the Technology category.
• The Bronze customers are the second-highest scored from a profit perspective in the Technology department but bring in the least in the Office Supplies and Furniture departments.
Ship mode.
Profit per item by ship mode and average wait time.
There is nothing out of the ordinary here, the most common shipping type is standard class and that class holds the highest average wait time. The other classes' wait times appear to be in accordance with their associated class; Second class has the second-highest average wait time, First class has the third-highest average wait time and Same day has the lowest average wait time.
Considering a lot of these items are office supplies that will be re-ordered, it might not be a bad idea to encourage some customers to use second class postage. Standard class sees 8,829 dollars per day profit per item, the same amount of goods in transit via Second class would see 13,795 dollars per day profit per item, so if some of these customers were to reorder frequently, those few extra days saved in shipping could reflect on certain repeat orders made in quick succession, at very little extra cost to the customer if the customer is prepared to wait a day or two more.
Customer ratings.
Total profit by customer rating per state.
• The Platinum members bringing in the majority of the profit (a sum total of 88,246 dollars) are purchasing from New York and California. Washington, Indiana, Michigan and Georgia's Platinum store sales then bring in a combined total of 52,826 dollars.
• The Platinum stores responsible for the most negative profit are Texas, Ohio and Illinois, totaling -34,939 dollars.
Total profit and loss by customer rating.
Eyeballing the above chart, I can see a mild difference in customer types. Mentioned earlier in the EDA was that certain customer types who were responsible for good profit sums were also responsible for a proportionate size of monetary loss, but with that's different with the Silver sales. It seems like there's a distinct lack of negative profit associated with sales of this rating and as a result, they appear to be bringing in a shade more total profit than the Gold customers. Let's check:
Platinum:
Gold:
Silver:
Bronze:
Sum of sales by profit per item, product name and customer rating.
So the Silver customers do have a higher P/L ratio than that of the Gold customers. In that case, let's see if the Gold customers are the ones buying the Cubify items / any of the other items in the loss dataframe.
It does seem that the Cubify items were bought by the Gold customers, along with another serious offender, the GBC electronic binder. So, a few distinct patterns emerging so far which are all being narrowed down to the same (products).
Discounts by customer rating per state.
It's interesting that some of the states with the largest losses for Platinum shoppers also have the highest discounts. My gut feeling at one point in the EDA was that this could be the case, and the losses were a direct result of the discounts but this could be an uncorrelated phenomenon. As they say, there is only one way to find out...
Discount sum per customer rating.
• The Platinum customers who are seeing a -67% loss in profits compared to Gold customers are also seeing a +70% discount sum compared to their Gold counterparts.
These figures are close enough to encourage me to look a little further into the discount variable.
Discount by category and customer rating visualised.
• The chart below shows Bronze shoppers claiming more discounts than Silver shoppers in the Office Supplies and Furniture categories, and the Platinum customers being leaps and bounds ahead of the rest in all categories.
• Across categories, Platinum customers see over double the amount of discount of the next-lowest sum of discount by customer ratings (be they Bronze, Gold or Silver).
This has wrapped up one of my questions; With Silver seeing more discounts than Gold as shown in the chart, Silver's reason for turning more profit than the Gold customers may not be correlated to their discount amount.
Customer rating by state and segment.
To go a little bit deeper into this demographic in Missouri, three out of four customer types ordered from the Corporate segment, the remaining customer type ordered from the Consumer segment. So with a little more analysis of this chart below and the two above, creating a strategy for better targeting using purchasing information in these states should present very little in the way of challenges.
One more thing I would like to check...
Sum of sales by profit per item, product name and discount.
To be 100% sure, I thought I would check to see if the items in the negative territory carried a heavy discount figure.
The size of the markers in the charts below represent the size of discount on the product in question. The first of the charts pulls data from the normal dataframe, where the chart below that represents data from the negative profit dataframe. For the discounts to be even slightly correlated to the negative profit we would have to see the majority of large markers in the lower half of the bottom chart, which, for the most part, we don't witness here. However, we do see - once again - the presence of discounts in the products most negatively affecting the profit margin.
Quantity.
Quantity by sum of profit per item.
• Customers buying in quantities of 2 and 3 account for the largest sum of profit per item.
• Quantities 4, 5, and 1 then follow.
• Quantities above 9 don't show particularly high profit figures, with the highest profit per item greater than quantity 9 belonging to quantity 13 at 488 dollars.
Average quantity by state and average profit.
If we're taking the average quantity into account, to gauge the states with the best profit per quantity we would have to look for the bars peaking around 4 or slightly above, with the darker hues. So a bit of balance is required here. States like Vermont, Washington, Oklahoma, Nevada, Indiana, Wyoming, Utah, Rhode Island are all important in this regard.
Partly, it seems some of those states have some of the least discounts and therefore the products which are negatively affecting the profit may have been on sale in these states.
Looking at the sum of quantity per state, we see the deeper hues in the longer blocks representing the majority of discounts residing in the states where the majority of sales have taken place. This is not an uncommon phenomenon, but Texas being the leading store for discounts (as opposed to California, the leader of quantity sold) is a little bit of an outlier as far as 'ordinary' sales data goes.
So, added to the other data witnessed thus far, Texas is most definitely a store worth looking into.
Average quantity by state and category where zero discount is applied.
Looking at this chart, we should get a bit closer to the better-earning states. Considering the Furniture category is one of the worst earners, I would be looking at the states with Furniture as the least significant value along with an average quantity in the 4-5 region. A state like Utah will still score well due to its tech products outweighing its furniture products. I mentioned Rhode Island above, and here I see that Missouri could be a 'sleeper' state worthy of note, as could New York.
Customer rating by state and profit per item where discount is zero.
A little more food for thought on making up for lost profit. Here we see some good states for customer retention such as:
• Missouri with its Gold customers purchasing Technology products at a good profit per item, Platinum and Silver customers purchasing Office Supplies.
• District of Columbia has Platinum customers with a good average of profit per item buying Tech products which should outweigh the Bronze customers buying the Furniture.
• Wisconsin could be worthy of note due to its Platinum customers purchasing Technology items and its Gold customers purchasing office supplies, even if some Silver customers did buy Furniture items.
• The Platinum customers (or customer) purchasing Tech products in Maine will be an important one.
• Washington also strikes as another important state for customer retention, with a good spread of Platinum, Silver, Gold and Bronze shoppers buying Tech products.
Sub-category.
Sub-categories and items making an impact.
• Technology: Copiers are the best-selling item, followed by phones and accessories.
• Office Supplies: Paper is the best-selling item, closely followed by binders, then storage and appliances.
• Furniture: Chairs are the best-seller, followed by furnishings. Tables are visible here again as the item causing the losses along with book cases.
Customer rating by state and subcategory.
• Judging by the chart above titled 'Customer rating by state and profit per item where discount is zero', I thought New York could be a highly important state for customer retention, however, returning to the normal dataframe and with the addition of the subcategory, we can see that NY Bronze customers purchased supplies, chairs and furnishings which will cause a slight drop in value.
• There are a handful of states seeing high-value purchases from Gold and Platinum rated customers. Missouri's Gold and Platinum customers are purchasing copiers from the Technology dept., while Bronze customers are buying paper and accessories along with the Silver customers. Vermont is an interesting state with Platinum, Gold and Silver customers buying Technology accessories and storage from the Office Supplies category, so I think that along with the more obvious states like Texas, certain low-key states like Vermont and Missouri (to name but a couple from a few) would be good to look into from a retention perspective as well as information gathering for future consumer targeting strategies. I will focus my final feature importance model on these metrics to be certain.
Day.
• Monday (1871), Friday (1818), Sunday (1710) and Saturday (1655) are the best days for sales.
• With 1463 sales, Thursday is the fifth best day, Tuesday follows with 1106 sales, but Wednesday is the clearest low-earning day with only 371 sales.
Discounts and average profit per item by day.
• The greatest average discounts, in descending order, are seen on Saturday, Thursday, Sunday and Tuesday, then Monday and Wednesday.
• The lowest profit per item, in ascending order is seen on Saturday, Monday, Friday, Thursday, Tuesday, then Sunday and Wednesday.
• The day with the lowest customer count (Wednesday) has the lowest average discount but the highest average profit per item. From there, we don't see anything else worthy of note if we are still looking at this from a "loss by way of discount" perspective.
• The three most common shopping days don't appear to have an unusually high average of discount or profit, although the most common shopping day (Monday) has the second-lowest average profit per item, which is worth checking due to there being some scope for targeting a wider array of customers on busy days.
Month.
• Like many sales datasets I have seen, the latter half of the year is the most popular time for shopping. September, November and December in particular with this data. Although October is the next most common month, its figure is only 59% that of the next most common month, September.
• And similar to many business / sales data again, January sees a dip in revenue due to the preceding Christmas period (assuming this data is from a part of the world where Christmas is celebrated), along with February where shoppers are more frugal, paying off their credit cards from Christmas etc.. March, April and May all see average figures for a superstore selling mostly technology products to corporate buyers.
Discounts and average profit per item by month.
• The best profit per item figures are in March and Feb, both months with below average discount figures. Then we see October in tow after that. The month with the highest customer count is November, consequently a month with a relatively high (5th largest) average discount.
Average discount per month by category.
• Furniture sees the highest average discount amount in June, October, May, August, March, Feb.
• Office supplies see the highest average discount amount in April, May, July, June, November, October.
• Technology sees the highest average discount amount in January, March, December, October.
Time analysis.
Sales over time.
It seems as though there was a notable drop in sales around the middle of 2014 going into July of 2015 which, looking at the time-series analysis below, could have been a result of the drop in Technology sales. Overall though, it looks as if sales were at their best in the later months of 2014 and 2017.
Net price fluctuations by subcategory and date.
Net price hikes over the years:
• Accessories, Appliances, Copiers, Furnishings, Labels and Paper have all seen almost continuous price hikes since 2014 so these products would be well worth keeping an eye on in future. Obviously these will be due to the cost of goods and year-on-year inflation etc., but good supply chain management is usually the go-to realm for economy-proofing the top line.
• Bookcases have seen a drop in price since 2015 without too much of a price hike since.
• Copiers have levelled out quite nicely in the final year. So, two subcategories worth looking into potentially stocking more of.
• Supplies, Storage & Tables follow a similar price distribution to one another over the years, dropping in 2015 also before taking off into 2017.
Sum of subcategory sold year-on-year.
A simple time-series analysis of each category.
• We see a steady upward trend in the overall sales for Technology and Office supplies. Furniture sees a few peaks in sales in the latter months of each year, but overall its trend is quite flat in comparison to the other two.
• A lot of the peaks in every category seem to coincide with the discounts per month. So perhaps the discounts were applied in the face of stagnating or deteriorating sales.
Furniture.
Office Supplies.
Technology.
Profitable and popular items.
Sub-category by state and sum profit.
A quick look into the most profitable stores' most popular subcategories returns:
• Accessories for California.
• Binders for New York, Michigan and California.
• Copiers for Washington, Indiana and California.
• Machines for New York.
• Paper for California.
• Phones for New York.
Profitable items over 20%.
Considering the losses on display here I thought I would begin by creating a dataframe using values from the net_price and profit_per_item columns so I can list and visualise the products that are bringing in 20%+ profit. I chose 20% because that isn't only a realistic, obtainable goal for most enterprises, but should be quite an easy target to hit for a superstore of this nature.
The top 40 products creating 20%+ profit.
So here we have them. The number one product, funnily enough, is an electric binding machine that isn't a GBC product, so it would be well within the store owner's interest to drop certain GBC items (or GBC as a whole) while there are currently similar products in-stock already turning a good profit.
Product brand / make by sum of profit per item.
• The Canon brand is by far the best-seller with a profit per item figure of 11,042 dollars. It's then Xerox with a figure of 6,432 dollars per item, 58% of the profit the Canon products make. The third-most outstanding brand appears to be Fellowes with its figure of 5,208 dollars per item.
• An interesting one is GBC, the brand we saw in the loss section. Its overall profit per item is 3,374 dollars, so their products are still making a slight profit. But the fact still remains that their items can be replaced with a different brand of in-stock items, which will make much more profit than the GBC brand.
Whether GBC items remain in-stock due this being a brand that is a leading brand that consumers gravitate toward might be an issue for the board room though!
Net Price.
Net price vs. profit.
I will only compare the net price column with the profit column because we have already seen where the majority of loss and gain stems from and the profit variable will be the target in the final model. I think I will include net price in the model as we can see a linear relationship between the two variables here with the profit gained rising in unison with the net price of the item sold, as is the case with most product sales, give or take.
Product recommender.
Product recommendation system based on the best-selling item bought by Platinum customers.
As much as I would prefer to bring this up-to-date using something like pyECLAT, DeepNote is freezing due to the size of this project. So I will be using the pivot table and a bit of grouping and sorting to build a product recommender based around the Platinum customers' highest-rated products.
Grouping products by product name and sales.
The new "number of sales" column data:
Creating the pivot table with the customer rating as the index:
There are no missing values, but just in case one slipped past the net:
Assigning the most common product purchased by Platinum customers to its own variable:
Finding the most correlated products to the Canon Image Class by user rating, using the sales matrix and ".corrwith()" method:
Joining that data on "number of sales":
In the left column we can see the correlations to the Canon Image Class 2200 Advanced Copier, in the right-hand column we can see the sales counts for each item:
Recommended items for Platinum customers who bought the Canon Image Class 2200 Advanced Copier, with values sorted in descending order.
• Any item with a correlation of '1' will be a top recommended product.
• Hover over the product name preview for the product name in full.
• (There are 176 pages of products here but only the first page will render in the final public Deepnote project).