About.
This project uses data from a medium sized beverage chain and aims to return a model trained on the profit column as a means of predicting the most efficient stores by state, market, day, month and product. I use the word 'efficient' as there are no individual product prices, meaning the sales data collected is in lump sums, a good medium for predicting max income / minimum sales effort. The initial plan was to create a price optimisation project where I could accurately return better prices for each product but the lack of individual product prices hindered that plan, so this was a good alternative. There were a good few challenges, mainly in the final model where VIF analysis returned more than one 'inf' value thanks to the similar products spread across stores in certain amounts etc., so a lot of effort went into experimenting with different algorithms which could work through the 'confusion' per-se. I settled on Principal Component Regression with LightGBM & multiple cross validation folds (plus more GPU time than I would have liked). Clustering sales groups by area code using KMeans resulted in some extra helpful insights.
A peek at the data.
A total of 20 columns with no null values The object columns don't seem to be too tricky (date, state, product, market) and won't return too wide a dataframe for the model if dummified.
Lowercase treatment / underscoring the column label spaces.
Statistical description.
• In the profit column we see a minimum figure of negative 638 as well as a negative 302 in the margin column, which could possibly be cancellations / stock loss.
• The mean profit is $61 with a std of 101.7.
• The mean margin is 104.3 cents per dollar of profit made, with a std of 94.
• The average marketing spend is pretty good, at 31.2 dollars with a low std.
• The inventory column has a relatively high mean of 749.4 which will be well-worth a closer look. I won't look too deep into the columns with 'budget' in the title here as they are target figures and therefore are of little relevance.
With these low-medium standard deviations we should see a lot of the sales / profit & marketing spend in the lower band, and with a mean profit of $61 it's looking like this will be a relatively profitable beverage company even including the losses.
Datetime treatment.
Using the first three characters of each day, converting the month to its string abbreviation, parsing the year as int.
New columns.
Creating a 'sum profit per area code' column as well as a total sales per area code. I don't have customer ids or anything of the sort and I would like a way of categorising certain financial metrics.
Analysis.
As discovered in the statistical description, profit, margin, sales, cogs, marketing and total expenses all have reasonable distributions. Here we see a similar, secondary, smaller peak in the right-hand tail of both sales and cogs, neither of which are issues due to them being representative of each other (the goods sold will have been paid for in the first place).
As well as having a very similar tail to that of the margin, data there's a slight plateau in total expenses around 20-25. Makes sense. So there is nothing out of the ordinary here requiring any serious further analysis besides the negative profit figures.
KDE plots of all columns.
Further evidence of certain values in negative territory here as well as similar negative values in the budget columns, which means these are all forecasted stock losses that will have been accounted for, or close to.
Very similar patterns in the total expenses and marketing columns, as to be expected. There aren't any values here that would give me any reason to believe there are any notable issues with any of the columns, some of the negative values might be greater than forecasted / budgeted and they will become apparent at some point in the EDA. Looking at inventory vs. margin, I see some clustered values on the left hand side of the chart which represent negative margin | high inventory price, so if these are high-priced losses then that could be the (or one of thee) issue(s).
1 - Budget values vs. actual values.
Margin.
The marketing forecast misses slightly in the 40-60 and 140-160 regions where actual gains have been made, as well as the < -200 region where some losses weren't accurately forecasted. With only a 3.4% difference in forecasted vs. actual, the budgets aren't far off the mark.
Sales.
It's a similar story with sales, with forecasts slightly higher in the lower (sub +80) range, and actual sales making up better ground than forecasted in the +80 to +100 range. There are more actual sales in the latter (> +500) regions than forecasted as well as some forecasts above +800 that didn't hit. Overall, actual sales are outpacing targets to the tune of 9.8%.
COGS.
Once again, actual values exceed those forecasted, with the actual COGS values floating around the 250 mark being around double those forecasted.
Profit.
The budget profit is almost on a par with the actual profit, with a few negative values unaccounted for in the forecast but slightly higher values than forecasted in the budget ranges that will make up for the negatives.
Budget profit vs. actual profit by state.
A rundown I included for each budget variable in the full project but I will only include a written description for the profit column in the public version.
• The California, Illinois, Iowa, Massachusetts, New York, Colorado and Florida stores were the key stores responsible for exceeding forecasts.
• Oregon and Nevada were almost on a par with forecasts.
2 - Sales.
Quantile analysis.
• The IQR for sales across all of the coffee shops is -95.0 to 425.0. This is -20 and +20 dollars of the targeted sales.
• The IQR for profit across all of the coffee shops, as gathered from the statistical description as well as being a [somewhat... citation needed] safe common assumption for this type of product, appears to reside in the lower price range. Here we see a figure between -95.5 and +204.5.
Sales quantile values.
Budget sales quantile values.
Profit quantile values.
Budget profit quantile values.
How many sales were recorded within the IQR?
• There is a grand total of 3842 sales between -115.0 and 425.0.
How many sales were recorded above the Q3 range?
• There are 406 sales above 425.0.
Largest average sales per day and month.
• The greatest percentage of sales fell on a Friday in July, a figure of 5.27%.
• A Wednesday in June saw the number two best sales, a figure of 5.09%.
• A Monday in August then saw the third-best day for sales, with a figure of 5.09%.
The summer months are the best months for business, with July, August and June leading the average sales figures:
Sales by state.
Of the ten most popular states, the split of profit (and their percentages) are as follows:
• California - 17.4%
• New York - 12.8%
• Illinois - 12.6%
• Nevada - 10.8%
• Iowa - 9.86%
3 - Profit.
Along with the concentrated cluster of sales between 0 and 200, here we are able to see the negative values a bit more clearly. This data can't tell us a great deal about what the negative values relate to exactly, although one thing we can tell from looking at this information is that the loss isn't too much to worry about from an absolute profit perspective. Any loss is an issue that should be analysed immediately, and there is an obvious issue here, but these stores aren't in grave danger of missing profit targets due to these negative values. This could be a matter of 'a couple of tweaks here and there'.
Profit vs. loss.
That being said, -32,992 is quite a large amount that still results in a (healthy but repairable) profit margin of 31.66%. Global averages for stores of this nature mostly range between 5% at the bottom end, 15% - 20% at the normal, up to 70% in extreme cases.
Average monthly profit over time.
• 2010 saw relatively low figures (consistently below 10K) when compared to 2011. It saw good trade above 9K from May up until September where trade began to slacken off below 9K, carrying on under 9K until December. Going into January of 2011 however, we see a 21% hike in profit on 2010's best month, July.
• 2011's profit remains steady above 12K from January onwards, seeing a max peak of 13.67K in July. This is quite a hike in annual profit, from a good run of between 8K - 10K through 2010 to an almost instant jump to 12K - 14K in the final months of that year.
Profit vs. cost of goods sold.
• This is also good news, with the cost of goods sold not exceeding 300 dollars very often. But there are some relatively low cost items creating some intermediate losses (cogs not exceeding 49.5, with -25.5 in losses for these items).
• The majority of cogs (1.98%) are below 100 dollars, with cogs between 100 and 199 dollars (1.01%) being the next most common bracket.
Profit distribution by state.
• Here we have a chance to gain some good insights into where the profit is coming from as well as the loss. It's easy to spot the majority of negative values stemming from two major stores and a handful of minor ones, chief of which are the New York store and the Nevada store.
First, the loss:
• New York's losses appear to consist of Mint Tea and Mocha, where Nevada's losses are primarily on Regular Green Tea.
• California sees losses in Amaretto coffee and Decaf Irish Cream.
• Missouri's worst assets are Lemon Herbal tea, Chamomile and Lemon Tea.
• Massachusetts' worst asset is the Caffe mocha.
• Utah's are the Green Tea, Mint Tea and Darjeeling.
Of the positive:
• Texas, Illinois, Florida and Oklahoma have no loss whatsoever. And with a minimum profit of only -3, I feel it would be unfair to exclude Iowa from this group.
Average profit percent per state.
• Illinois is the most profitable state with 11.6% of the profit.
• Massachusetts is the second-most profitable state, with 9.3%.
• California is the third-most profitable state, with 9.0%.
(When looking at it from a sum profit perspective, the order is California, Illinois, Iowa).
Profit distribution per market.
• The Central market appears to have the better outlook of all quarters, with the highest Q1, Median, Q3 and upper fence of all of the stores, very little in negative territory and a few healthy values above the upper fence.
• The next best performing market is East, with a slightly higher Q3 than West (but a slightly lower upper fence), more figures above the upper fence and a lot less negative profit than West's negative figures (-36.17% vs. West's -55.83%).
• Finally we see the South Market as the lowest earner. Although it is still bringing in healthy figures above the upper fence range.
Mean profit per market figures.
Loss per market figures.
Profit per market size.
The major market brings in the most profit, to the tune of 66.67% over that of the small market's combined sum profit.
States per market.
With the inclusion of the profit column we can see the spread of profit over the markets as well as the states within the markets. Those sum profit figures and the leading states per market are:
• Illinois (30.8K), Iowa (22.2K) and Colorado (17.7K) are the best performing Central stores.
• Texas (15.8K), Oklahoma (8.5K) and Louisiana (7.3K) are the best performing Southern stores.
• California (31.8K), Oregon (12.4K) and Washington (11.4K) are the best performing Western stores.
• New York (20.1K), Massachusetts (16.4K) and Florida (12.3K) are the best performing Eastern stores.
Market size per geographical market.
• The Central area contains 696 major and 648 small.
• The West area is the leader for small markets, contains 288 major and 1056 small.
• The Eastern area contains 552 major and 336 small.
• The South area contains 168 major and 504 small.
Considering the Southern and Western markets aren't the most profitable, the reason *could* be the market size.
Monthly budgeted profit vs. actual profit.
• We see actual profit falling short of projected targets in the months of October (-17%), May (-1.5%) and August (-1.3%).
• We see profit out-pacing projected targets in the months of December (13.2%), November (12.3%) and January (3.4%).
Month-on-month profit (pct. change).
• (The 'nan%' Jan 2010 value exists due to their being no previous data from which to deduct a % change).
• As seen in the time series analysis above, there is something of a quantum leap in profits between December of 2010 and January of 2011. Here we see the actual figure; 42%. From there, we don't witness much in the way of volatility. The next fluctuations in profit worthy of note are represented in June's figure of +7% along with May and September's lows of -4%.
4 - Marketing.
Marketing distribution per month and year.
• Was the leap in profit around December 2010 / January 2011 due to additional marketing spend? According to the data visualised below.... not exactly. It appears that the largest marketing spend exists in the months of July and August, followed loosely by June and December, all with what seems like equal values spent across all the months in both 2010 and 2011.
• So, marketing spend does see an increase in certain months where the profit has dropped slightly - possibly in a bid to counter the drop in sales -, mainly the months of July and August. If there are features in this dataset providing better insights than the marketing features, it could be 'total_expenses'.
Marketing spend per month per annum.
• As the features in the violin chart above look eerily similar for both years, I will check the actual marketing spend per month for both years. The result really is an exact monthly figure for both 2010 and 2011:
5 - Type.
The type distribution is a 58.9% split on Regular, and 41.1% on Decaf. The decaf products have 'decaf' in the name and there is no outstanding monthly data worth investigating, so I won't be including this column for analysis.
6 - Products.
The top ten best-selling products.
• These are the Columbian coffee blend (22.4%)
• The Lemon Tea (12%)
• The Decaf Espresso (11.9%)
• And the Darjeeling (11.7%)
Profit per product type.
• Coffee is the leading product type with a profit of 74.683 dollars, follwed by Espresso with a profit of 68,620 dollars.
• Of the Teas, Herbal tea is the leading product type with a profit of 63,254 dollars, followed by ordinary tea with a profit of 52,986 dollars:
Sum profit per product.
• Here we see the Columbian coffee profit sitting at almost 2x the profit of everything else. Oddly, as we saw in the violin plots for each state, the Green Tea is responsible for 0 profit in positive figures.
Average margin per product.
• The regular espresso, which is the best-selling product on a month-by-month basis, has the highest average margin. These items are all in the top ten best-selling products so the higher margins will equate to a better product all-round, or it will be a simple matter of supply and demand plus / besides advertising.
Product per state.
• The density heatmap below returns a white cell for the state of Nevada in the Green Tea row, representing the -10,980 dollar loss on this product. The next most common state for Green Tea losses is Utah, with a grand total loss of -207 dollars.
• In the Caffe Mocha row we see the lightest hue in the New York cell, representing a total loss of -6,354 dollars.
• The majority of the Mint Tea's losses are also coming from the New York store with a sum total loss of -4,780 dollars.
• The New York store is the fourth most profitable store responsible for 7.74% of the company's total profit, but also responsible for what appears to be the majority of the losses.
The best sellers per state.
• We see the Columbian Coffee slaying in Massachusetts, New York & California.
• The Decaf Espresso is doing well in California, Illinois, Oklahoma and Oregon.
• The Lemon Tea is doing well in NY, California, Iowa and Nevada.
• Darjeeling is doing very well in Nevada and Iowa.
• Chamomile is a top seller in Iowa as well as pulling in good amounts in California, Washington, Illinois and Colorado.
• Earl Grey is another mild surprise, doing very well in Iowa, NY, Ohio, California, Nevada and Oregon.
• Caffe Latte sees its highest figures in California, Oklahoma, Texas and Utah.
• Amaretto is popular in Colorado, Utah and New Hampshire.
• Regular Espresso sells well in NY and Massachusetts.
Best-selling products per state group.
• The Columbian coffee blend is the most common product in MA, CA, NY, TX and CT.
• Chamomile is the most common product in IA and WA.
• Darjeeling is the most common product in NV.
• Caffe Mocha is the most common product in IL.
• Amaretto is the most common product in CO.
Best-selling product (average) per month.
• The regular Espresso and the Columbian are the most common best-sellers each month.
• Columbian sales are the most prominent in the months of August, July and May, the best averages for rest of the year are all Regular Espresso.
Best-selling product (average) per day.
• Here we see Columbian as the best-selling product on a Sunday, with Regular Espresso being the preferred product over the remaining weekdays, possibly by the workers in a hurry to get their morning fix:
7 - Total expenses
Product and total expenses vs. sales.
• Here we see what could be some partial evidence of the Green Tea & Caffe Mocha's losses. Although not a definitive reason, I do believe it will play a part for these outliers; the total expenses required for the product to sell for these two products is relatively high, while the sales figures are relatively low. The Caffe Mocha is only the 7th most popular product and the Green Tea is not popular at all and both of these products' expenses are relatively high in comparison to the other products, so perhaps the losses stemmed purely from an attempt to push a backlog of unpopular stock. Merely a thought.
Total expenses per month.
Month-on-month change in expenses.
• There are visible patterns here that reflect the profit per month pct. change witnessed earlier. There are slight hikes in total expenses around or in the same month where we saw good pct month-on-month gains in profit, meaning the strategy involved in pushing certain products likely paid off.
Total expenses vs. sales by month.
• These scatterplots are representative of both 2010 and 2011 inclusive so the data for some months won't be 100% accurate if looking at these plots with a specific year in mind, the plan is only to get a loose idea of where the expenses are going in relation to the month-on-month profit and expenses.
• The difference in certain products' expenses is visible in December over January which will be sharing some values from 2010 no doubt. I thought there may be some instances here where some products ordinarily associated with winter conditions would have extra expense spared in that season's months, and the same with some light herbal teas in the summer months etc., but there doesn't appear to be a lot of product-specific, seasonal marketing here, if any.
Total expenses where monthly average sales made slight gains (May - June).
The Amaretto, Caffe Latte, Earl Grey and Green Tea didn't require much expense in one of the outstanding months (June), but other light products such as Lemon Tea, Mint Tea and Darjeeling did see some extra expense spared along with the coffees.
Total expenses where average monthly sales made large gains (Nov - Dec).
In the best month for total expense spent, December, we still see Amaretto chilling without making too much noise and the Caffe Latte, Green Tea & Earl Grey all see a reasonable amount spent compared to November (quite a lot more when compared to May).
Profit vs. sum total expense per product.
• The sum cost of selling a product rarely dips below 10K. On the three of occasions it does drop below 10K, the Mint Tea, Amaretto and the Regular Espresso are the products in question.
• The Columbian coffee sees the largest sum total expense at 31.37K, then the Caffe Mocha at 30.37K, and the Lemon Tea at 27.74K.
Average profit / expense by product.
• It seems the teas have the best balance of profit / expenses. The Regular Espresso is a good earner overall profit-wise but it has the highest total expenses of all of the products, like the Columbian. The Caffe Mocha possibly has one of the worst expense vs. profit rates with a total expenses figure of 63 and a profit of 32. Lemon Tea, Earl Grey and Chamomile all have profit values over 50 and total expenses under 60 so these really are worthy of note:
Total expenses per product type where profit is less than 0.
And a tad more evidence of the herbal teas doing better than some coffees in certain instances; less extra spending on these products where the figures are at a loss.
Total expenses by market size.
The major market is the better of the two in this respect, it sees a sum of 102.68K in total expenses where the small market sees a sum of 126.98K:
8 - Monetary loss.
Projected loss.
Overall loss by state.
As figured earlier, it is close between New York and Nevada. As we have already seen, NY does have better profits than most, but also heads-up the chart for max losses by a slight margin with a loss of 33.74% compared to Nevada's 33.44%:
Loss per product type.
This equates to a total loss of -14,688 dollars for coffee products and a total loss of -13,304 dollars for tea products.
Loss per product.
• The product responsible for the largest loss is the Green Tea at -11,291 dollars. This is followed by the Caffe Mocha at a loss of 7,056 dollars, then the Mint Tea at a loss of -5,506 dollars.
Loss by state per market.
New York's losses far better visualised here, as well as all of the Western market's losses in the California & Nevada stores.
COGS per state.
• The highest COGS amount per state starts with California, then we see NY, Nevada, Illinois and Iowa to make up the five most common states by sum.
9 - Clustering sales by area code.
The area codes in each cluster.
(Un-comment for further information):
The top five states residing in each cluster.
• Platinum states: Utah, Oregon, Nevada, Colorado, Iowa.
• Gold states: Washington, Louisiana, Wisconsin, Colorado, Oklahoma.
• Silver states: New York, Connecticut, Missouri, Wisconsin, Ohio.
• Bronze states: California, Massachusetts, New York, Connecticut, Illinois.
• Lead states: Florida, Ohio, Texas, California, Missouri.
• Tin states: Florida, California, Texas, New York, Massachusetts.
Negative profit per cluster.
Before anything I think it would be a good idea to check the negatives in each cluster. Each cluster has been ranked by sales which have zero negative values, although output will vary when profit is brought into the mix.
• Platinum rated areas have 17 negative profit values.
• Gold rated areas have 14 negative profit values.
• Silver rated areas have 22 negative profit values.
• Bronze rated areas have 13 negative profit values.
• Lead rated areas have 8 negative profit values.
• Tin rated areas have 14 negative profit values.
A closer look.
We can see here that the percentage of negative profit for the Lead clusters is around 14% greater than that of the Tin clusters, so we will see some differences in profit when those metrics are analysed. In fact, some Bronze profit values could outweigh some Silver values for this reason.
Total profit per cluster.
Total profit per cluster by month.
• There are positive fluctuations in the Platinum areas in December, August, June March and May.
• For sales in the Gold area codes, we see April, July, September and October.
• For the Silver clusters, the best months are February, October, November and December.
• For the Bronze clusters, the best months are August, November, June and October.
• For the Lead clusters, the best months are April, July, August, September, January and May.
• For the Tin clusters, the best months are December, June and May.
Clusters by market size.
• The Platinum area codes: 1566 small markets and 350 major markets.
• The Gold area codes: 609 small markets and 138 major markets.
• The Silver area codes: 275 small markets and 321 major markets.
• The Bronze area codes: 68 small markets and 291 major markets.
• The Lead area codes: 26small markets and 204 major markets.
• The Tin area codes: 0 small markets and 400 major markets.
Favourite product per area code cluster and market.
Grouping by market. The preferred products in these markets per area code cluster are as follows:
• Platinum: Caffe Latte in the West market, Amaretto in the Central and West markets.
• Gold: Caffe Mocha and Amaretto in the Central market, Caffe Latte in the South and West markets.
• Silver: Caffe Mocha in the East market, Amaretto in the Central market, Caffe Latte in the South market.
• Bronze: Caffe Mocha in the East market, Caffe Latte in the West and South markets.
• Lead: Caffe Mocha in the East market, Amaretto in the Central market, Caffe Latte in the South market.
• Tin: Caffe Mocha in the East market, Caffe Latte in the South market, Amaretto in the West market.
Monthly profit by state and cluster.
• Although California and NY have the volume, I will be looking at the (purple and blue) Platinum and Gold sales through the summer months first as these are the big (and easy) money features.
• There aren't too many standout states for these sale types; we see Illinois, Colorado, Iowa and Oklahoma making good ground for these area codes across the whole year. Illinois definitely has the best balance and sum of Platinum, Gold and Silver values across year, with Apr, Jul, Aug, Sep & Dec as its best months and its majority Platinum & Gold values make up June's sales.
• The distribution of area codes within the other states are plain to see here and it would be worthwhile making note of a few of them for the model where profit per product per state is concerned; the states with heavy Tin, Lead and Bronze clusters won't be *as* important unless there also exists a good selection of Platinum & Gold along with them, or they exhibit good volume all-round.
Monthly profit per product by state.
A good insight into the full story behind each state's income by product split. In the final model we should see some of the states in the graph below making the majority of their profits from coffees (espressos especially) as well as a mix of the Platinum and Gold codes from the graph above. Texas could be interesting... product-wise it bears all the hallmarks of what would pass for a profitable store, but those profits were made up from lots of Tin, Lead and Bronze sales as opposed to fewer, larger lump sums.
Dropping columns, one-hot encoding the categorical columns.
The OHE will result in column names such as 'onehotencoder__column_name' for the encoded columns or 'remainder__column_name' for whatever integer columns were passed over, so the dunderscore and everything before it will need to be deleted from the column names:
Setting target variable and X.
Converting X features to floats.
Principle Component Regression with LightGBM.
Due to the presence of 'inf' values discovered in the VIF analysis, PCA with 5-fold cross validation will be the order of the day.
Some PCA results explained, picked in no particular order.
• Illinois - Illinois has no loss, accounts for the largest average cut of profit (11.8%), the second-largest sum of the company's profit, spends less selling its products than most and contains a lot of Platinum area codes.
• Major market - The major market's resulting sum profit figure of +66% above that of the small market's profit figure makes this an obvious choice. Of the two market sizes, the major market is responsible for much less 'loss' as well as less in the way of total expenses / leans more toward selling cost-effective products.
• Eastern market - This area includes the majority favourite of Silver, Bronze, Lead and Tin products, has the second-highest mean profit and almost 20% less loss than the West market so its negative values here don't assume a great deal of importance.
• Herbal Tea (and Tea) - Mostly low expense and bringing in good profits in Platinum and Gold states such as Iowa, Cali, NY, Nevada, Washington & Oregon. Also in the Shap plot we see the negative values blessing the model with more importance than the positive, likely due to the Green Tea's losses as witnessed earlier in the EDA.
• Colorado - There are minor losses here and those losses are in the Herbal Tea area, belonging specifically to the Lemon Tea. Losses aside, this state consists completely of Gold and Platinum area codes with Amaretto & Chamomile as the best-selling products.
• California - The third-highest average earning store, the store with the largest sum profit of all stores, but is suffering a 6K+ loss hence the bulk of the importance in red straddling the zero line, and bears no Platinum area codes. California's profit dips below zero during one of the Summer months and its supposedly highest-earning clusters only do the best trade in the quieter (winter) months. It is the top state for Bronze, Lead and Tin clusters resulting in 97 area codes and makes up most of its profit from the 'high margin' Columbian and Espresso products.
• Oregon - Most of Oregon's profits come from Tea, Espresso and Herbal tea, all being Platinum sales. The important negative values for Oregon in the Shap chart below are due to Amaretto sales, some other states' most profitable product.
• Chamomile - Very little in the way of loss, is a high-profit, mid-expense product (the 4th largest expense figure of all products), appears to have been pushed in the summer months with a good month-on-month expense difference between May and June, along with being a highly popular product in Washington, California, Illinois, Colorado and Iowa.
• Caffe Latte - A medium profit, low expense product popular in California & Oklahoma among many others, spread across every market in every cluster.