Introduction.
Subscribers are crucial for a lot of businesses as they represent a loyal customer base that can provide consistent revenue streams. They are more likely to engage with the brand, make repeat purchases and provide valuable feedback, and their regular orders help with sales forecasting and inventory planning. Additionally, they can be targeted for marketing campaigns, promotions, and new product launches, making them an essential part of a company's growth strategy.
Personal note: This was a day-long job to outline some methods and techniques I apply to in-depth analysis projects. Projects I host publicly normally represent a simplistic, clean "data analysis" style of work which I undertake for fun or to test out new skills I've learned that day, so I thought it would be a good idea to put my more serious data science-y hat on for a change. I couldn't post any sensitive company data for this purpose, so I have used a regular, publicly available business dataset; most of those aren't big, good or detailed enough for my liking... So to the trained eye, you are correct; what you see here is definitely a bit of overkill for this dataset.
The data.
And data cleaning if required.
The customer ID values are all distinct, meaning that particular column carries no information beyond the row index and can be dropped.
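As a minimal sketch of that check, assuming a pandas dataframe with a hypothetical `customer_id` column (the toy data and column names below are illustrative, not the notebook's own):

```python
import pandas as pd

# Hypothetical toy frame; the real dataset has 3900 rows and more columns.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [25, 43, 43, 61],
})

# An identifier column has exactly one distinct value per row.
id_is_unique = df["customer_id"].is_unique

# Only drop the column once the uniqueness assumption has been confirmed.
if id_is_unique:
    df = df.drop(columns=["customer_id"])

print(id_is_unique, list(df.columns))
```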
And apologies to my friends across the pond, but on a scale of 1 - even... I just can't:
The dataframe consists of 3900 rows and 19 columns. All data is present, so no imputation or further cleaning was needed. I have added a 'geo' column and coordinates for the visual mapping of customer location data.
Statistical report & table information.
1: Age. A max age of 70, a min age of 18, an average age of 43.1 years, a std of 15.7, an IQR of 28.0 and a kurtosis of -1.26. The negative kurtosis points to a flat, near-uniform spread of ages, and the average sits just under the midpoint of the range (44), so any lean toward the younger customers is slight. Numerical associations with this column exist in the review rating and previous purchases columns. Strong categorical associations worthy of note are in the colour, season, shipping type, payment method & purchase frequency columns. Fashion trends ahoy!
2: Gender. 68% Male and 32% Female. These features provide the most information on discounts, promo codes, and subscriptions. Nice. There are correlations with purchase amounts and review ratings here.
3: Item purchased. 25 distinct products. "Other" is the most purchased item at 25% followed by Socks at 8%. This column provides information on location and gender, and bears an obvious correlation to review rating.
4: Category. 4 distinct categories: Clothing (46%), accessories (30%), footwear (17%), outerwear (7%).
5: Purchase amount (in USD). 79 distinct values. A max of $100, a min of $20, an avg of $59.2, an IQR of 42.8 and a kurtosis of -1.30, which again indicates a flat spread, with the average resting just below the midpoint of the range ($60). The two most frequent values are $36 (2.8%) and $22 (2.6%).
6: Location. 50 distinct. The top hit is "Other" at 52%, followed by Louisiana at 4%. This column provides information on colour, item purchased and payment method. There are high correlations with the age, previous purchases and review rating columns, but whether that is purely coincidental or there are clusters of locations responsible for more reviews / purchases will be figured out later.
7: Item size. 4 distinct sizes: S, M, L, XL. M being the most frequent at 43%, followed by L at 29%.
8: Item colour. 25 distinct. "Other" again has the top value count at 24%. Black follows with 7%, then oddly, Peach (?!) at 6%. There are strong correlations to the age column here which explains a bit. Somebody asked earlier whether this is a real or synthetic dataset, and I would say that the presence of peach coloured clothing alongside younger customers could well suggest that this is real customer data.
9: Season. 4 distinct in total. The usual.
10: Review rating. 26 distinct values ranging from 2.5 - 5, with an average of 3.74, a Q1 of 3.2, an IQR of 1.10, and a std of 0.669. So, overall not too bad but still leaning a little into the lower rating territory.
11: Subscription status. 2 distinct: Yes (27%) and No (73%). Those are rookie numbers.... we need to bump those numbers up.
12: Payment method. 6 distinct: Credit card (18%), debit card (17%), PayPal (17%), Venmo (16%), cash (16%) and bank transfer (15%). These features primarily give information on location and colour, and possess strong correlations with review rating & age.
13: Shipping type. 6 distinct: Free shipping (19%), express (17%), standard (17%), store pickup (17%), next day air (15%) and 2-day shipping (14%).
14: Discount applied. 2 distinct: Yes (43%) and No (57%).
15: Promo code used. 2 distinct: Yes (43%) and No (57%).
16: Previous purchases. 50 distinct values, which is pretty good news for this store. An IQR of 23.0 and a skew of -0.023. Strong categorical associations with location and colour. The two most frequent values are 24 (3.6%) and 31 (3.6%).
17: Preferred payment method. The same 6 distinct as the payment method column but in a different order: Debit card (21%), bank transfer (17%), cash (17%), credit card (15%), Venmo (15%) and PayPal (15%).
18: Purchase frequency. 7 distinct values: Every 3 months (17%), quarterly (17%), bi-weekly (16%), annually (13%), monthly (13%), weekly (12%) and fortnightly (12%). Note that "every 3 months" and "quarterly" describe the same interval under two labels. These features give the most information on location, item and colour, and bear strong correlations with review rating and age.
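The summary statistics quoted in the report can be reproduced by hand. The sketch below is a plain NumPy version, not the profiling tool used above, and it runs on a made-up, perfectly flat age column (every age from 18 to 70 once). It shows that a flat distribution has an excess kurtosis of about -1.2, close to the -1.26 reported for age. Library estimators may apply small-sample bias corrections, so their figures can differ slightly.

```python
import numpy as np

def describe(values):
    """Return mean, sample std, IQR, skewness and excess kurtosis."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    centred = x - mean
    # Central moments (population form, no bias correction).
    m2 = np.mean(centred ** 2)
    m3 = np.mean(centred ** 3)
    m4 = np.mean(centred ** 4)
    q1, q3 = np.percentile(x, [25, 75])
    return {
        "mean": mean,
        "std": x.std(ddof=1),
        "iqr": q3 - q1,
        "skew": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis: 0 for a normal curve
    }

# Hypothetical data: one customer at every age from 18 to 70.
flat_ages = np.arange(18, 71)
stats = describe(flat_ages)

for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Skewness is the statistic that measures a lean to one side; kurtosis only describes how heavy the tails are relative to a normal curve.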
Analysis.
Customer counts per location.
The East Coast, East and South East regions hold the majority of the purchasing power. Let's take a closer look at those.
The states from the East and South East regions with the most purchases, and each state's share of the purchases made within this group:
• 1. New York: 8.17%
• 2. Delaware: 8.08%
• 3. Maryland: 8.08%
• 4. Georgia: 7.42%
• 5. North Carolina: 7.32%
• 6. Connecticut: 7.32%
• 7. Maine: 7.23%
• 8. Virginia: 7.23%
• 9. South Carolina: 7.14%
• 10. Massachusetts: 6.76%
• 11. New Hampshire: 6.67%
• 12. Florida: 6.38%
• 13. New Jersey: 6.29%
• 14. Rhode Island: 5.92%
Subscriptions by gender.
A breakdown of subscriptions by gender reveals zero female subscribers.
Subscription percentage per purchase, age and location.
The subscribers' locations and items with a 100% subscription rate. Once again, heavy on the Eastern side of the country.
Items sold per location.
Arizona and Kansas are lacking a little compared to the other locations, but overall there's a healthy number of items in each range being shipped.
Descriptive statistics.
The dataset has no date column, so in place of a true time series analysis I'll compare purchase amounts across the four seasons to identify any seasonal patterns in the purchase data.
The ANOVA test results show a p-value of around 0.0106, which is less than the typical significance level of 0.05. This indicates that there are significant differences in purchase amounts across different seasons, primarily Winter and Autumn.
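A one-way ANOVA of this kind can be sketched with `scipy.stats.f_oneway`. The purchase amounts below are synthetic and the group sizes and means are assumptions chosen for illustration, so the resulting p-value will not match the 0.0106 reported above.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Synthetic purchase amounts per season (illustrative only, not the real data).
amounts_by_season = {
    "Spring": rng.normal(59, 23, 300),
    "Summer": rng.normal(58, 23, 300),
    "Autumn": rng.normal(62, 23, 300),
    "Winter": rng.normal(60, 23, 300),
}

# One-way ANOVA: the null hypothesis is that all group means are equal.
f_stat, p_value = f_oneway(*amounts_by_season.values())
significant = p_value < 0.05

print(f"F = {f_stat:.3f}, p = {p_value:.4f}, significant: {significant}")
```

A significant result only says that at least one season differs. Identifying which seasons differ needs a follow-up comparison of the group means or a post-hoc test.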
Now a Chi-Square test to examine the relationship between some categorical variables, starting with the relationship between "gender" and "subscription_status":
Off to a good start. The test results show a chi-square statistic of 676.79 and a vanishingly small p-value (printed in scientific notation with a mantissa of roughly 3.3), indicating a significant relationship between "gender" and "subscription_status".
"discount_applied"'s influence on "subscription_status":
The test results here show a chi-square statistic of 1908.92 and a p-value that rounds to 0.0. This indicates a significant relationship between "discount_applied" and "subscription_status".
Two very good features to bear in mind for the end model of this project.
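For reference, a chi-square test of independence on a contingency table can be sketched with `scipy.stats.chi2_contingency`. The counts below are hypothetical; they only mimic the pattern seen earlier, with no female subscribers at all.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table (not the real counts).
# Rows: gender, columns: subscription status (No, Yes).
observed = np.array([
    [120, 0],    # Female
    [180, 100],  # Male
])

# Returns the statistic, the p-value, the degrees of freedom and the
# counts expected if the two variables were independent.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}, dof = {dof}")
print(expected)
```

For 2x2 tables `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the statistic is slightly smaller than the uncorrected textbook value.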
Combining contingency tables.
Male customers and applied discounts bearing relevance to subscription in orange:
OLS.
Next, time to perform a regression analysis to predict purchase amount based on age and previous purchases.
The results show that the R-squared value is 0.000, indicating that the model explains almost none of the variability in "purchase_amount_(USD)".
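To make the R-squared figure concrete, here is a minimal least-squares sketch in plain NumPy. It is not the notebook's own regression code, and the data is synthetic: the purchase amount is generated independently of both predictors, which is the situation an R-squared of roughly zero describes.

```python
import numpy as np

def ols_r_squared(X, y):
    """Fit y = b0 + X @ b by least squares and return (coefficients, R^2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = residuals @ residuals
    ss_tot = ((y - y.mean()) ** 2).sum()
    return coef, 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
n = 2000

# Synthetic stand-ins for the real columns (assumed ranges).
age = rng.integers(18, 71, n)
previous_purchases = rng.integers(1, 51, n)
purchase_amount = rng.uniform(20, 100, n)  # unrelated to both predictors

coef, r2 = ols_r_squared(np.column_stack([age, previous_purchases]), purchase_amount)
print(f"R-squared: {r2:.4f}")
```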
Further statistical test results.
(Correlation and Chi-square)
1. Correlation Between Age and Purchase Amount:
• Correlation Coefficient: -0.0104
• Conclusion: There is a very weak negative correlation between age and purchase amount.
2. Chi-Square Test for Payment Method and Subscription Status:
• Chi-Square Statistic: 2.6056
• P-value: 0.7605
• Degrees of Freedom: 5
• Conclusion: There is no significant relationship between payment method and subscription status.
3. Chi-Square Test for Age and Season:
• Chi-Square Statistic: 157.26
• P-value: 0.4567
• Degrees of Freedom: 156
• Conclusion: There is no significant relationship between age and season.
4. Chi-Square Test for Colour and Season:
• Chi-Square Statistic: 64.65
• P-value: 0.7186
• Degrees of Freedom: 72
• Conclusion: There is no significant relationship between colour and season.
A separate regression analysis returned an R-squared value of 0.001, indicating that the model explains almost none of the variability in "review_rating". The p-values for "age", "previous_purchases", and "purchase_amount_(USD)" were all greater than 0.05, indicating that they are not significant predictors of "review_rating" either.
ANOVA test results returned a p-value of 0.0973, indicating that there were no significant differences in purchase amounts across different locations.
And finally, non-parametric testing (Mann-Whitney U test) results: there was no significant difference in review ratings, nor in purchase amounts, between customers who have a subscription and those who do not. No significant difference in purchase amounts between male and female customers was uncovered either, and there is no significant difference in review ratings between customers who used a promo code and those who did not. One or two of these results could still differ once more associated variables are introduced in the final model, as far as the subscription status features go.
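The Mann-Whitney U test compares two independent groups by rank, without assuming normally distributed values. A minimal sketch with `scipy.stats.mannwhitneyu` follows; the ratings are synthetic (drawn from the same 26 rating levels for both groups) and the group sizes are assumptions.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2)

# Synthetic review ratings on the 2.5 to 5.0 scale in steps of 0.1.
levels = np.round(np.arange(2.5, 5.01, 0.1), 1)
subscribers = rng.choice(levels, size=270)
non_subscribers = rng.choice(levels, size=730)

# Two-sided test: is one group's rating distribution shifted against the other's?
u_stat, p_value = mannwhitneyu(subscribers, non_subscribers, alternative="two-sided")

print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```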
Clustering.
Hokay, let's perform a clustering analysis to identify groups of similar observations. We'll use K-Means clustering on the numerical features.
Elbow plot:
Based on the plot, cracking on with the optimal number of clusters, which is three.
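The elbow method can be sketched as follows. This is a bare-bones Lloyd's algorithm in NumPy with a simple deterministic farthest-first initialisation, written for illustration only; in practice a library implementation such as scikit-learn's `KMeans` would be the usual choice. The data is three synthetic blobs, not the customer data.

```python
import numpy as np

def farthest_first_init(X, k):
    """Start at the first row, then repeatedly add the point farthest from all chosen centroids."""
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    return np.array(centroids)

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    centroids = farthest_first_init(X, k)
    for _ in range(n_iter):
        # Squared distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = dists.min(axis=1).sum()  # within-cluster sum of squares
    return labels, centroids, inertia

# Three synthetic, well-separated blobs standing in for the numerical features.
rng = np.random.default_rng(42)
blobs = [rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 5])]
X = np.vstack(blobs)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise so no feature dominates

inertias = {k: kmeans(X, k)[2] for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The "elbow" is the value of k after which the inertia stops dropping sharply; for this synthetic data that happens at three clusters.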
Cluster means:
Cluster information:
Cluster 0:
• Middle-aged customers (average age: 44.81)
• Lower purchase amounts ($37.01)
• Slightly lower review ratings (3.72)
Cluster 1:
• Older customers (average age: 57.33)
• Highest purchase amounts ($79.49)
• Highest review ratings (3.78)
Cluster 2:
• Younger customers (average age: 30)
• Higher purchase amounts ($76.62)
• Slightly higher review ratings (3.77)
Next, visualising these clusters using a scatter plot of "age" vs. "purchase_amount_(USD)" coloured by their corresponding cluster labels:
PCA results:
Mapping the column labels back to the category codes:
The category codes have been mapped back to their original names in the cluster summary, so let's see if anything of any significance occurs here...
Cluster 0:
• Count: 1737 customers
• Average Age: 44.81 years
• Gender: Predominantly Male
• Most Purchased Item: Jewellery
• Category: Clothing
• Average Purchase Amount: $37.01
• Location: Maryland
• Size: Medium
• Colour: Silver
• Subscription Status: No
• Payment Method: Debit Card
• Shipping Type: Standard
• Discount Applied: No
• Promo Code Used: No
• Previous Purchases: 25.18
• Preferred Payment Method: PayPal
• Frequency of Purchases: Monthly
• Latitude: 39.446158
• Longitude: -92.954077
Cluster 1:
• Count: 1065 customers
• Average Age: 57.33 years
• Gender: Predominantly Male
• Most Purchased Item: Shirt
• Category: Clothing
• Average Purchase Amount: $79.49
• Location: Mississippi
• Size: Medium
• Colour: Green
• Subscription Status: No
• Payment Method: Credit Card
• Shipping Type: Free Shipping
• Discount Applied: No
• Promo Code Used: No
• Previous Purchases: 25.92
• Preferred Payment Method: PayPal
• Frequency of Purchases: Every 3 Months
• Latitude: 39.536407
• Longitude: -94.002800
Cluster 2:
• Count: 1098 customers
• Average Age: 30.03 years
• Gender: Predominantly Male
• Most Purchased Item: Dress
• Category: Clothing
• Average Purchase Amount: $76.62
• Location: Montana
• Size: Medium
• Colour: Violet
• Subscription Status: No
• Payment Method: Venmo
• Shipping Type: Express
• Discount Applied: No
• Promo Code Used: No
• Previous Purchases: 25.08
• Preferred Payment Method: Credit Card
• Frequency of Purchases: Quarterly
• Latitude: 39.668399
• Longitude: -94.228815
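Mapping integer category codes back to their labels, as done for the cluster summaries above, can be sketched with pandas categoricals. The item names below are a hypothetical toy column, and the notebook's own encoding step may have differed.

```python
import pandas as pd

# Hypothetical categorical column.
items = pd.Series(["Dress", "Shirt", "Socks", "Shirt"], dtype="category")

# Integer codes: this is what a clustering or tree model would be given.
codes = items.cat.codes

# Lookup table from code back to label (categories are sorted alphabetically here).
code_to_label = dict(enumerate(items.cat.categories))

# Translate the codes back into readable names for reporting.
restored = codes.map(code_to_label)

print(code_to_label)
print(restored.tolist())
```

One caveat: an average of category codes has no meaning, so a per-cluster summary of a categorical column should report the most frequent code (the mode) before mapping it back to a label.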
Distribution of items purchased per cluster.
The three most common items purchased in each cluster:
All items purchased across all present clusters:
Next, let's visualise the distribution of the "category" features throughout each cluster, in order:
• 1: Clothing
• 2: Accessories
• 3: Footwear
• 4: Outerwear
Important cluster information / business sense.
The most important business information for making profit from the clusters is:
1. Cluster 1:
• Highest average purchase amount: $79.49
• Highest total purchase amount: $84,656
• High review rating: 3.78
• Moderate subscription rate: 26.95%
• Moderate discount and promo code usage: 42.54%
2. Cluster 2:
• High average purchase amount: $76.62
• High total purchase amount: $84,133
• High review rating: 3.77
• Moderate subscription rate: 26.14%
• Moderate discount and promo code usage: 42.53%
3. Cluster 0:
• Lowest average purchase amount: $37.01
• Lowest total purchase amount: $64,292
• Moderate review rating: 3.72
• Highest subscription rate: 27.58%
• Highest discount and promo code usage: 43.58%
Key insights:
• Focus marketing efforts on Cluster 1 and Cluster 2 as they have the highest average and total purchase amounts.
• Improve subscription rates in Cluster 1 and Cluster 2 to increase recurring revenue.
• Maintain or improve review ratings to ensure customer satisfaction and loyalty.
• Monitor and optimise discount and promo code strategies to balance between attracting customers and maintaining profit margins.
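A per-cluster summary like the one above can be produced with a single pandas `groupby`. The frame below is a tiny made-up example and the column names are assumptions; it only shows the shape of the aggregation.

```python
import pandas as pd

# Hypothetical rows: one per customer, with a cluster label already assigned.
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "purchase_amount": [30.0, 44.0, 80.0, 78.0, 75.0, 79.0],
    "review_rating": [3.7, 3.8, 3.9, 3.7, 3.6, 3.8],
    "subscribed": [1, 0, 0, 1, 0, 0],  # 1 = Yes, 0 = No
})

summary = df.groupby("cluster").agg(
    customers=("purchase_amount", "count"),
    avg_purchase=("purchase_amount", "mean"),
    total_purchase=("purchase_amount", "sum"),
    avg_rating=("review_rating", "mean"),
    subscription_rate=("subscribed", "mean"),
)
# The mean of a 0/1 column is a proportion; convert it to a percentage.
summary["subscription_rate"] *= 100

print(summary)
```

The total purchase amount per cluster is simply the customer count multiplied by the average purchase amount, which is a quick way to sanity-check the reported figures.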
The implications for potential marketing strategies based on current findings:
1. Strategies shouldn't rely on age and previous purchases to predict purchase amounts.
2. Instead, more focus should be on other factors such as customer preferences, product features, and promotional activities that have a higher probability of influencing purchase decisions.
3. Consider segmenting customers based on other meaningful attributes such as purchase frequency, review ratings, and preferred payment methods.
4. Utilise clustering analysis to identify high-value customer segments and tailor marketing campaigns to their preferences and behaviours after more data has been collected.
On that note:
5. Explore additional data sources or collection methods that would help predict purchase amounts better and enhance the effectiveness of marketing strategies.
Product analysis.
Results:
Cluster 0: This cluster has the highest counts for most items. Popular items include:
• Coat: 83 purchases
• Socks: 78 purchases
• Sweater: 78 purchases
• Blouse: 76 purchases
• Shorts: 74 purchases
• Handbag: 73 purchases
• Hoodie: 72 purchases
• Skirt: 72 purchases
Cluster 1: This cluster has moderate counts for most items. Popular items include:
• Dress: 51 purchases
• Socks: 48 purchases
• Sunglasses: 48 purchases
• Scarf: 47 purchases
• T-shirt: 46 purchases
• Handbag: 45 purchases
• Blouse: 44 purchases
• Boots: 40 purchases
Cluster 2: This cluster has the lowest counts for most items. Popular items include:
• Shirt: 61 purchases
• Blouse: 51 purchases
• Boots: 48 purchases
• Dress: 47 purchases
• Sunglasses: 46 purchases
• Scarf: 45 purchases
• Sneakers: 42 purchases
• Coat: 42 purchases
Category analysis.
Cluster 0:
• Clothing: 765 purchases
• Accessories: 548 purchases
• Footwear: 259 purchases
• Outerwear: 165 purchases
Cluster 1:
• Clothing: 480 purchases
• Accessories: 335 purchases
• Footwear: 171 purchases
• Outerwear: 79 purchases
Cluster 2:
• Clothing: 492 purchases
• Accessories: 357 purchases
• Footwear: 169 purchases
• Outerwear: 80 purchases
Conclusion.
The analysis of product, category, and seasonal trends provides valuable insights into customer purchasing behaviour. The clustering analysis identified three distinct customer segments with varying preferences for products and categories. The seasonal trends analysis highlighted significant differences in purchase amounts across different seasons, with higher spending observed during the winter season. These insights can be leveraged to tailor marketing strategies, optimise inventory management, and enhance customer engagement by targeting specific customer segments, especially aligning promotional activities with seasonal trends.
Geographical insights.
1. Highest Average Purchase Amounts:
• Alaska: $67.60
• Arizona: $66.55
• Pennsylvania: $66.57
• Utah: $62.58
• West Virginia: $63.88
2. Lowest Average Purchase Amounts:
• Connecticut: $54.18
• Kansas: $54.56
• Delaware: $55.33
• Wisconsin: $55.95
• Vermont: $57.18
3. Highest Purchase Amount Variability:
• Arkansas: Std Dev = 26.45
• Maryland: Std Dev = 26.15
• South Dakota: Std Dev = 25.11
• Utah: Std Dev = 25.04
• Rhode Island: Std Dev = 25.27
4. Lowest Purchase Amount Variability:
• Florida: Std Dev = 21.96
• Louisiana: Std Dev = 21.79
• Pennsylvania: Std Dev = 21.77
• Hawaii: Std Dev = 22.61
• New Mexico: Std Dev = 22.26
Conclusion.
The geographical analysis reveals visible variations in purchasing behaviour across different locations. States like Alaska, Arizona, and Pennsylvania have higher average purchase figures, while states like Connecticut, Kansas, and Delaware are on the lower end of the purchase spectrum. Additionally, states like Arkansas and Maryland exhibit higher variability in purchase figures, indicating a relatively diverse range of spending behaviours. This information can be leveraged to tailor marketing strategies, optimise inventory management, and enhance customer engagement by targeting specific locations with customised promotions and / or product offerings.
Promotions and discount analysis.
The promo discount summary analysis provides insights into the effectiveness of promo codes and discounts in driving sales and customer satisfaction. Here is a quick summary of the findings:
1. Promo Code Usage:
• A total of 1677 transactions involved the use of promo codes.
• Promo codes were always associated with discounts, indicating that every time a promo code was used, a discount was applied.
2. Discount Application:
• A total of 2223 transactions did not involve the use of promo codes or discounts.
• There were no instances where a discount was applied without a promo code.
3. Effectiveness of Promo Codes and Discounts:
• The data suggests that promo codes are an effective tool for applying discounts, as every instance of a promo code usage resulted in a discount.
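The overlap between the two columns can be checked directly with a cross-tabulation. The sketch below uses a five-row toy frame with assumed column names; on the real data the same table would hold 1677 in the Yes/Yes cell and 2223 in the No/No cell, which together account for all 3900 rows.

```python
import pandas as pd

# Hypothetical rows in which the two flags always agree, as reported above.
df = pd.DataFrame({
    "discount_applied": ["Yes", "Yes", "No", "No", "No"],
    "promo_code_used":  ["Yes", "Yes", "No", "No", "No"],
})

# Rows: promo code used, columns: discount applied.
table = pd.crosstab(df["promo_code_used"], df["discount_applied"])

# True when the two columns carry exactly the same information.
columns_identical = bool((df["discount_applied"] == df["promo_code_used"]).all())

print(table)
print(columns_identical)
```

When two columns are identical, a model gains nothing from having both, which is worth bearing in mind for any multicollinearity check.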
Conclusion.
The promo discount analysis shows that promo codes are the mechanism through which discounts reach transactions: every instance of a promo code resulted in a discount, and no discount appeared without one. Because the two columns move in lockstep, this dataset alone cannot separate the effect of a promo code from the effect of a discount on sales or customer satisfaction. These insights can still be leveraged to design targeted promotional campaigns, optimise discount strategies and enhance customer engagement (encouraging the use of promo codes to obtain discounts).
Shipping method analysis.
Next, let's analyse customer preferences for payment methods and shipping types to enhance the shopping experience.
Customer Preferences for Payment Methods:
• Credit Card: 696
• Venmo: 653
• Cash: 648
• PayPal: 638
• Debit Card: 633
• Bank Transfer: 632
Customer Preferences for Shipping Types:
• Free Shipping: 675
• Standard: 654
• Store Pickup: 650
• Next Day Air: 648
• Express: 646
• 2-Day Shipping: 627
Features influencing subscription.
I will use LightGBM as the primary model after a VIF analysis, dropping specific columns should multicollinearity exist. The other columns worthy of removal would obviously be the clusters and the column containing the H3 geo-tags. I would also expect latitude and longitude to come out at the top of the VIF results.
The VIF analysis shows that the features `review_rating`, `latitude`, and `longitude` have high VIF values, indicating multicollinearity among these features. The other features have VIF values below 10, suggesting no significant multicollinearity.
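For reference, the variance inflation factor of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below is a plain NumPy version with an intercept included, run on synthetic columns; it is not the notebook's own VIF code.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Design matrix: intercept plus every column except column j.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(3)
n = 1000

# Synthetic features: b is almost a copy of a, c is unrelated to both.
a = rng.normal(size=n)
b = a + rng.normal(scale=0.1, size=n)
c = rng.normal(size=n)

values = vif(np.column_stack([a, b, c]))
print(np.round(values, 2))
```

One design note: when a VIF is computed without an intercept term, columns whose mean lies far from zero (ratings, latitudes and longitudes are typical examples) tend to show inflated values even if they are not collinear with anything, so it is worth checking how the intercept was handled before dropping such columns.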
LightGBM.
Pretty good macros! Across several rounds of cross-validation, accuracy has dropped from around 84% to what we see here. However, recall, F1 and precision have risen with every round, as have the macro averages.... which has led me to a more technical segue re: the importance of model quality over outright model accuracy, the finer details of which I won't bore anybody with right now, for the sake of the viewer's own sanity.
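The point about quality over raw accuracy can be illustrated with a small pure-Python sketch. The labels and the two sets of predictions below are made up (a hypothetical imbalanced split, not the model's actual output): one "model" always predicts the majority class, the other makes more mistakes overall but actually finds the minority class.

```python
def f1_for_label(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, label)[2] for label in labels) / len(labels)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical imbalanced target: 73 non-subscribers (0) and 27 subscribers (1).
y_true = [0] * 73 + [1] * 27

# Model A: always predicts "no subscription".
always_no = [0] * 100

# Model B: gets 50 of the 73 non-subscribers and 22 of the 27 subscribers right.
balanced = [0] * 50 + [1] * 23 + [1] * 22 + [0] * 5

for name, pred in [("always_no", always_no), ("balanced", balanced)]:
    print(name, round(accuracy(y_true, pred), 2), round(macro_f1(y_true, pred), 3))
```

Model A has the higher accuracy but a much lower macro F1, because it never identifies a single subscriber. That is the trade-off described above.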
SHAP values, validation data.
The ten most effective features:
1: In plain sight as per the EDA, we see the discounts, (2:) promo codes and (3:) Male gender as the three most influential features for subscription.
4: Positive values for previous purchases and ever so slightly lower purchase amounts as witnessed in the blue hues.
5: Customers who spend more per purchase may be less likely to subscribe, possibly due to a preference for one-time purchases.
6: The influence of age on subscription varies.
7: Customers who opt for faster shipping methods may value convenience and efficiency, making them more likely to subscribe.
8: Customers who shop in winter are highly likely to subscribe.
9: Customers who make purchases on a three-monthly and / or quarterly basis are less likely to subscribe.
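To show what a SHAP value represents, the sketch below computes exact Shapley values by brute force for a tiny, hypothetical scoring function. This is not how values are obtained for a LightGBM model in practice (tree-specific algorithms do that efficiently); the function, feature names and numbers here are all made up for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features outside a coalition are set to their baseline value."""
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical score with features [discount_applied, is_male, purchase_amount].
def score(x):
    discount, male, amount = x
    return 0.5 * discount + 0.3 * male - 0.002 * amount + 0.2 * discount * male

instance = [1, 1, 40.0]   # the customer being explained
baseline = [0, 0, 60.0]   # a reference customer

phi = shapley_values(score, instance, baseline)
print([round(v, 4) for v in phi])
```

The contributions add up to the difference between the customer's score and the baseline score, and the interaction term between discount and gender is split equally between those two features.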
The seasons and their relative SHAP values reflecting influence on subscription:
Garment / product colours and their relative SHAP values reflecting influence on subscription:
The ten most influential features and their means:
Dependence plots.
(A closer look at some variables)
The results of the dependence plot (weekly purchases vs. target variable) reflect that frequent shoppers are more inclined to commit to a subscription:
The SHAP dependence plot for "age" shows:
1. A non-linear relationship: The plot indicates a non-linear relationship between age and the target variable (subscription status). Ergo: the SHAP values for age vary across different age ranges, showing both positive and negative influences.
2. Younger and older age groups tend to have higher SHAP values, indicating a stronger influence on the target variable. Middle-aged groups show a little more variability.
3. Interaction effects: The colour gradient in the plot represents the interaction effect with another feature. This helps to understand how the interaction between age and another feature influences the target variable.
Overall, the results suggest that age has a complex and variable impact on the target variable, with different age groups showing different levels of influence.
The SHAP dependence plot for "colour: Olive" shows:
The plot below shows us again that the colour Olive has a slightly complex and variable impact on the target variable, with higher values indicating a stronger positive influence on the likelihood of subscription. This could possibly reflect customer psychology when we factor in the season most likely to influence sales and subs (Winter), as well as possibly the customer location.