Customer Segmentation and Cohort Analysis
At Traction Tools we have tried many times to segment our users into groups: by industry, company size, lead source, and so on. While those methods haven't exactly failed, they haven't given us a reliable way to cluster our users.
In this analysis we will group our users based on their last purchase, how many subscriptions they have purchased, and how much revenue we've got from them. This is known as Recency, Frequency, and Monetary Value (RFM).
Using RFM we can segment our users on those three metrics. We will then assign them a score and a natural group such as Gold, Silver, or Bronze. Before jumping into that, we will analyze our customers' retention percentage and purchase behavior based on their cohort.
- Analyze Retention and Purchase behaviors based on monthly cohorts
- Segment Customers using their RFM values
- Segment Customers using an unsupervised machine learning algorithm
- About the dataset: In this exercise we will work with our invoices dataset, which is available in our data warehouse to everyone. Also, only paid invoices are displayed here.
- About this experiment: This is a living document. If you need the most up-to-date information, you can just run this notebook and you will see the latest results.
- About the reader: The results of the experiment displayed here are simple and can be followed by everyone. The machine learning section is better suited for people with more expertise in unsupervised learning.
Customer Retention and Spending Habits
Before diving into the segmentation, let's first analyze the retention rates and spending habits of our users. To achieve this we will analyze our customers since August 2019 and see how many of them have churned and how much they spend.
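The cohort tables behind this kind of analysis can be sketched with pandas. This is a minimal sketch, not the notebook's actual code: the `invoices` frame, its column names (`customer_id`, `invoice_date`, `amount`), and the sample rows are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical paid invoices; the column names and values are assumptions.
invoices = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3, 3, 4],
    "invoice_date": pd.to_datetime([
        "2019-08-05", "2019-09-05", "2019-10-05",
        "2019-08-12", "2019-09-12",
        "2019-09-01", "2019-10-01",
        "2019-09-20",
    ]),
    "amount": [100.0, 100.0, 100.0, 50.0, 50.0, 80.0, 80.0, 30.0],
})

# The cohort is the month of each customer's first invoice.
first_purchase = invoices.groupby("customer_id")["invoice_date"].transform("min")
invoices["cohort"] = first_purchase.dt.strftime("%Y-%m")

# Months elapsed since the cohort month (0 = month of the first invoice).
invoices["cohort_index"] = (
    (invoices["invoice_date"].dt.year - first_purchase.dt.year) * 12
    + (invoices["invoice_date"].dt.month - first_purchase.dt.month)
)

# Unique active customers per cohort and month offset.
active = (
    invoices.groupby(["cohort", "cohort_index"])["customer_id"]
    .nunique()
    .unstack("cohort_index")
)

# Retention: active customers divided by the cohort size in month 0.
retention = active.divide(active[0], axis=0)

# Average invoice amount per cohort and month offset.
avg_spend = invoices.pivot_table(
    index="cohort", columns="cohort_index", values="amount", aggfunc="mean"
)

print(retention.round(2))
print(avg_spend.round(2))
```

Each row of `retention` is one cohort read from left to right, which matches the horizontal reading of a cohort heatmap.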
In the graph below we can see the retention behavior across different cohorts since August 2019. The chart reads horizontally: the retention percentage for the cohort 2019-08 changed from 92% to 87%, and then went down to 75%. Retention rates for this cohort are great; however, we don't see the same behavior for the cohort 2019-12, whose retention rate was already down to 74% in its sixth month.
Looking at the big picture, retention rates for paying clients are high, which is very good. The more intense the blue, the higher our retention rate, and we don't see much paleness until the eleventh month. You can confirm this on our previous analysis:
The spending habits of each cohort are very interesting. The intensity of the red bar describes how much money our clients spend on average each month. The cohort 2019-08 spent on average 266 USD in its 14th month, while 2020-04 was not very generous, most likely due to how COVID hit every business on a global scale.
However, looking at the big picture, we have to notice that our users take too long to invest in our company. We must optimize our strategy to improve their spending habits with us and bring in more revenue in the early stages after the onboarding process.
Recency, Frequency, and Monetary (RFM) Segmentation
Having analyzed our customers' behavior around purchases and retention, it is time to segment our users using RFM scores. This will allow us to better understand each customer group, define actionable steps for them, run campaigns that target these clusters, and much more.
Calculating RFM Values
Before diving in, let's clarify one more time what each attribute represents:
- Recency: The number of days since the last purchase. This number should not be more than 30-60 days, as we're a SaaS company that profits from each user's subscription. The lower the better.
- Frequency: The number of times the user has purchased a subscription. The higher the better.
- Monetary Value: How much the customer has invested in our services. The higher this value is the better.
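The three attributes above can be computed with a single group-by. This is a minimal sketch under stated assumptions: the `invoices` frame, its column names, and the sample rows are hypothetical, and the snapshot date (one day after the latest invoice) is an assumption chosen so that Recency is always at least 1.

```python
import pandas as pd

# Hypothetical paid invoices; the column names and values are assumptions.
invoices = pd.DataFrame({
    "organization_id": [71, 71, 71, 85, 85, 92],
    "invoice_date": pd.to_datetime([
        "2020-09-01", "2020-10-01", "2020-11-01",
        "2020-06-15", "2020-07-15",
        "2020-11-10",
    ]),
    "amount": [400.0, 400.0, 450.0, 100.0, 100.0, 60.0],
})

# Assumed reference date: the day after the most recent invoice.
snapshot = invoices["invoice_date"].max() + pd.Timedelta(days=1)

rfm = invoices.groupby("organization_id").agg(
    # Days between the snapshot and the last purchase.
    recency=("invoice_date", lambda d: (snapshot - d.max()).days),
    # Number of paid invoices.
    frequency=("invoice_date", "count"),
    # Total amount invoiced.
    monetary_value=("amount", "sum"),
)

print(rfm)
```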
In this case, the organization 71 (Firespring) had its last purchase 26 days ago (Recency), has made 59 purchases (Frequency), and has invested $25,990 (Monetary Value).
All of this information is coming from our invoices dataset, and we'll use this information to:
- Assign scores for Recency, Frequency, and Monetary Value.
- Create a segment for them
- Assign them a score which would be the sum of their individual Recency, Frequency, and Monetary Value scores.
Based on the RFM values from our customers, we will now:
- Assign an individual score to each RFM attribute:
- Recency and Frequency will have a score of $1$ to $3$. The higher the better.
- Monetary Value will have a score of $1$ to $4$. Higher values are also better.
- Segment users based on their score:
- This will help us identify, for example, users that have a low frequency score but a high monetary score. The segment $314$ would identify a user who has great recency ($3$), low frequency ($1$), and high monetary value ($4$).
- Assign a cumulative score based on the RFM values:
- This will be the sum of the scores for each user. For example, the segment $334$ would have a score of $10$ because $3+3+4 = 10$
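The scoring steps above can be sketched with quantile bins. This is an illustration rather than the notebook's actual method: the use of `pd.qcut`, the helper `quantile_score`, and the sample values are all assumptions. Only the score ranges (1 to 3 for Recency and Frequency, 1 to 4 for Monetary Value) come from the text.

```python
import pandas as pd

# Hypothetical RFM values for twelve organizations.
rfm = pd.DataFrame(
    {
        "recency": [3, 10, 26, 40, 55, 70, 90, 120, 150, 200, 250, 300],
        "frequency": [60, 40, 33, 25, 20, 15, 12, 9, 6, 4, 2, 1],
        "monetary_value": [26000, 9800, 7000, 5200, 3100, 2500,
                           1800, 900, 520, 300, 150, 60],
    },
    index=pd.Index(range(1, 13), name="organization_id"),
)


def quantile_score(values, labels):
    """Split values into len(labels) equal-sized bins and label them.

    Ranking first avoids duplicate bin edges when many values are tied.
    """
    ranked = values.rank(method="first")
    return pd.qcut(ranked, q=len(labels), labels=labels).astype(int)


# Fewer days since the last purchase is better, so the labels are reversed.
rfm["R"] = quantile_score(rfm["recency"], [3, 2, 1])
rfm["F"] = quantile_score(rfm["frequency"], [1, 2, 3])
rfm["M"] = quantile_score(rfm["monetary_value"], [1, 2, 3, 4])

# Segment: the three scores concatenated. Score: their sum.
rfm["segment"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
rfm["score"] = rfm[["R", "F", "M"]].sum(axis=1)

print(rfm[["R", "F", "M", "segment", "score"]])
```

Quantile bins keep the groups roughly equal in size; fixed business thresholds would be an equally valid choice.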
Below is the result. As an example, let's examine the organization 71. This organization has a low Recency score, but great Frequency and great Monetary Value. This puts the organization in the segment 134, with a score of 8.
Now let's analyze the average RFM for each score. Users segmented with a score of $10$ have a Recency of $5.8$ days, a Frequency of $33.5$, and a Monetary Value of $\$9778.8$; in this segment we have $179$ customers right now.
That is great. However, our second largest group is the segment with a score of $6$. These clients have a regular recency and a regular frequency of $11.6$ purchases, but only provide a monetary value of $\$521$. I'm pretty sure that there's a lot we can do to improve the score of this segment in particular.
This information is great to have, but it is hard to read in this form. To go even further, we will assign a natural segment to our clients, classifying them as Gold, Silver, or Bronze. These segments can be great if we want to target users for campaigns.
As before, we can see the average RFM values for each segment. Particularly, the Bronze is the lowest tier, and we should make efforts to move those clients into a segment with a better score, such as Silver or Gold.
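The tier assignment and the per-tier averages can be sketched as follows. The cut-offs used here (3 to 5 Bronze, 6 to 8 Silver, 9 to 10 Gold) are hypothetical, as are the sample rows; the actual thresholds are not stated in this document.

```python
import pandas as pd

# Hypothetical scored customers; the values are made up for illustration.
scored = pd.DataFrame({
    "recency": [5, 12, 30, 45, 90, 200],
    "frequency": [34, 20, 12, 10, 4, 1],
    "monetary_value": [9800.0, 4000.0, 520.0, 700.0, 150.0, 60.0],
    "score": [10, 9, 6, 7, 4, 3],
})

# Assumed cut-offs: (2, 5] Bronze, (5, 8] Silver, (8, 10] Gold.
scored["tier"] = pd.cut(
    scored["score"], bins=[2, 5, 8, 10], labels=["Bronze", "Silver", "Gold"]
)

# Average RFM values and customer count per tier.
tier_summary = scored.groupby("tier", observed=True).agg(
    recency=("recency", "mean"),
    frequency=("frequency", "mean"),
    monetary_value=("monetary_value", "mean"),
    customers=("score", "count"),
)

print(tier_summary.round(1))
```

The same `groupby(...).agg(...)` call, grouped on `score` instead of `tier`, gives the average RFM values per score.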
This information is useful if we want to run campaigns based on the RFM values of our customers, improve operations, develop strategies for user retention, or just analyze the patterns and behaviors of our customers.
Also, this information is available to anyone in the company that needs it. If you are working in Marketing, Sales, Support, Finance, or anything that has to do with our users, I urge you to look at this information next time you're developing your strategies.
- What do you want to target?
- What can we do to improve the Monetary Value score of our Silver segment?
- What other services can we offer to improve the Recency score?
- How can we improve the frequency of our users to have them stay longer with us?
In the following chapter we're going to cover a more advanced clustering algorithm using machine learning. If you want to learn about it, I invite you to keep on reading; if you want to skip it, that's perfectly fine.
If you want to have access to the RFM values for each client to improve your strategies please feel free to reach out to me or Sergio.
These chapters are going to be a quick overview on how to apply a K-Means clustering algorithm to our userbase to uncover some insights.
K-Means makes certain assumptions when clustering. Two of the most important ones are that the variables are not skewed and that they are centered and scaled, that is, they share a similar mean and standard deviation. We can see that this is not the case when evaluating the RFM values.
To reduce the skewness we take the logarithm of each attribute, and then we center and scale the data. We can do this easily with the StandardScaler from the scikit-learn library:
- First, we calculate the log values for each RFM attribute.
- We scale the log values.
- And finally, we create a dataframe with normalized data.
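The three steps above can be sketched as follows. The `rfm` frame and its values are hypothetical; `np.log` requires strictly positive inputs, so this sketch assumes Recency is at least 1 and Monetary Value is greater than 0.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical, right-skewed RFM values (all strictly positive).
rfm = pd.DataFrame({
    "recency": [3, 10, 26, 40, 90, 200, 300],
    "frequency": [59, 40, 33, 12, 6, 2, 1],
    "monetary_value": [25990.0, 9800.0, 7000.0, 520.0, 300.0, 150.0, 60.0],
})

# Step 1: the log transform reduces the right skew.
rfm_log = np.log(rfm)

# Step 2: center each column to mean 0 and scale it to standard deviation 1.
scaler = StandardScaler()
scaled = scaler.fit_transform(rfm_log)

# Step 3: wrap the result in a dataframe with the original index and columns.
rfm_normalized = pd.DataFrame(scaled, index=rfm.index, columns=rfm.columns)

print(rfm_normalized.describe().round(2))
```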
Now that distributions are centered, we can see the effect of the StandardScaler() method using a graph.
Evaluating Cluster Number: Elbow Criterion
Now, the question: How can we decide the numbers of clusters?
To answer this question we can use the Elbow Criterion. We will fit several K-Means models, each with a different number of clusters, and store the sum of squared distances of each one in a dictionary. We'll then use it in a point plot to see which number of clusters is better suited for our case.
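The loop described above can be sketched like this. The data is a synthetic stand-in (four blobs in three dimensions) rather than our normalized RFM values, and the range of 1 to 8 clusters is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the normalized RFM matrix: four blobs in 3-D.
rng = np.random.default_rng(42)
centers = np.array([
    [-2.0, -2.0, -2.0],
    [2.0, 2.0, 2.0],
    [-2.0, 2.0, 0.0],
    [2.0, -2.0, 0.0],
])
X = np.vstack([c + rng.normal(scale=0.3, size=(50, 3)) for c in centers])

# Fit one model per candidate k and keep its sum of squared distances.
sse = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=1)
    model.fit(X)
    sse[k] = model.inertia_  # sum of squared distances to the closest centroid

for k, value in sse.items():
    print(k, round(value, 1))
```

The keys and values of `sse` are what goes on the x and y axes of the plot; the elbow is where adding another cluster stops reducing the value by much.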
The number of clusters we should use is the one close to the point where we see the elbow of the curve. In this case I chose 4.
Now that we have an optimal number of clusters, we run the K-Means algorithm with 4 clusters to segment our users.
Now we can see the average values for Recency, Frequency, and Monetary Value for each cluster. You will find that they are different from our previous method. Neither method is better in this case. For RFM values we should choose the approach that makes sense from a business perspective.
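The clustering step and the per-cluster averages can be sketched as follows. The `rfm` frame here is randomly generated, so the resulting clusters carry no business meaning; the column names and the `random_state` values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in for the RFM table (all values positive).
rng = np.random.default_rng(0)
rfm = pd.DataFrame({
    "recency": rng.integers(1, 365, size=200),
    "frequency": rng.integers(1, 60, size=200),
    "monetary_value": rng.gamma(shape=2.0, scale=1500.0, size=200) + 1.0,
})

# Same preprocessing as before: log transform, then center and scale.
rfm_normalized = StandardScaler().fit_transform(np.log(rfm))

# Fit K-Means with 4 clusters and attach the labels to the raw values.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=1)
rfm["cluster"] = kmeans.fit_predict(rfm_normalized)

# Average raw RFM values and size of each cluster.
cluster_summary = rfm.groupby("cluster").agg(
    recency=("recency", "mean"),
    frequency=("frequency", "mean"),
    monetary_value=("monetary_value", "mean"),
    customers=("recency", "count"),
)

print(cluster_summary.round(1))
```

The model is fitted on the normalized values, but the summary uses the raw values so that the averages stay readable in days, purchases, and dollars.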
Snake Plot and Attribute Importance
Let's now visualize our clusters with a snake plot. This will allow us to spot clusters with, let's say, low Recency but high Frequency and high Monetary Value.
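A snake plot draws one line per cluster across the normalized attributes, so the data first has to be reshaped into long form. This sketch only prepares the data; the `normalized` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical normalized RFM values with their cluster labels.
normalized = pd.DataFrame({
    "recency": [-1.2, -0.8, 0.9, 1.1],
    "frequency": [1.0, 0.6, -0.7, -0.9],
    "monetary_value": [1.3, 0.5, -0.8, -1.0],
    "cluster": [0, 0, 1, 1],
})

# Long format: one row per (customer, attribute) pair.
melted = normalized.melt(
    id_vars="cluster", var_name="attribute", value_name="value"
)

# Each row of this table is one "snake": a cluster's mean per attribute.
snake = (
    melted.groupby(["cluster", "attribute"])["value"]
    .mean()
    .unstack("attribute")
)

print(snake.round(2))
```

The `melted` frame is the shape that line-plot functions usually expect, with `attribute` on the x axis, `value` on the y axis, and one line per `cluster`.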
We can also plot the Attribute Importance to see what's important for each cluster when it comes to RFM. For some clusters Monetary Value might be important because they're not big companies, yet they still want to use our software.
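One common way to compute such an importance score is to compare each cluster's average with the average of the whole population. This is an assumption about the method, not necessarily what the notebook does, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical raw RFM values with their cluster labels.
rfm = pd.DataFrame({
    "recency": [10, 20, 100, 110],
    "frequency": [30, 30, 5, 15],
    "monetary_value": [9000.0, 7000.0, 1000.0, 3000.0],
    "cluster": [0, 0, 1, 1],
})
attributes = ["recency", "frequency", "monetary_value"]

# Average of each attribute per cluster and over all customers.
cluster_avg = rfm.groupby("cluster")[attributes].mean()
population_avg = rfm[attributes].mean()

# Relative importance: 0 means the cluster matches the population average;
# the further from 0, the more the attribute distinguishes the cluster.
relative_importance = cluster_avg / population_avg - 1

print(relative_importance.round(2))
```

The resulting table can be shown as a heatmap, with one row per cluster and one column per attribute.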
Lastly, we dump this information into our data warehouse to make it available to everyone that needs it.
We've reached the end of this analysis. This is something that can help you develop your strategies, or maybe you are only curious and want to learn more about our users. I hope you enjoyed this entry. Feel free to reach out to me or Sergio if you have more questions 👋.