User retention charts
Read the data
select
sessions.user_id,
date_trunc('week',users.signed_up_at) as signed_up_at_week,
floor(extract('day' from session_started_at - signed_up_at)/7) as week -- The number of weeks that passed since the user signed up
from
sessions
left join users on sessions.user_id = users.user_id
Select the time range
We have data spanning 365 days, but we don't usually need to look so far back. Let's restrict our analysis to a specific time range.
StartDate
EndDate
Calculate retention
We're going to have to transform our data in order to calculate retention. This will involve a few steps:
Count the number of users in each cohort. That is, the number of users who signed up each week.
For each week, count the number of users from each cohort who were active that week.
Divide the number of users active that week by the total size of the cohort. This will give us the retention rate.
retention = sessions_weekly.copy()
# Save the cohort size before we start calculating retention
cohort_size = retention.groupby(['signed_up_at_week']).user_id.nunique().reset_index(name='cohort_size')
# For each cohort-week, calculate the number of users who visited
retention = retention.groupby(['signed_up_at_week','week']).user_id.nunique().reset_index(name='n_users')
# Pivot and melt the table. This is a little trick that allows us to add rows during weeks where a cohort was not active.
retention = retention.pivot(index=['signed_up_at_week'],columns='week',values='n_users').fillna(0)
retention = retention.melt(value_name='n_users',ignore_index=False).reset_index()
# If part of the cohort is still not finished the week, then exclude that cohort-week from the data
# To do this we add 6 days to the sign up date. This gives us the last sign ups of that cohort.
retention = retention.loc[
~(retention.signed_up_at_week + pd.to_timedelta(retention.week + 1,'W') + pd.Timedelta(6,'D') >
pd.Timestamp.now(tz='UTC').floor('D'))
]
# Divide by the cohort size to get a percentage
retention = retention.merge(cohort_size,on=['signed_up_at_week'])
retention['prop'] = retention.n_users / retention.cohort_size
Visualize
Let's start by plotting a time series of each cohort's retention rate. This is useful for identifying changes over time. For example, we can see that our cohorts in September are performing much better than the ones in May through August.
Sometimes it's helpful to look at the number of users, not just the retention rate. It's also useful to visualize retention as a matrix, to find any outliers. Once again, we can see our September cohorts is doing well.