# Survival Analysis: Analyzing Churn and Improving Customer Retention as a SaaS Company

At **Traction Tools** we're highly committed to helping our
clients succeed. We run a platform for EOS, a system that helps entrepreneurs run their
business, internal operations, and effective meetings in the cloud.

However, as a SaaS company, we regularly deal with issues like churn and customer retention. Here we're going to discuss how we analyze churn and which factors make our customers stay or cancel their subscription.

It is very common for companies to try to predict customer churn using so-called black-box models: highly complex algorithms that can detect whether a client is going to cancel their subscription based on a number of factors.

This is not necessarily bad, but **there are better ways to predict tenure and calculate the probability
of a user churning: interpretable models, which help us understand what is causing our users
to cancel their subscription.**

This short article is aimed at data scientists and business analysts who would like a better understanding of how to calculate a client's churn probability, its causes, and the overall churn ratio.

## Introduction

We build software for EOS and offer our platform to users who want to run effective meetings.
Following our business model we have **teams**, each of which includes an `n` number of users and an `n` number of meetings run every week.

This allows us to have a sample dataset that includes:

- **Weekly Average Meetings:** How many meetings the user runs per week.
- **Active User Count:** How many users are within a team.
- **Has Churned:** Marks the observable event of *"death"* (i.e., cancellation).
- **Cluster Labels:** A categorical variable that tells us whether the account has high or low activity on the platform.
- **Tenure:** How many days the account was active on our platform.

*In this article we will work only with a sample containing synthetic data and limited features
to keep sensitive information private.*

## Key Objectives

Key objectives from this analysis are:

- Performing a basic and short EDA (Exploratory Data Analysis) to get insights
- Getting the median lifetime of our customers
- Validating if the median lifetime varies per account activity

To run this analysis we'll use a Python environment with libraries such as Pandas, Matplotlib, and Lifelines. Without further ado, let's jump right into the exploratory data analysis (EDA).

# Exploratory Data Analysis

Let's start by importing some libraries that we're going to use and also our dataframe to inspect it.
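Since the real sample is private, here is a minimal sketch of how a comparable synthetic dataframe could be built and inspected. Everything in it is an assumption made for illustration: the column names (apart from `clustered_labels` and `tenure`, which this article mentions), the distributions, and the sample size are hypothetical and do not reproduce our actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic sample; distributions and sizes are illustrative only.
rng = np.random.default_rng(42)
n = 200

labels = rng.choice(["low_activity", "high_activity"], size=n, p=[0.83, 0.17])
is_high = labels == "high_activity"

df = pd.DataFrame({
    # Column names below are assumed, except `clustered_labels` and `tenure`.
    "weekly_avg_meetings": np.where(is_high, rng.poisson(6, n), rng.poisson(2, n)),
    "active_user_count": np.where(is_high, rng.poisson(25, n), rng.poisson(9, n)) + 1,
    "has_churned": rng.integers(0, 2, n),   # 1 = cancelled, 0 = still active
    "clustered_labels": labels,
    "tenure": rng.integers(30, 900, n),     # tenure in days
})

print(df.head())
df.info()
```
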

The first thing I notice in this sample is that there is a low number of *high activity* accounts.
Let's verify this assumption by running the `.value_counts()` method on our dataframe.

As expected, **83%** of our sample is composed of low activity accounts, while **17%** of it
is made up of high activity accounts.
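As a rough sketch of this step, `value_counts(normalize=True)` returns shares rather than raw counts. The toy sample below is hypothetical (six accounts), so its shares will not match the 83% / 17% split of our sample; the `high_activity` label value is also an assumption.

```python
import pandas as pd

# Hypothetical toy sample: five low activity accounts and one high activity account.
df = pd.DataFrame(
    {"clustered_labels": ["low_activity"] * 5 + ["high_activity"]}
)

# normalize=True returns each label's share of the sample instead of its count.
activity_share = df["clustered_labels"].value_counts(normalize=True)
print(activity_share.round(2))
```
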

It would be a good idea to visualize these numbers in a horizontal bar chart. Fortunately, this can be easily
done using the pandas method `.plot()`.
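One possible way to draw that chart is sketched below, again on a hypothetical toy sample. The title and axis label are my own choices, not something taken from our dashboards.

```python
import pandas as pd

# Hypothetical toy sample, as before.
df = pd.DataFrame(
    {"clustered_labels": ["low_activity"] * 5 + ["high_activity"]}
)
shares = df["clustered_labels"].value_counts(normalize=True)

# kind="barh" asks pandas (via Matplotlib) for a horizontal bar chart.
ax = shares.plot(kind="barh", title="Share of accounts by activity")
ax.set_xlabel("Share of accounts")
```

In a notebook the chart renders inline; in a script you would typically follow this with `matplotlib.pyplot.show()`.
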

Having a visual representation helps us identify an issue here: if we run a survival analysis, we might have to split these two groups to better understand their behavior and lifetime on our platform.

Now let's consider the **tenure** column, which tells us how long a client
stays with us. We can run the `.describe()` method to get some basic statistics about this feature.

The first thing I want to do before we run the `.describe()` method is to convert the column to months
instead of days.
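A minimal sketch of that conversion follows. The divisor is an assumption on my part (an average month of 365.25 / 12 ≈ 30.44 days); dividing by a flat 30 would work just as well for a rough view. The `tenure_months` column name and the five tenure values are hypothetical.

```python
import pandas as pd

# Hypothetical tenure values, in days.
df = pd.DataFrame({"tenure": [90, 240, 510, 700, 365]})

DAYS_PER_MONTH = 30.44  # assumption: average month length (365.25 / 12)

# Convert days to whole months, then summarize the new column.
df["tenure_months"] = (df["tenure"] / DAYS_PER_MONTH).round().astype(int)
summary = df["tenure_months"].describe()
print(summary)
```
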

Now, we have `3,063` observations here, and we can notice that the mean tenure is **17 months**, with a
lower bound of **8 months** and an upper bound of **23 months**.

However, this is not an appropriate way of measuring churn, because we cannot say that *every* client
stays with us for 8 to 23 months: not everyone has the same experience. Furthermore, we already saw that
we have different groups of accounts, and this information might vary wildly between them.

Now, let's compare the behavior of accounts that have cancelled against accounts that are still active in this sample and try to get some insights.

What we have here is a multi-index describing the mean and median values for our features, broken down by account activity for active and cancelled accounts.
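A table of this shape can be produced with a `groupby` followed by `.agg(["mean", "median"])`. The sketch below uses a hypothetical six-row sample and assumed column names (`has_churned`, `weekly_avg_meetings`, `active_user_count`), so the numbers are illustrative and not the ones discussed in this article.

```python
import pandas as pd

# Hypothetical toy sample; has_churned: 1 = cancelled, 0 = still active.
df = pd.DataFrame({
    "clustered_labels": ["high_activity", "high_activity",
                         "low_activity", "low_activity",
                         "low_activity", "low_activity"],
    "has_churned": [0, 1, 0, 0, 1, 1],
    "weekly_avg_meetings": [6.0, 5.0, 2.0, 3.0, 1.0, 1.0],
    "active_user_count": [20, 18, 10, 11, 7, 6],
})

# Rows: (activity label, churn flag). Columns: (feature, statistic).
summary = (
    df.groupby(["clustered_labels", "has_churned"])
      [["weekly_avg_meetings", "active_user_count"]]
      .agg(["mean", "median"])
)
print(summary)
```
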

Ok, that was a mouthful, but let's focus on meetings first:

- When it comes to weekly meetings, active and cancelled accounts with **high activity** have the same number of meetings per week on average, but a difference of -2.5 when we calculate it using the median.
- For active accounts with **low activity**, the mean and the median don't differ wildly. Two meetings a week is reasonable; however, we can notice that for cancelled accounts the number of weekly meetings drops to **1** instead of **2** meetings a week.

Now, what can we conclude about the active user count in each team?

- When it comes to **high activity** teams, the mean and median values for active and cancelled accounts don't differ much.
- On the other hand, **low_activity** accounts that are active usually have about 10 users on their team, while cancelled accounts usually have 7 users on their subscription, a different behavior that requires further analysis.

## EDA Key Insights

With this information we can already conclude that keeping our users busy in the platform is *paramount*
to retain them, and because Traction Tools is a collaborative space, having more users in their team
increases engagement.

We can already start developing retention strategies to succeed with our customers. Based on this information we can also build machine learning algorithms to detect churn, anomalies, and clients that will provide more value over time.

Let's go a bit further and try to estimate probabilities around the insights we have discovered.

# Using Lifelines for Survival Analysis

There's a great library for doing survival analysis properly, created by
Cameron Davidson-Pilon, called **Lifelines**.

It is one of the best libraries for survival analysis that I've tried so far. Let's use it to analyze the chance of survival at any point in our clients' subscriptions.

## Global Survivability Rates

We'll start by using the **Kaplan-Meier** fitter to analyze the survivability rates for the whole
population.

Now we can see the total survival chance of our population at any point in time. In the example above
we can observe that there's initially a **100%** chance of survival and it slowly declines as time
goes by.

Of **408** observations, we can see that in the 10th month **232** of them still have an active
subscription, but **176** of them have already cancelled.
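To make the mechanics behind these counts concrete, here is a small hand-rolled version of the Kaplan-Meier estimate in plain Python. It is a teaching sketch on hypothetical data, not how Lifelines is implemented: at each time with at least one cancellation, the running survival probability is multiplied by `1 - churned / at_risk`.

```python
def kaplan_meier(durations, events):
    """Return (time, at_risk, churned, survival) for each time with a cancellation."""
    rows, survival = [], 1.0
    for t in sorted(set(durations)):
        # Accounts still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Accounts that cancelled exactly at time t (censored ones are skipped).
        churned = sum(1 for d, e in zip(durations, events) if d == t and e)
        if churned:
            survival *= 1 - churned / at_risk
            rows.append((t, at_risk, churned, survival))
    return rows


# Hypothetical sample: tenure in months, 1 = cancelled, 0 = still active.
durations = [3, 5, 8, 11, 11, 14, 17, 20, 23, 24]
events = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0]

table = kaplan_meier(durations, events)
for time, at_risk, churned, survival in table:
    print(f"month {time:>2}: at risk={at_risk}, churned={churned}, S(t)={survival:.4f}")
```

Note how censored accounts shrink the at-risk count without lowering the survival probability, which is what separates this estimate from a naive churn ratio.
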

Now let's try to get the median survival time and also the survival chance at this point in time.

It seems that after the 11th month our clients have a 50/50 chance of cancelling their subscription. Now let's get a lower and an upper bound so that we have a confidence interval instead of only the median value.

Now we know that we should take care of accounts that are between 10 and 13 months old. Using this information we can trigger actions to take care of these customers in order to improve their lifespan on the platform.

## Segmented Survivability Rates

However, there's one thing we have to notice: these are values for the entire population, but we know
that **we have different types of clients in our sample**, and we should separate these two
populations and observe their behavior.

To achieve this we'll separate our populations using the `clustered_labels` column, which separates the
accounts by activity.