Customer Churn Analysis
Introduction
Customer attrition (a.k.a. customer churn) is one of the biggest costs facing any organization. If we could figure out, with reasonable accuracy, why and when a customer is likely to leave, it would immensely help the organization plan its retention initiatives.
Let’s attempt to answer some of the key business questions pertaining to customer attrition, such as:
(1) What is the likelihood of an active customer leaving the organization?
(2) What are the key indicators of customer churn?
(3) What retention strategies can be implemented, based on the results, to reduce prospective customer churn?
In the real world, we need to go through seven major stages to successfully predict customer churn:
Section A: Data Preprocessing
Section B: Data Evaluation
Section C: Model Selection
Section D: Model Evaluation
Section E: Model Improvement
Section F: Future Predictions
Section G: Model Deployment
Section A: Data Preprocessing
Libraries
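The article does not reproduce the import cell; a representative setup for this kind of workflow might look like the following (the exact library choices are an assumption):

```python
# Core libraries assumed for this workflow (representative, not the article's exact list).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (confusion_matrix, roc_auc_score, roc_curve,
                             precision_score, recall_score, fbeta_score)
```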
Dataset
We need to look at the dataset in general, and at each column in detail, to get a better understanding of the input data and to aggregate fields where needed.
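For illustration, loading and taking a first look at the data could be done as follows (the file name is an assumption based on the Kaggle Telco churn dataset):

```python
# Load the Telco customer churn dataset (file name is an assumption).
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

print(df.shape)       # number of records and columns
print(df.head())      # one row per customer: subscription, tenure, payments, churn
print(df.describe())  # summary statistics for the numerical columns
```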
We can see that this is a telco customer churn dataset, where each record captures the nature of the subscription, tenure, frequency of payment and churn status.
A quick describe method reveals that the telecom customers stay on average for 32 months and pay $64 per month. However, these averages may mask substantial variation, since different customers are on different contracts.
Identify unique values
We can see that customers are either on a month-to-month rolling contract or on a fixed one- or two-year contract. They pay their bills via credit card, bank transfer, electronic check or mailed check.
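A minimal sketch of how these unique values might be inspected (column names assume the standard Kaggle Telco schema):

```python
# Inspect the unique values of a few categorical columns to understand
# the contract types and payment methods on offer.
for col in ["Contract", "PaymentMethod", "PaperlessBilling"]:
    print(col, ":", df[col].unique())
```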
Target variable
The dataset is imbalanced, with a high proportion of active customers compared to their churned counterparts.
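The class balance can be checked with a simple value count, for example:

```python
# Check the class balance of the target variable.
print(df["Churn"].value_counts())
print(df["Churn"].value_counts(normalize=True))  # roughly 74% active vs 26% churned
```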
Cleaning the dataset
Label Encode Binary data
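A minimal sketch of how the binary Yes/No columns could be label encoded (the column list is an assumption based on the standard Telco schema):

```python
# Label encode the binary Yes/No columns (column list is an assumption).
binary_cols = ["Partner", "Dependents", "PhoneService", "PaperlessBilling", "Churn"]
le = LabelEncoder()
for col in binary_cols:
    df[col] = le.fit_transform(df[col])

# 'gender' (Male/Female) is also binary and can be encoded the same way.
df["gender"] = le.fit_transform(df["gender"])
```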
Section B: Data Evaluation
A few observations can be made based on the histograms for numerical variables:
The gender distribution is roughly even: almost half of the customers are female whilst the other half are male. Most of the customers in the dataset are younger people (non-senior citizens).
Not many customers seem to have dependents whilst almost half of the customers have a partner.
There are a lot of new customers in the organization (with a tenure of less than 10 months), followed by a loyal customer segment that has stayed for more than 70 months.
Most of the customers seem to have phone service, and three-fourths of them have opted for paperless billing. Monthly charges span anywhere between $18 and $118 per customer, with a large proportion of customers in the $20 segment.
Analyze the distribution of categorical variables
Most of the customers seem to be on a month-to-month (prepaid) plan with the telecom company. On the other hand, there is a more or less equal proportion of customers on 1-year and 2-year contracts.
The dataset indicates that customers prefer to pay their bills electronically the most followed by bank transfer, credit card and mailed checks.
Most of the customers have phone service, out of which almost half have multiple lines. Three-fourths of the customers have opted for internet service via fiber optic or DSL connections, with almost half of the internet users subscribing to streaming TV and movies. Customers who have opted for Online Backup, Device Protection, Technical Support and Online Security are a minority.
Analyze the churn rate by categorical variables:
A preliminary look at the overall churn rate shows that around 74% of the customers are active. As shown in the chart above, this is an imbalanced classification problem.
Machine learning algorithms work well when the number of instances of each class is roughly equal. Since the dataset is skewed, we need to keep that in mind while choosing the metrics for model selection.
Customers with a prepaid or rather a month-to-month connection have a very high probability to churn compared to their peers on 1 or 2 years contracts.
Positive and negative correlations
Interestingly, the churn rate increases with monthly charges and paperless billing. In contrast, Partner, Dependents and Tenure seem to be negatively correlated with churn.
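One way to produce such a ranked view of correlations, assuming Churn has already been label encoded and the identifier column is named customerID:

```python
# One-hot encode the features and inspect each feature's correlation with
# the (already label-encoded) Churn column, from most positive to most negative.
encoded = pd.get_dummies(df.drop(columns=["customerID"]), dtype=float)
correlations = encoded.corr()["Churn"].drop("Churn").sort_values(ascending=False)
print(correlations)

correlations.plot(kind="bar", figsize=(14, 5))
plt.title("Correlation of features with Churn")
plt.show()
```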
Correlation Matrix
Multicollinearity using VIF
We can see here that the ‘Monthly Charges’ and ‘Total Charges’ have a high VIF value.
After dropping the ‘Total Charges’ variable, the VIF values for all the independent variables decrease to a considerable extent.
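A sketch of the VIF calculation using statsmodels, assuming TotalCharges has already been converted to a numeric type during cleaning:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# VIF for the numeric predictors; TotalCharges is read in as text with some
# blanks, so this assumes it was converted to numeric during cleaning.
numeric = add_constant(
    df[["tenure", "MonthlyCharges", "TotalCharges"]].dropna().astype(float)
)
vif = pd.Series(
    [variance_inflation_factor(numeric.values, i) for i in range(numeric.shape[1])],
    index=numeric.columns,
)
print(vif)

# If MonthlyCharges and TotalCharges show high VIF, drop one of them
# (the article drops TotalCharges) and recompute.
```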
Encode Categorical data
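The remaining multi-category columns can be one-hot encoded, for example (dropping the identifier and TotalCharges columns, per the VIF analysis above, is an assumption about the exact column set):

```python
# One-hot encode the remaining multi-category columns; drop_first avoids
# the dummy-variable trap.
df_encoded = pd.get_dummies(
    df.drop(columns=["customerID", "TotalCharges"]),
    drop_first=True,
)
```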
Split the dataset
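A typical split might look like the following; the 80/20 ratio and random seed are assumptions, as the article does not state them:

```python
# Split the encoded data into training and test sets, stratified on the target.
X = df_encoded.drop(columns=["Churn"])
y = df_encoded["Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```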
Feature Scaling
It’s quite important to normalize the variables before running any machine learning (classification) algorithm, so that all features in the training and test sets are scaled within a range of 0 to 1.
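Scaling to the 0 to 1 range corresponds to min-max scaling, for example:

```python
# Min-max scaling squeezes every feature into the 0-1 range; fit on the
# training set only and reuse the same transformation on the test set.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```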
Section C: Model Selection
Compare Baseline Classification Algorithms
Let’s train each classification algorithm on the training dataset and evaluate its mean accuracy and standard deviation.
Classification accuracy is one of the most common evaluation metrics for comparing baseline algorithms, as it is simply the number of correct predictions as a ratio of total predictions. However, it is not the ideal metric when we have a class imbalance issue. Hence, let us sort the results by the ‘Mean AUC’ value, which measures the model’s ability to discriminate between the positive and negative classes.
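A sketch of this comparison using cross-validation (the article likely compares a longer list of models; the three below are representative):

```python
# Compare baseline classifiers with stratified k-fold cross-validation,
# reporting mean ROC AUC and its standard deviation.
from sklearn.model_selection import StratifiedKFold

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = []
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring="roc_auc")
    results.append((name, scores.mean(), scores.std()))

# Sort by mean AUC, best model first.
for name, mean_auc, std in sorted(results, key=lambda r: r[1], reverse=True):
    print(f"{name}: mean AUC = {mean_auc:.3f} (+/- {std:.3f})")
```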
Getting the right parameters for the baseline models
As we can see from the above iterations, if we use K = 22, then we will get the maximum score of 78%.
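The K value can be found with a simple sweep like the one below (the search range is an assumption); the same pattern applies to tuning n_estimators for the random forest discussed next:

```python
# Sweep K and keep the value with the best cross-validated accuracy
# (the article reports K = 22 as the best setting).
k_scores = {}
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k)
    k_scores[k] = cross_val_score(knn, X_train_scaled, y_train,
                                  cv=10, scoring="accuracy").mean()

best_k = max(k_scores, key=k_scores.get)
print("Best K:", best_k, "score:", round(k_scores[best_k], 3))
```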
As we can see from the iterations above, the random forest model attains its highest accuracy score when n_estimators = 72.
In the second iteration of comparing baseline classification algorithms, we will use the optimized parameters for the KNN and Random Forest models. Also, since false negatives are more costly than false positives in a churn problem, let’s use precision, recall and the F2 score as the metrics for model selection.
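One way to compute these metrics for the tuned models on the test set (a sketch; the article's exact comparison setup isn't shown, and the F2 score weights recall twice as heavily as precision):

```python
# Evaluate each tuned model with precision, recall and the F2 score.
def evaluate(model, X_tr, y_tr, X_te, y_te):
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    return (precision_score(y_te, y_pred),
            recall_score(y_te, y_pred),
            fbeta_score(y_te, y_pred, beta=2))

tuned_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN (K=22)": KNeighborsClassifier(n_neighbors=22),
    "Random Forest (72 trees)": RandomForestClassifier(n_estimators=72, random_state=0),
}
for name, model in tuned_models.items():
    p, r, f2 = evaluate(model, X_train_scaled, y_train, X_test_scaled, y_test)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F2={f2:.3f}")
```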
From the second iteration, we can conclude that logistic regression is the model of choice for the given dataset, as it has the highest combination of precision, recall and F2 scores, giving the most correct positive predictions while minimizing false negatives. Hence, let’s use Logistic Regression and evaluate its performance.
Section D: Model Evaluation
Train & evaluate Chosen Model
k-Fold Cross-Validation: Model evaluation is most commonly done through the k-fold cross-validation technique, which primarily helps us assess and address variance.
A variance problem occurs when we get good accuracy running the model on one training set and test set, but the accuracy looks very different when the model is run on another test set.
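A sketch of the cross-validation step for the chosen model (the fold count is an assumption):

```python
# 10-fold cross-validation of the chosen logistic regression model; the
# min/max of the fold accuracies gives an expected accuracy range.
classifier = LogisticRegression(max_iter=1000)
accuracies = cross_val_score(classifier, X_train_scaled, y_train,
                             cv=10, scoring="accuracy")
print("Accuracy range: {:.1%} - {:.1%}".format(accuracies.min(), accuracies.max()))
print("Mean: {:.1%}  Std: {:.1%}".format(accuracies.mean(), accuracies.std()))
```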
Therefore, our k-fold Cross Validation results indicate that we would have an accuracy anywhere between 76% to 84% while running this model on any test set.
Visualize results on a Confusion Matrix
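A minimal sketch of fitting the chosen model and plotting its confusion matrix on the test set:

```python
# Fit the chosen model and visualize the confusion matrix on the test set.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Stayed", "Churned"], yticklabels=["Stayed", "Churned"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```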
Evaluate the model using ROC Graph
The ROC graph shows the capability of the model to distinguish between the classes, summarized by the AUC score.
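For example, the ROC curve and AUC can be produced from the predicted probabilities of the fitted classifier:

```python
# Plot the ROC curve and report the AUC on the test set
# (uses the classifier fitted in the previous step).
y_proba = classifier.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

plt.plot(fpr, tpr, label=f"Logistic Regression (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```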
Predict Feature Importance
Logistic Regression allows us to determine the key features that have significance in predicting the target attribute (“Churn” in this project).
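A simple way to surface these key features is to rank the fitted model's coefficients, for example:

```python
# Rank features by their logistic regression coefficients; positive weights
# push predictions towards churn, negative ones away from it.
weights = pd.Series(classifier.coef_[0], index=X.columns).sort_values(ascending=False)
print(weights.head(10))   # strongest drivers of churn
print(weights.tail(10))   # strongest indicators of retention
```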
Section E: Model Improvement
Model improvement basically involves choosing the best parameters for the machine learning model we have come up with. There are two types of parameters in any machine learning model: the first type are the parameters that the model learns, whose optimal values are found automatically by training the model; the second type are the ones the user gets to choose while running the model. The latter are called hyperparameters: a set of configurable values external to the model that cannot be determined from the data, and that we try to optimize through parameter tuning techniques like Random Search or Grid Search.
Hyperparameter Tuning via Grid Search
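A sketch of grid search for the logistic regression model; the parameter grid below is illustrative, not the one used in the article:

```python
# Grid search over a small logistic regression hyperparameter grid.
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],   # liblinear supports both l1 and l2 penalties
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring="roc_auc", cv=10, n_jobs=-1)
grid.fit(X_train_scaled, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated AUC:", round(grid.best_score_, 3))
```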
Final Hyperparameter tuning and selection
Section F: Future Predictions
Compare predictions against the test set:
Format Final Results
Unpredictability and risk are close companions of any predictive model. Therefore, in the real world, it’s always good practice to provide a propensity score alongside the absolute predicted outcome. Instead of just retrieving a binary estimated target outcome (0 or 1), every ‘Customer ID’ gets an additional propensity score highlighting its probability of taking the target action.
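A sketch of how the propensity score could be attached to each customer in the final results (the customerID column name is an assumption based on the Kaggle schema):

```python
# Build a results table: binary prediction plus a propensity score
# (predicted probability of churn) for every customer in the test set.
final_results = pd.DataFrame({
    "customerID": df.loc[X_test.index, "customerID"].values,
    "Churn_prediction": classifier.predict(X_test_scaled),
    "Churn_propensity": classifier.predict_proba(X_test_scaled)[:, 1],
})
print(final_results.head())
```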
Conclusion
So, in a nutshell, we used a customer churn dataset from Kaggle to build a machine learning classifier that predicts the propensity of any customer to churn in the months to come, with a reasonable accuracy of 76% to 84%.
Classify the customers
What's next
-- Share the key insights about customer demographics and churn rate gathered from the exploratory data analysis sections with the sales and marketing teams of the organization. Let the sales team know which features have positive and negative correlations with churn so that they can plan retention initiatives accordingly.
-- Further, classify upcoming customers based on the propensity score as high risk (propensity score > 80%), medium risk (propensity score between 60% and 80%) or low risk (propensity score < 60%), as sketched after this list. Focus on each segment of customers upfront and ensure that their needs are well taken care of.
-- Lastly, measure the return on investment (ROI) of this assignment by computing the attrition rate for the current financial quarter. Compare the quarter results with the same quarter last year or the year before and share the outcome with the senior management of your organization.
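A minimal sketch of the risk segmentation described above, applied to the propensity scores from the final results table:

```python
# Bucket customers into risk bands based on their propensity score
# (thresholds follow the segmentation described above).
final_results["Risk_segment"] = pd.cut(
    final_results["Churn_propensity"],
    bins=[0, 0.6, 0.8, 1.0],
    labels=["Low risk", "Medium risk", "High risk"],
    include_lowest=True,
)
print(final_results["Risk_segment"].value_counts())
```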