Customer Churn Analysis
Introduction
Customer attrition (a.k.a. customer churn) is one of the biggest costs facing any organization. If we could figure out, with reasonable accuracy, why and when a customer is likely to leave, it would immensely help the organization plan its retention initiatives.
Let’s attempt to answer some of the key business questions pertaining to customer attrition, such as:
(1) What is the likelihood of an active customer leaving the organization?
(2) What are the key indicators of customer churn?
(3) What retention strategies can be implemented, based on the results, to reduce prospective customer churn?
In the real world, we need to go through seven major stages to successfully predict customer churn:
Section A: Data Preprocessing
Section B: Data Evaluation
Section C: Model Selection
Section D: Model Evaluation
Section E: Model Improvement
Section F: Future Predictions
Section G: Model Deployment
Section A: Data Preprocessing
Libraries
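The article does not reproduce the import cell; a representative setup for this kind of workflow might look like the following (the exact library choices are an assumption):

```python
# Core libraries assumed for this workflow (representative, not the article's exact list).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (confusion_matrix, roc_auc_score, roc_curve,
                             precision_score, recall_score, fbeta_score)
```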
Dataset
We need to look at the dataset in general, and at each column in detail, to get a better understanding of the input data and to aggregate fields where needed.
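For illustration, loading and taking a first look at the data could be done as follows (the file name is an assumption based on the Kaggle Telco churn dataset):

```python
# Load the Telco customer churn dataset (file name is an assumption).
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

print(df.shape)       # number of records and columns
print(df.head())      # one row per customer: subscription, tenure, payments, churn
print(df.describe())  # summary statistics for the numerical columns
```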
We can see that this is a telco customer churn dataset, where each record captures the nature of the subscription, tenure, frequency of payment and churn status.
A quick describe method reveals that the telecom customers stay on average for 32 months and pay $64 per month. However, these averages may mask substantial variation, since different customers are on different contracts.
Identify unique values
We can see that customers are either on a month-to-month rolling contract or on a fixed one- or two-year contract. They pay their bills via credit card, bank transfer, electronic check or mailed check.
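A minimal sketch of how these unique values might be inspected (column names assume the standard Kaggle Telco schema):

```python
# Inspect the unique values of a few categorical columns to understand
# the contract types and payment methods on offer.
for col in ["Contract", "PaymentMethod", "PaperlessBilling"]:
    print(col, ":", df[col].unique())
```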
Target variable
The dataset is imbalanced, with a high proportion of active customers compared to their churned counterparts.
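The class balance can be checked with a simple value count, for example:

```python
# Check the class balance of the target variable.
print(df["Churn"].value_counts())
print(df["Churn"].value_counts(normalize=True))  # roughly 74% active vs 26% churned
```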
Cleaning the dataset
Label Encode Binary data
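A minimal sketch of how the binary Yes/No columns could be label encoded (the column list is an assumption based on the standard Telco schema):

```python
# Label encode the binary Yes/No columns (column list is an assumption).
binary_cols = ["Partner", "Dependents", "PhoneService", "PaperlessBilling", "Churn"]
le = LabelEncoder()
for col in binary_cols:
    df[col] = le.fit_transform(df[col])

# 'gender' (Male/Female) is also binary and can be encoded the same way.
df["gender"] = le.fit_transform(df["gender"])
```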
Section B: Data Evaluation
A few observations can be made based on the histograms for numerical variables:
The gender distribution is roughly even: almost half of the customers are female whilst the other half are male. Most of the customers in the dataset are younger people (non-senior citizens).
Not many customers seem to have dependents whilst almost half of the customers have a partner.
There are a lot of new customers in the organization (with a tenure of less than 10 months), followed by a loyal customer segment that has stayed for more than 70 months.
Most of the customers seem to have phone service, and three-fourths of them have opted for paperless billing. Monthly charges span anywhere between $18 and $118 per customer, with a large proportion of customers in the $20 segment.
Analyze the distribution of categorical variables
Most of the customers seem to be on a month-to-month (prepaid) plan with the telecom company. On the other hand, there is a more or less equal proportion of customers on 1-year and 2-year contracts.
The dataset indicates that customers prefer to pay their bills electronically the most followed by bank transfer, credit card and mailed checks.
Most of the customers have phone service, out of which almost half have multiple lines. Three-fourths of the customers have opted for internet service via fiber optic or DSL connections, with almost half of the internet users subscribing to streaming TV and movies. Customers who have opted for Online Backup, Device Protection, Technical Support and Online Security are a minority.
Analyze the churn rate by categorical variables:
A preliminary look at the overall churn rate shows that around 74% of the customers are active. As shown in the chart above, this is an imbalanced classification problem.
Machine learning algorithms work well when the number of instances of each class is roughly equal. Since the dataset is skewed, we need to keep that in mind while choosing the metrics for model selection.
Customers with a prepaid or rather a month-to-month connection have a very high probability to churn compared to their peers on 1 or 2 years contracts.
Positive and negative correlations
Interestingly, the churn rate increases with monthly charges and paperless billing. In contrast, Partner, Dependents and Tenure seem to be negatively correlated with churn.
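One way to produce such a ranked view of correlations, assuming Churn has already been label encoded and the identifier column is named customerID:

```python
# One-hot encode the features and inspect each feature's correlation with
# the (already label-encoded) Churn column, from most positive to most negative.
encoded = pd.get_dummies(df.drop(columns=["customerID"]), dtype=float)
correlations = encoded.corr()["Churn"].drop("Churn").sort_values(ascending=False)
print(correlations)

correlations.plot(kind="bar", figsize=(14, 5))
plt.title("Correlation of features with Churn")
plt.show()
```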
Correlation Matrix
Multicollinearity using VIF
We can see here that the ‘Monthly Charges’ and ‘Total Charges’ have a high VIF value.
After dropping the ‘Total Charges’ variable, the VIF values for all the independent variables decrease to a considerable extent.
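A sketch of the VIF calculation using statsmodels, assuming TotalCharges has already been converted to a numeric type during cleaning:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# VIF for the numeric predictors; TotalCharges is read in as text with some
# blanks, so this assumes it was converted to numeric during cleaning.
numeric = add_constant(
    df[["tenure", "MonthlyCharges", "TotalCharges"]].dropna().astype(float)
)
vif = pd.Series(
    [variance_inflation_factor(numeric.values, i) for i in range(numeric.shape[1])],
    index=numeric.columns,
)
print(vif)

# If MonthlyCharges and TotalCharges show high VIF, drop one of them
# (the article drops TotalCharges) and recompute.
```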
Encode Categorical data
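The remaining multi-category columns can be one-hot encoded, for example (dropping the identifier and TotalCharges columns, per the VIF analysis above, is an assumption about the exact column set):

```python
# One-hot encode the remaining multi-category columns; drop_first avoids
# the dummy-variable trap.
df_encoded = pd.get_dummies(
    df.drop(columns=["customerID", "TotalCharges"]),
    drop_first=True,
)
```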
Split the dataset
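A typical split might look like the following; the 80/20 ratio and random seed are assumptions, as the article does not state them:

```python
# Split the encoded data into training and test sets, stratified on the target.
X = df_encoded.drop(columns=["Churn"])
y = df_encoded["Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```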
Feature Scaling
It’s quite important to normalize the variables before running any machine learning (classification) algorithm, so that all features in the training and test sets are scaled within a range of 0 to 1.
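Scaling to the 0 to 1 range corresponds to min-max scaling, for example:

```python
# Min-max scaling squeezes every feature into the 0-1 range; fit on the
# training set only and reuse the same transformation on the test set.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```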
Section C: Model Selection
Compare Baseline Classification Algorithms
Let’s train each classification algorithm on the training dataset and evaluate its mean accuracy and standard deviation.
Classification accuracy is one of the most common evaluation metrics for comparing baseline algorithms, as it is simply the number of correct predictions as a ratio of total predictions. However, it is not the ideal metric when we have a class imbalance issue. Hence, let us sort the results by the ‘Mean AUC’ value, which measures the model’s ability to discriminate between the positive and negative classes.
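A sketch of this comparison using cross-validation (the article likely compares a longer list of models; the three below are representative):

```python
# Compare baseline classifiers with stratified k-fold cross-validation,
# reporting mean ROC AUC and its standard deviation.
from sklearn.model_selection import StratifiedKFold

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = []
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring="roc_auc")
    results.append((name, scores.mean(), scores.std()))

# Sort by mean AUC, best model first.
for name, mean_auc, std in sorted(results, key=lambda r: r[1], reverse=True):
    print(f"{name}: mean AUC = {mean_auc:.3f} (+/- {std:.3f})")
```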
Getting the right parameters for the baseline models
As we can see from the above iterations, if we use K = 22, then we will get the maximum score of 78%.
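The K value can be found with a simple sweep like the one below (the search range is an assumption); the same pattern applies to tuning n_estimators for the random forest discussed next:

```python
# Sweep K and keep the value with the best cross-validated accuracy
# (the article reports K = 22 as the best setting).
k_scores = {}
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k)
    k_scores[k] = cross_val_score(knn, X_train_scaled, y_train,
                                  cv=10, scoring="accuracy").mean()

best_k = max(k_scores, key=k_scores.get)
print("Best K:", best_k, "score:", round(k_scores[best_k], 3))
```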
As we can see from the iterations above, the random forest model attains its highest accuracy score when n_estimators = 72.
In the second iteration of comparing baseline classification algorithms, we will use the optimized parameters for the KNN and Random Forest models. Also, since false negatives are more costly than false positives in a churn problem, let’s use precision, recall and the F2 score as the metrics for model selection.
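One way to compute these metrics for the tuned models on the test set (a sketch; the article's exact comparison setup isn't shown, and the F2 score weights recall twice as heavily as precision):

```python
# Evaluate each tuned model with precision, recall and the F2 score.
def evaluate(model, X_tr, y_tr, X_te, y_te):
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    return (precision_score(y_te, y_pred),
            recall_score(y_te, y_pred),
            fbeta_score(y_te, y_pred, beta=2))

tuned_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN (K=22)": KNeighborsClassifier(n_neighbors=22),
    "Random Forest (72 trees)": RandomForestClassifier(n_estimators=72, random_state=0),
}
for name, model in tuned_models.items():
    p, r, f2 = evaluate(model, X_train_scaled, y_train, X_test_scaled, y_test)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F2={f2:.3f}")
```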
From the second iteration, we can conclude that logistic regression is the model of choice for the given dataset, as it has the highest combination of precision, recall and F2 scores, giving the most correct positive predictions while minimizing false negatives. Hence, let’s use Logistic Regression and evaluate its performance.
Section D: Model Evaluation
Train & evaluate Chosen Model
k-Fold Cross-Validation: Model evaluation is most commonly done through the k-fold cross-validation technique, which primarily helps us assess and address variance.
A variance problem occurs when we get good accuracy running the model on one training set and test set, but the accuracy looks very different when the model is run on another test set.
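A sketch of the cross-validation step for the chosen model (the fold count is an assumption):

```python
# 10-fold cross-validation of the chosen logistic regression model; the
# min/max of the fold accuracies gives an expected accuracy range.
classifier = LogisticRegression(max_iter=1000)
accuracies = cross_val_score(classifier, X_train_scaled, y_train,
                             cv=10, scoring="accuracy")
print("Accuracy range: {:.1%} - {:.1%}".format(accuracies.min(), accuracies.max()))
print("Mean: {:.1%}  Std: {:.1%}".format(accuracies.mean(), accuracies.std()))
```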
Therefore, our k-fold Cross Validation results indicate that we would have an accuracy anywhere between 76% to 84% while running this model on any test set.
Visualize results on a Confusion Matrix
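A minimal sketch of fitting the chosen model and plotting its confusion matrix on the test set:

```python
# Fit the chosen model and visualize the confusion matrix on the test set.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Stayed", "Churned"], yticklabels=["Stayed", "Churned"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```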
Evaluate the model using ROC Graph
The ROC graph shows the capability of the model to distinguish between the classes, summarized by the AUC score.
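For example, the ROC curve and AUC can be produced from the predicted probabilities of the fitted classifier:

```python
# Plot the ROC curve and report the AUC on the test set
# (uses the classifier fitted in the previous step).
y_proba = classifier.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

plt.plot(fpr, tpr, label=f"Logistic Regression (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```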
Predict Feature Importance
Logistic Regression allows us to determine the key features that have significance in predicting the target attribute (“Churn” in this project).
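A simple way to surface these key features is to rank the fitted model's coefficients, for example:

```python
# Rank features by their logistic regression coefficients; positive weights
# push predictions towards churn, negative ones away from it.
weights = pd.Series(classifier.coef_[0], index=X.columns).sort_values(ascending=False)
print(weights.head(10))   # strongest drivers of churn
print(weights.tail(10))   # strongest indicators of retention
```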
Section E: Model Improvement
Model improvement basically involves choosing the best parameters for the machine learning model we have come up with. There are two types of parameters in any machine learning model: the first type are the parameters that the model learns, whose optimal values are found automatically by training the model; the second type are the ones the user gets to choose while running the model. The latter are called hyperparameters: a set of configurable values external to the model that cannot be determined from the data, and that we try to optimize through parameter tuning techniques like Random Search or Grid Search.
Hyperparameter Tuning via Grid Search
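A sketch of grid search for the logistic regression model; the parameter grid below is illustrative, not the one used in the article:

```python
# Grid search over a small logistic regression hyperparameter grid.
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],   # liblinear supports both l1 and l2 penalties
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring="roc_auc", cv=10, n_jobs=-1)
grid.fit(X_train_scaled, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated AUC:", round(grid.best_score_, 3))
```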
Final Hyperparameter tuning and selection
Section F: Future Predictions
Compare predictions against the test set:
Format Final Results
Unpredictability and risk are close companions of any predictive model. Therefore, in the real world, it’s always good practice to provide a propensity score alongside the absolute predicted outcome. Instead of just retrieving a binary estimated target outcome (0 or 1), every ‘Customer ID’ gets an additional propensity score highlighting its probability of taking the target action.
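A sketch of how the propensity score could be attached to each customer in the final results (the customerID column name is an assumption based on the Kaggle schema):

```python
# Build a results table: binary prediction plus a propensity score
# (predicted probability of churn) for every customer in the test set.
final_results = pd.DataFrame({
    "customerID": df.loc[X_test.index, "customerID"].values,
    "Churn_prediction": classifier.predict(X_test_scaled),
    "Churn_propensity": classifier.predict_proba(X_test_scaled)[:, 1],
})
print(final_results.head())
```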
Conclusion
So, in a nutshell, we used a customer churn dataset from Kaggle to build a machine learning classifier that predicts the propensity of any customer to churn in the months to come, with a reasonable accuracy of 76% to 84%.
Classify the customers
What's next
-- Share the key insights about customer demographics and churn rate gathered from the exploratory data analysis sections with the sales and marketing teams of the organization. Let the sales team know which features have positive and negative correlations with churn so that they can plan retention initiatives accordingly.
-- Further, classify upcoming customers based on the propensity score as high risk (propensity score > 80%), medium risk (propensity score between 60% and 80%) or low risk (propensity score < 60%), as sketched after this list. Focus on each segment of customers upfront and ensure that their needs are well taken care of.
-- Lastly, measure the return on investment (ROI) of this assignment by computing the attrition rate for the current financial quarter. Compare the quarter results with the same quarter last year or the year before and share the outcome with the senior management of your organization.
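A minimal sketch of the risk segmentation described above, applied to the propensity scores from the final results table:

```python
# Bucket customers into risk bands based on their propensity score
# (thresholds follow the segmentation described above).
final_results["Risk_segment"] = pd.cut(
    final_results["Churn_propensity"],
    bins=[0, 0.6, 0.8, 1.0],
    labels=["Low risk", "Medium risk", "High risk"],
    include_lowest=True,
)
print(final_results["Risk_segment"].value_counts())
```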