   customerID
0  7590-VHVEG
1  5575-GNVDE
2  3668-QPYBK
3  7795-CFOCW
4  9237-HQITU
   customerID  gender
0  7590-VHVEG  Female
1  5575-GNVDE    Male
2  3668-QPYBK    Male
3  7795-CFOCW    Male
4  9237-HQITU  Female
   customerID  gender
0  7590-VHVEG       0
1  5575-GNVDE       1
2  3668-QPYBK       1
3  7795-CFOCW       1
4  9237-HQITU       0
5  9305-CDSKC       0
6  1452-KIOVK       1
7  6713-OKOMC       0
8  7892-POOKP       0
9  6388-TABGU       1
     tenure  MonthlyCharges
0 -1.280248       -1.161694
1  0.064303       -0.260878
2 -1.239504       -0.363923
3  0.512486       -0.747850
4 -1.239504        0.196178
5 -0.995040        1.158489
6 -0.424625        0.807802
7 -0.913552       -1.165018
8 -0.180161        1.329677
9  1.205134       -0.287470
What is the purpose of the above code?
Ans: The original (unscaled) numeric columns are dropped so that the scaled versions can take their place in the data frame. The next line merges the scaled values (tenure, MonthlyCharges and TotalCharges) back in, so that all numeric columns share a common, standardized scale (mean 0, standard deviation 1).
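The step described above can be sketched as follows; a minimal, hedged example on a toy stand-in for the Telco frame (`df` and its values are illustrative, not the notebook's actual data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the Telco data frame (illustrative values only)
df = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE"],
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
    "TotalCharges": [29.85, 1889.50],
})

num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]

# Standardize the numeric columns to mean 0, standard deviation 1
scaled = pd.DataFrame(StandardScaler().fit_transform(df[num_cols]),
                      columns=num_cols, index=df.index)

# Drop the raw numeric columns, then merge the scaled versions back in
df = df.drop(columns=num_cols).join(scaled)
print(df.round(3))
```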
                    count      mean
gender             7032.0  0.504693
SeniorCitizen      7032.0  0.162400
Partner            7032.0  0.482509
Dependents         7032.0  0.298493
PhoneService       7032.0  0.903299
OnlineSecurity     7032.0  0.286547
OnlineBackup       7032.0  0.344852
DeviceProtection   7032.0  0.343857
TechSupport        7032.0  0.290102
StreamingTV        7032.0  0.384386
                    gender  SeniorCitizen
gender            1.000000      -0.001819
SeniorCitizen    -0.001819       1.000000
Partner          -0.001379       0.016957
Dependents        0.010349      -0.210550
PhoneService     -0.007515       0.008392
OnlineSecurity   -0.016328      -0.038576
OnlineBackup     -0.013093       0.066663
DeviceProtection -0.000807       0.059514
TechSupport      -0.008507      -0.060577
StreamingTV      -0.007124       0.105445
Q. What do you observe?
Ans: The correlation matrix above shows the correlation coefficients between several variables related to churn.
A few insights worth noting:
Tenure is correlated with TotalCharges and Contract_Two year.
MonthlyCharges is correlated with TotalCharges, MultipleLines, InternetService and OnlineSecurity.
TotalCharges is correlated with tenure and MonthlyCharges.
Customers with multiple lines correlate with tenure, monthly and total charges.
Little correlation is observed for OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport and StreamingMovies.
Interestingly, two-year contracts are correlated with tenure, but one-year contracts are not.
PhoneService and MultipleLines_No phone service are negatively correlated.
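Observations of this kind can also be pulled out programmatically; a hedged sketch on toy data (the columns and values below are illustrative, not the notebook's actual frame):

```python
import pandas as pd

# Toy stand-in for a few numeric Telco columns (illustrative values only)
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75, 820.50],
})

corr = df.corr()

# Flatten the matrix to pairs and keep each off-diagonal pair once
pairs = corr.unstack()
pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]

# Strongest correlations (by absolute value) first
print(pairs.sort_values(key=abs, ascending=False))
```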
Model Building (We will build Decision Tree and Logistic Regression models)
(pip output: statsmodels==0.13.2 and its dependencies already satisfied)
Q. What is the purpose of random_state parameter?
Ans: The random_state parameter seeds the random shuffling inside train_test_split, so the same split is produced on every run. This makes the results reproducible: anyone re-running the notebook gets the same train and test sets, and therefore the same scores.
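A small illustration of the point (the names below are generic, not the notebook's variables):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same random_state produce identical splits
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=42)

print((X_te1 == X_te2).all())  # True
```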
Logistic Regression
Accuracy : 0.8020477815699659
Precision: 0.6888297872340425
Recall   : 0.5285714285714286
F1 score : 0.5981524249422633
Q. What do the scores mean? Is this a good model fit based on the scores? Make sure you print all the scores.
Scores are between 0 and 1, with a larger score indicating a better fit.
We calculate scores in 4 different ways:
Accuracy is the most intuitive performance metric: the proportion of correctly predicted observations out of all observations.
Precision is the ratio of correctly predicted positive observations to all predicted positive observations.
Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.
F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account.
With accuracy around 0.80 but recall only about 0.53, this is a reasonable but not strong fit: the model misses almost half of the actual churners.
Test Data Accuracy: 0.8020
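One way to print all four scores with scikit-learn; `y_test` and `y_pred` below are toy arrays, not the notebook's actual predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels and predictions (illustrative only)
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
```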
Decision Tree
Accuracy : 0.732650739476678
Precision: 0.521551724137931
Recall   : 0.49387755102040815
F1 score : 0.5073375262054507
Q. What do the scores mean? Is this a good model fit based on the scores? Make sure you print all the scores.
The four scores are the same metrics defined above (accuracy, precision, recall and F1). All four are lower than for logistic regression, so the decision tree is a weaker fit on this data.
Test Data Accuracy: 0.7327
Q Which model performs better? (Hint: compare the metrics)
Ans: Comparing the metrics, logistic regression has better accuracy (0.80) than the decision tree (0.73), and also higher precision, recall and F1, so logistic regression performs better.
K-fold Cross Validation
Q. What is K-fold cross validation?
Ans: K-fold cross-validation splits the data sample into k equally sized groups (folds). The model is trained k times, each time holding out a different fold as the validation set and training on the remaining k-1 folds; the k validation scores are then averaged. The single parameter k names the procedure, e.g. k=10 gives 10-fold cross-validation.
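A minimal sketch of 5-fold cross-validation on synthetic data (illustrative only; the notebook's own features and model may differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data as a stand-in for the churn frame
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# cross_val_score trains 5 times, each fold serving once as validation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("fold accuracies:", scores.round(3))
print("mean accuracy  :", scores.mean().round(3))
```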
Q. What do accuracies tell?
Ans: The per-fold accuracies show how well the model generalizes to unseen data and how stable that performance is: if the scores are similar across folds, the estimate is reliable and the model is not overly sensitive to any particular train/test split.
            feature  coefficient
0            gender    -0.089566
1     SeniorCitizen     0.215580
2           Partner     0.002618
3        Dependents    -0.132032
4      PhoneService    -0.178624
5    OnlineSecurity    -0.427250
6      OnlineBackup    -0.188926
7  DeviceProtection    -0.008707
8       TechSupport    -0.311266
9       StreamingTV     0.179637
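A table like the one above can be built by pairing feature names with the fitted model's coefficients; a hedged sketch on synthetic data (the feature names here are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the real notebook uses the encoded churn frame
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
features = ["gender", "SeniorCitizen", "Partner", "Dependents"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# One coefficient per feature, taken from the fitted model
coef_df = pd.DataFrame({"feature": features, "coefficient": model.coef_[0]})
print(coef_df)
```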
Feature Selection/Feature Engineering
[False False False False False True False False True False True True
True False False False True True True False True False False True
False True True False True True True False True]
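A boolean mask like the one printed above is what scikit-learn's RFE (recursive feature elimination) exposes as `support_`; a sketch on synthetic data (the parameters below are illustrative, not the notebook's exact settings):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the encoded churn features
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

# Recursively drop the weakest feature until 5 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

print(rfe.support_)   # boolean mask: True = feature kept
print(rfe.ranking_)   # 1 = selected; higher ranks were eliminated earlier
```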
Accuracy : 0.7986348122866894
Precision: 0.6857923497267759
Recall   : 0.5122448979591837
F1 score : 0.5864485981308412
Test Data Accuracy: 0.7986
Q. Has the model improved after feature selection?
Ans: Comparing the scores before and after feature selection, there is little difference: accuracy was 0.8020 with all features and 0.7986 with the selected subset. In other words, we get essentially the same accuracy from far fewer features, so the dropped features can safely be ignored and the simpler model is preferable.
                          feature  coefficient
0                  OnlineSecurity    -0.420038
1                     TechSupport    -0.290006
2                 StreamingMovies     0.403586
3                PaperlessBilling     0.383069
4                MultipleLines_No    -0.444810
5     InternetService_Fiber optic     0.862811
6              InternetService_No    -0.754770
7         Contract_Month-to-month     0.708079
8               Contract_Two year    -0.777148
9  PaymentMethod_Electronic check     0.328803
Q. Print the final Results
customerID Churn predicted_churn
0 3668-QPYBK 1.0 0
1 7795-CFOCW 0.0 0
2 9237-HQITU 1.0 0
3 8091-TTVAX 0.0 0
4 6865-JZNKO 0.0 0
... ... ... ...
1753 2823-LKABH 0.0 0
1754 6894-LFHLY 1.0 0
1755 0639-TSIQW 1.0 0
1756 4801-JZAZL 0.0 0
1757 3186-AJIEK 0.0 0
[1758 rows x 3 columns]
Q. Provide recommendations based on the feature selection. What should company target for to reduce churn?
Customer churn has a negative impact on a company's profitability, and numerous tactics can be used to reduce it. The best strategy is to know the customers well: identify clients who are at risk of leaving and work to increase their satisfaction, with improved customer service as the first priority.
Concretely, the company could start a loyalty program for senior citizens (a segment the model associates with higher churn) so that they are less inclined to leave. Another tactic: since churn is highest on month-to-month contracts and early in the customer lifetime, the company should offer additional beneficial services at sign-up to prevent early churning.