   customerID
0  7590-VHVEG
1  5575-GNVDE
2  3668-QPYBK
3  7795-CFOCW
4  9237-HQITU
   customerID  gender
0  7590-VHVEG  Female
1  5575-GNVDE    Male
2  3668-QPYBK    Male
3  7795-CFOCW    Male
4  9237-HQITU  Female
   customerID  gender
0  7590-VHVEG       0
1  5575-GNVDE       1
2  3668-QPYBK       1
3  7795-CFOCW       1
4  9237-HQITU       0
5  9305-CDSKC       0
6  1452-KIOVK       1
7  6713-OKOMC       0
8  7892-POOKP       0
9  6388-TABGU       1
     tenure  MonthlyCharges
0 -1.280248       -1.161694
1  0.064303       -0.260878
2 -1.239504       -0.363923
3  0.512486       -0.747850
4 -1.239504        0.196178
5 -0.995040        1.158489
6 -0.424625        0.807802
7 -0.913552       -1.165018
8 -0.180161        1.329677
9  1.205134       -0.287470
What is the purpose of the above code?
Ans: The original (unscaled) numeric columns are dropped so that the scaled versions can take their place in the data frame. The next line merges the scaled values (tenure, MonthlyCharges and TotalCharges) back in, so that all numeric columns share a common, standardized scale (mean 0, standard deviation 1).
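The step described above can be sketched as follows; a minimal, hedged example on a toy stand-in for the Telco frame (`df` and its values are illustrative, not the notebook's actual data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the Telco data frame (illustrative values only)
df = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE"],
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
    "TotalCharges": [29.85, 1889.50],
})

num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]

# Standardize the numeric columns to mean 0, standard deviation 1
scaled = pd.DataFrame(StandardScaler().fit_transform(df[num_cols]),
                      columns=num_cols, index=df.index)

# Drop the raw numeric columns, then merge the scaled versions back in
df = df.drop(columns=num_cols).join(scaled)
print(df.round(3))
```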
                    count      mean
gender             7032.0  0.504693
SeniorCitizen      7032.0  0.162400
Partner            7032.0  0.482509
Dependents         7032.0  0.298493
PhoneService       7032.0  0.903299
OnlineSecurity     7032.0  0.286547
OnlineBackup       7032.0  0.344852
DeviceProtection   7032.0  0.343857
TechSupport        7032.0  0.290102
StreamingTV        7032.0  0.384386
                    gender  SeniorCitizen
gender            1.000000      -0.001819
SeniorCitizen    -0.001819       1.000000
Partner          -0.001379       0.016957
Dependents        0.010349      -0.210550
PhoneService     -0.007515       0.008392
OnlineSecurity   -0.016328      -0.038576
OnlineBackup     -0.013093       0.066663
DeviceProtection -0.000807       0.059514
TechSupport      -0.008507      -0.060577
StreamingTV      -0.007124       0.105445
Q. What do you observe?
Ans: The correlation matrix above shows the correlation coefficients between several variables related to churn.
A few insights worth noting:
Tenure is correlated with TotalCharges and Contract_Two year.
MonthlyCharges is correlated with TotalCharges, MultipleLines, InternetService and OnlineSecurity.
TotalCharges is correlated with tenure and MonthlyCharges.
Customers with multiple lines correlate with tenure, monthly and total charges.
Little correlation is observed for OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport and StreamingMovies.
Interestingly, two-year contracts are correlated with tenure, but one-year contracts are not.
PhoneService and MultipleLines_No phone service are negatively correlated.
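Observations of this kind can also be pulled out programmatically; a hedged sketch on toy data (the columns and values below are illustrative, not the notebook's actual frame):

```python
import pandas as pd

# Toy stand-in for a few numeric Telco columns (illustrative values only)
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75, 820.50],
})

corr = df.corr()

# Flatten the matrix to pairs and keep each off-diagonal pair once
pairs = corr.unstack()
pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]

# Strongest correlations (by absolute value) first
print(pairs.sort_values(key=abs, ascending=False))
```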
Model Building (We will build Decision Tree and Logistic Regression models)
(pip output: statsmodels==0.13.2 and its dependencies already satisfied)
Q. What is the purpose of random_state parameter?
Ans: The random_state parameter seeds the random shuffling inside train_test_split, so the same split is produced on every run. This makes the results reproducible: anyone re-running the notebook gets the same train and test sets, and therefore the same scores.
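A small illustration of the point (the names below are generic, not the notebook's variables):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same random_state produce identical splits
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=42)

print((X_te1 == X_te2).all())  # True
```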
Logistic Regression
Accuracy : 0.8020477815699659
Precision: 0.6888297872340425
Recall   : 0.5285714285714286
F1 score : 0.5981524249422633
Q. What do the scores mean? Is this a good model fit based on the scores? Make sure you print all the scores.
Scores are between 0 and 1, with a larger score indicating a better fit.
We calculate scores in 4 different ways:
Accuracy is the most intuitive performance metric: the proportion of correctly predicted observations out of all observations.
Precision is the ratio of correctly predicted positive observations to all predicted positive observations.
Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.
F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account.
With accuracy around 0.80 but recall only about 0.53, this is a reasonable but not strong fit: the model misses almost half of the actual churners.
Test Data Accuracy: 0.8020
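One way to print all four scores with scikit-learn; `y_test` and `y_pred` below are toy arrays, not the notebook's actual predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels and predictions (illustrative only)
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
```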
Decision Tree
Accuracy : 0.732650739476678
Precision: 0.521551724137931
Recall   : 0.49387755102040815
F1 score : 0.5073375262054507
Q. What do the scores mean? Is this a good model fit based on the scores? Make sure you print all the scores.
The four scores are the same metrics defined above (accuracy, precision, recall and F1). All four are lower than for logistic regression, so the decision tree is a weaker fit on this data.
Test Data Accuracy: 0.7327
Q Which model performs better? (Hint: compare the metrics)
Ans: Comparing the metrics, logistic regression has better accuracy (0.80) than the decision tree (0.73), and also higher precision, recall and F1, so logistic regression performs better.
K-fold Cross Validation
Q. What is K-fold cross validation?
Ans: K-fold cross-validation splits the data sample into k equally sized groups (folds). The model is trained k times, each time holding out a different fold as the validation set and training on the remaining k-1 folds; the k validation scores are then averaged. The single parameter k names the procedure, e.g. k=10 gives 10-fold cross-validation.
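A minimal sketch of 5-fold cross-validation on synthetic data (illustrative only; the notebook's own features and model may differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data as a stand-in for the churn frame
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# cross_val_score trains 5 times, each fold serving once as validation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("fold accuracies:", scores.round(3))
print("mean accuracy  :", scores.mean().round(3))
```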
Q. What do accuracies tell?
Ans: The per-fold accuracies show how well the model generalizes to unseen data and how stable that performance is: if the scores are similar across folds, the estimate is reliable and the model is not overly sensitive to any particular train/test split.
            feature  coefficient
0            gender    -0.089566
1     SeniorCitizen     0.215580
2           Partner     0.002618
3        Dependents    -0.132032
4      PhoneService    -0.178624
5    OnlineSecurity    -0.427250
6      OnlineBackup    -0.188926
7  DeviceProtection    -0.008707
8       TechSupport    -0.311266
9       StreamingTV     0.179637
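A table like the one above can be built by pairing feature names with the fitted model's coefficients; a hedged sketch on synthetic data (the feature names here are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the real notebook uses the encoded churn frame
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
features = ["gender", "SeniorCitizen", "Partner", "Dependents"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# One coefficient per feature, taken from the fitted model
coef_df = pd.DataFrame({"feature": features, "coefficient": model.coef_[0]})
print(coef_df)
```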
Feature Selection/Feature Engineering
[False False False False False True False False True False True True
True False False False True True True False True False False True
False True True False True True True False True]
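A boolean mask like the one printed above is what scikit-learn's RFE (recursive feature elimination) exposes as `support_`; a sketch on synthetic data (the parameters below are illustrative, not the notebook's exact settings):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the encoded churn features
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

# Recursively drop the weakest feature until 5 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

print(rfe.support_)   # boolean mask: True = feature kept
print(rfe.ranking_)   # 1 = selected; higher ranks were eliminated earlier
```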
Accuracy : 0.7986348122866894
Precision: 0.6857923497267759
Recall   : 0.5122448979591837
F1 score : 0.5864485981308412
Test Data Accuracy: 0.7986
Q. Has the model improved after feature selection?
Ans: Comparing the scores before and after feature selection, there is little difference: accuracy was 0.8020 with all features and 0.7986 with the selected subset. In other words, we get essentially the same accuracy from far fewer features, so the dropped features can safely be ignored and the simpler model is preferable.
                          feature  coefficient
0                  OnlineSecurity    -0.420038
1                     TechSupport    -0.290006
2                 StreamingMovies     0.403586
3                PaperlessBilling     0.383069
4                MultipleLines_No    -0.444810
5     InternetService_Fiber optic     0.862811
6              InternetService_No    -0.754770
7         Contract_Month-to-month     0.708079
8               Contract_Two year    -0.777148
9  PaymentMethod_Electronic check     0.328803
Q. Print the final Results
customerID Churn predicted_churn
0 3668-QPYBK 1.0 0
1 7795-CFOCW 0.0 0
2 9237-HQITU 1.0 0
3 8091-TTVAX 0.0 0
4 6865-JZNKO 0.0 0
... ... ... ...
1753 2823-LKABH 0.0 0
1754 6894-LFHLY 1.0 0
1755 0639-TSIQW 1.0 0
1756 4801-JZAZL 0.0 0
1757 3186-AJIEK 0.0 0
[1758 rows x 3 columns]
Q. Provide recommendations based on the feature selection. What should company target for to reduce churn?
Customer churn has a negative impact on a company's profitability, and numerous tactics can be used to reduce it. The best strategy is to know the customers well: identify clients who are at risk of leaving and work to increase their satisfaction, with improved customer service as the first priority.
Concretely, the company could start a loyalty program for senior citizens (a segment the model associates with higher churn) so that they are less inclined to leave. Another tactic: since churn is highest on month-to-month contracts and early in the customer lifetime, the company should offer additional beneficial services at sign-up to prevent early churning.