Data Analysis Exercises #2
i) Logistic Regression
Optimization terminated successfully.
Current function value: 0.417442
Iterations 10
                           Logit Regression Results
==============================================================================
Dep. Variable:           Does_Not_Buy   No. Observations:                36083
Model:                          Logit   Df Residuals:                    36073
Method:                           MLE   Df Model:                            9
Date:                Fri, 22 Oct 2021   Pseudo R-squ.:                  0.3222
Time:                        03:29:56   Log-Likelihood:                -15063.
converged:                       True   LL-Null:                       -22223.
Covariance Type:            nonrobust   LLR p-value:                     0.000
============================================================================================
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                    1.8001      0.314      5.738      0.000       1.185       2.415
Gender[T.Male]              -0.1550      0.028     -5.486      0.000      -0.210      -0.100
Vehicle_Age[T.< 1 Year]      1.2106      0.046     26.357      0.000       1.121       1.301
Vehicle_Age[T.> 2 Years]    -0.1607      0.050     -3.234      0.001      -0.258      -0.063
Vehicle_Damage[T.Yes]       -1.9972      0.073    -27.297      0.000      -2.141      -1.854
Age                          0.0211      0.001     15.359      0.000       0.018       0.024
Driving_License             -0.7474      0.293     -2.548      0.011      -1.322      -0.172
Previously_Insured           3.8464      0.159     24.205      0.000       3.535       4.158
Annual_Premium              -0.0003   5.74e-05     -5.146      0.000      -0.000      -0.000
Vintage                     -0.0004      0.000     -2.213      0.027      -0.001   -4.15e-05
============================================================================================
The most significant variables by p-value are Gender, Vehicle_Age, Vehicle_Damage, Age, Previously_Insured, and Annual_Premium. Driving_License also shows a strong negative association with customers NOT buying the vehicle insurance: if an individual has a driver's license, their odds of NOT buying the insurance are multiplied by e^(-0.7474) ≈ 0.47, meaning they are more likely to buy vehicle insurance.
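The odds interpretation above comes from exponentiating the logit coefficient. A minimal sketch (the coefficient value is taken from the regression summary above):

```python
import math

# Coefficient on Driving_License from the fitted logit model above
beta_driving_license = -0.7474

# exp(beta) is the multiplicative change in the odds of the outcome
# (Does_Not_Buy = 1) for a one-unit increase in the variable.
odds_ratio = math.exp(beta_driving_license)
print(round(odds_ratio, 4))  # ≈ 0.4736
```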
Confusion Matrix :
[[1157 2511]
[ 705 7655]]
Accuracy: 0.7326238776188893
The accuracy of this model is 0.73, and the total estimated loss from this model is $12,202,423. In the summary above, Age and Annual_Premium have some of the smallest absolute coefficient values despite their low p-values. Below is the ROC curve to further examine how the model improves on the baseline.
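The accuracy printed above can be reproduced directly from the confusion matrix; a quick sketch using the matrix values from this section:

```python
import numpy as np

# Confusion matrix of the logistic regression model above:
# rows = actual class, columns = predicted class
cm = np.array([[1157, 2511],
               [ 705, 7655]])

# Accuracy = correct predictions (diagonal) / all predictions
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.7326238776188893
```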
Finally, I used the OSR^2 score to test the quality of the model on the test set. The OSR^2 score is 0.309, which is noticeably lower than the R^2 score on the training set. This may indicate that, without variable selection, the logistic regression model is overfit.
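The exact OSR^2 formula used here isn't shown; a generic sketch, assuming the common squared-error definition OSR^2 = 1 - SSE_test / SST_test, where SST is computed against the training-set mean (the data below is a hypothetical illustration, not the assignment's dataset):

```python
import numpy as np

def osr2(y_test, y_pred, y_train):
    """Out-of-sample R^2: 1 - SSE/SST, with SST measured against
    the mean of the *training* labels (a common convention)."""
    sse = np.sum((y_test - y_pred) ** 2)
    sst = np.sum((y_test - np.mean(y_train)) ** 2)
    return 1 - sse / sst

# Hypothetical labels and predicted probabilities:
y_train = np.array([0, 0, 1, 1, 1])
y_test  = np.array([0, 1, 1, 0])
y_pred  = np.array([0.2, 0.7, 0.9, 0.4])
print(round(osr2(y_test, y_pred, y_train), 3))  # 0.712
```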
ii) CART
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36083 entries, 0 to 36082
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Gender              36083 non-null  object
 1   Age                 36083 non-null  int64
 2   Driving_License     36083 non-null  int64
 3   Previously_Insured  36083 non-null  int64
 4   Vehicle_Age         36083 non-null  object
 5   Vehicle_Damage      36083 non-null  object
 6   Annual_Premium      36083 non-null  float64
 7   Vintage             36083 non-null  int64
 8   Does_Not_Buy        36083 non-null  int64
dtypes: float64(1), int64(5), object(3)
memory usage: 2.5+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36083 entries, 0 to 36082
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Age                    36083 non-null  int64
 1   Driving_License        36083 non-null  int64
 2   Previously_Insured     36083 non-null  int64
 3   Annual_Premium         36083 non-null  float64
 4   Vintage                36083 non-null  int64
 5   Does_Not_Buy           36083 non-null  int64
 6   Gender_Male            36083 non-null  uint8
 7   Vehicle_Age_< 1 Year   36083 non-null  uint8
 8   Vehicle_Age_> 2 Years  36083 non-null  uint8
 9   Vehicle_Damage_Yes     36083 non-null  uint8
dtypes: float64(1), int64(5), uint8(4)
memory usage: 1.8 MB
Fitting 10 folds for each of 201 candidates, totalling 2010 fits
The first CART model (no penalties reflected) with a ccp_alpha value of 0.0 has an accuracy of 0.804.
Fitting 10 folds for each of 201 candidates, totalling 2010 fits
With the class-weight penalties included, all of the top 20 ccp_alpha values share the same accuracy: 0.6939, which is lower than that of the unweighted model.
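The tuning reflected in the output above can be sketched as a cross-validated grid search over ccp_alpha on a class-weighted tree. This is a sketch on synthetic data; the grid, fold count, and class_weight values are assumptions, not the assignment's exact settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the insurance data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hypothetical asymmetric penalty: errors on class 1 cost more
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 6}, random_state=0)

# 10-fold CV over a grid of cost-complexity pruning strengths
grid = GridSearchCV(tree,
                    param_grid={"ccp_alpha": np.linspace(0.0, 0.02, 21)},
                    cv=10, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 4))
```

Larger ccp_alpha values prune more aggressively, trading accuracy for a smaller, more interpretable tree.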
Node count = 5
Confusion Matrix :
[[ 9779 1264]
[ 8323 16717]]
Accuracy is: 0.7343
TPR is: 0.6676
FPR is: 0.1145
The accuracy of the model with ccp_alpha = 0.0095 is 0.7343, and the model would lose a total of $36,478,652.
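The accuracy, TPR, and FPR above follow directly from the confusion matrix; a sketch using the pruned model's matrix from this section:

```python
import numpy as np

# Confusion matrix for the pruned tree (ccp_alpha = 0.0095):
# rows = actual, columns = predicted; class 1 = Does_Not_Buy
cm = np.array([[ 9779,  1264],
               [ 8323, 16717]])
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()
tpr = tp / (tp + fn)   # sensitivity / recall
fpr = fp / (fp + tn)   # 1 - specificity
print(round(accuracy, 4), round(tpr, 4), round(fpr, 4))  # 0.7343 0.6676 0.1145
```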
Node count = 5827
Confusion Matrix :
[[ 4894 6149]
[ 0 25040]]
Accuracy is: 0.8296
TPR is: 1.0000
FPR is: 0.5568
As the plot above shows, this particular CART model has more than 5,800 nodes. Though this model's accuracy is higher, the sheer number of nodes makes it uninterpretable. Let's compare the total loss of these two versions of the CART model to decide which one is better.
The total loss of this model is slightly lower; therefore, out of the two models, the CART model with penalty weights and ccp_alpha = 0.0 seems to be the better one.
Conclusion: The total loss of the logistic regression model is $12,202,423, while the total loss of the CART model is $27,849,457. Given these values, I'd recommend the logistic regression model for classifying which potential customers to offer the free additional service.
iii) 0.5 <= a <= 0.7 with logistic regression model
As alpha increases, the threshold probability for classifying a customer as NOT buying the vehicle insurance increases.
As the alpha value increases, the sensitivity of the model decreases from close to 1 to around 0.75. This may indicate an increase in the number of customers for whom the model pays for the $500 additional service yet still ends up losing $3,000. My best educated guess for the final choice is to find the inflection point of the graph above, set that as the alpha value, and then proceed with completing the model.
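The threshold sweep described above can be sketched as follows. The probabilities and labels here are synthetic stand-ins for the model's predicted P(Does_Not_Buy); raising the threshold can only shrink the predicted-positive set, so sensitivity is non-increasing in alpha:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: predicted probabilities and noisy true labels
p_hat = rng.uniform(0, 1, 1000)
y = (p_hat + rng.normal(0, 0.3, 1000) > 0.5).astype(int)

for alpha in np.arange(0.50, 0.71, 0.05):
    pred = (p_hat >= alpha).astype(int)          # classify at threshold alpha
    tp = np.sum((pred == 1) & (y == 1))
    fn = np.sum((pred == 0) & (y == 1))
    sensitivity = tp / (tp + fn)                  # TPR at this threshold
    print(f"alpha={alpha:.2f}  sensitivity={sensitivity:.3f}")
```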