Data Analysis Exercises #2
i) Logistic Regression
The most significant variables according to p-value are Gender, Vehicle_Age, Vehicle_Damage, Age, Previously_Insured, and Annual_Premium. For example, having a driving license has a strong negative association with the customer NOT buying the vehicle insurance. If an individual has a driver's license, their odds of NOT buying the insurance are multiplied by e^(-0.7474) = 0.47 (meaning they are more likely to buy vehicle insurance).
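The coefficient-to-odds conversion above can be checked with a few lines of Python. This is only a sketch: the coefficient -0.7474 comes from the text, while the function names and the 0.50 baseline probability are hypothetical and chosen purely for illustration.

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in a predictor."""
    return math.exp(coefficient)

def apply_odds_ratio(probability, ratio):
    """Convert a probability to odds, scale the odds, and convert back."""
    odds = probability / (1.0 - probability)
    new_odds = odds * ratio
    return new_odds / (1.0 + new_odds)

# Coefficient quoted in the text for the driving-license indicator.
beta_license = -0.7474
ratio = odds_ratio(beta_license)
print(f"odds ratio: {ratio:.2f}")

# Hypothetical baseline: a customer with a 0.50 probability of NOT buying.
baseline_probability = 0.50
adjusted = apply_odds_ratio(baseline_probability, ratio)
print(f"probability of NOT buying after applying the ratio: {adjusted:.3f}")
```

Note that the ratio scales the odds, not the probability itself, which is why the adjusted probability is not simply 0.47 times the baseline.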
The accuracy of this model is 0.73, and its total estimated loss is $12,202,423. In the summary cell, Age and Annual_Premium have some of the smallest absolute coefficient values despite having low p-values, likely because both are measured on much larger numeric scales than the indicator variables, so the effect per unit is small. Below is the ROC curve to further examine the difference in quality from the baseline model.
Finally, I used the OSR^2 score to test the quality of the model on the test set. The OSR^2 score is 0.309, which is considerably lower than the R^2 score from the model. This suggests that, without variable selection, the logistic regression model may be overfitted.
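For reference, here is a minimal sketch of one common way to compute OSR^2. It assumes the definition 1 - SSE/SST, where the baseline in SST is the mean of the training labels and the predictions are predicted probabilities; the writeup does not state which definition was used, and the function name and toy data below are hypothetical.

```python
def osr2(y_test, y_pred, y_train):
    """Out-of-sample R^2: 1 - SSE/SST, with the training mean as the baseline."""
    baseline = sum(y_train) / len(y_train)
    sse = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    sst = sum((y - baseline) ** 2 for y in y_test)
    return 1.0 - sse / sst

# Hypothetical toy data, not taken from the insurance dataset.
y_train = [0, 1, 0, 1]
y_test = [0, 1, 1, 0]
y_pred = [0.1, 0.9, 0.8, 0.2]  # predicted probabilities on the test set

score = osr2(y_test, y_pred, y_train)
print(f"OSR^2 = {score:.3f}")
```

A score well below the in-sample R^2 means the model beats the training-mean baseline by less on unseen data than on the data it was fitted to.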
ii) CART
The first CART model (no penalty weights), with a ccp_alpha value of 0.0, has an accuracy of 0.804.
With the penalty weights included, the top 20 ccp_alpha values all share the same accuracy: 0.6939. This accuracy is lower than that of the model without the weights.
The accuracy of the model with ccp_alpha = 0.0095 is 0.7343, and the model would lose a total of $36,478,652.
As the plot above shows, there are more than 5,800 nodes in this particular CART model. Though the accuracy of this model is higher than that of the one without the penalty weights, the number of nodes makes the model uninterpretable. Let's compare the total loss of these two versions of the CART model to decide which one is better.
The total loss of this model is slightly lower; therefore, of the two models, the CART model with penalty weights and ccp_alpha = 0.0 seems to be the better one.
Conclusion: The total loss of the logistic regression model is $12,202,423, while the total loss of the CART model is $27,849,457. Based on these values, I'd recommend using the logistic regression model to classify which potential customers should be offered the free additional service.
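The comparison by total loss can be sketched as a cost-weighted sum of misclassifications. Everything in this example is an assumption made for illustration: the error counts are made up, and mapping the $500 and $3000 figures mentioned in this writeup to false positives and false negatives is a guess, not the cost structure actually used to produce the totals above.

```python
def total_loss(false_positives, false_negatives, cost_fp, cost_fn):
    """Total cost of a classifier's errors under fixed per-error costs."""
    return false_positives * cost_fp + false_negatives * cost_fn

# Assumed per-error costs in dollars (hypothetical mapping, see lead-in).
COST_FP = 500
COST_FN = 3000

# Hypothetical error counts for two candidate models.
error_counts = {
    "logistic_regression": {"fp": 120, "fn": 30},
    "cart": {"fp": 80, "fn": 75},
}

losses = {
    name: total_loss(counts["fp"], counts["fn"], COST_FP, COST_FN)
    for name, counts in error_counts.items()
}
best_model = min(losses, key=losses.get)

for name, loss in losses.items():
    print(f"{name}: ${loss:,}")
print(f"lowest total loss: {best_model}")
```

With asymmetric costs like these, a model with more errors overall can still have the lower total loss, which is why the recommendation rests on loss rather than accuracy.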
c) 0.5 <= alpha <= 0.7 with the logistic regression model
As alpha increases, the threshold probability of a customer NOT buying the vehicle insurance increases.
As the alpha value increases, the sensitivity of the model decreases from close to 1 to around 0.75. This may indicate an increase in the number of customers for whom the model pays for the $500 additional service and still ends up losing $3000. My best educated guess for the final choice is to find the inflection point of the graph above, set that as the alpha value, and then proceed with completing the model.
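The relationship between the threshold and sensitivity can be illustrated with a small sweep. This is a sketch under stated assumptions: it treats the classification threshold directly as the quantity being varied (the exact mapping from alpha to the threshold is not given here), and the labels and predicted probabilities are hypothetical toy values.

```python
def sensitivity_at(threshold, y_true, probs):
    """True positive rate when predicting positive for probs >= threshold."""
    tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < threshold)
    return tp / (tp + fn) if (tp + fn) else float("nan")

# Hypothetical labels (1 = did NOT buy) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.95, 0.72, 0.64, 0.55, 0.60, 0.45, 0.30, 0.10]

thresholds = [0.50, 0.55, 0.60, 0.65, 0.70]
curve = [(t, sensitivity_at(t, y_true, probs)) for t in thresholds]

for t, s in curve:
    print(f"threshold {t:.2f}: sensitivity {s:.2f}")
```

Raising the threshold can only move positive cases from predicted-positive to predicted-negative, so sensitivity never increases along the sweep; the point where the drop becomes steepest is the kind of point the final paragraph proposes to look for.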