Data Analysis Exercises #2
i) Logistic Regression
Optimization terminated successfully.
Current function value: 0.417442
Iterations 10
                           Logit Regression Results
==============================================================================
Dep. Variable:           Does_Not_Buy   No. Observations:                36083
Model:                          Logit   Df Residuals:                    36073
Method:                           MLE   Df Model:                            9
Date:                Fri, 22 Oct 2021   Pseudo R-squ.:                  0.3222
Time:                        03:29:56   Log-Likelihood:                -15063.
converged:                       True   LL-Null:                       -22223.
Covariance Type:            nonrobust   LLR p-value:                     0.000
============================================================================================
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                    1.8001      0.314      5.738      0.000       1.185       2.415
Gender[T.Male]              -0.1550      0.028     -5.486      0.000      -0.210      -0.100
Vehicle_Age[T.< 1 Year]      1.2106      0.046     26.357      0.000       1.121       1.301
Vehicle_Age[T.> 2 Years]    -0.1607      0.050     -3.234      0.001      -0.258      -0.063
Vehicle_Damage[T.Yes]       -1.9972      0.073    -27.297      0.000      -2.141      -1.854
Age                          0.0211      0.001     15.359      0.000       0.018       0.024
Driving_License             -0.7474      0.293     -2.548      0.011      -1.322      -0.172
Previously_Insured           3.8464      0.159     24.205      0.000       3.535       4.158
Annual_Premium              -0.0003   5.74e-05     -5.146      0.000      -0.000      -0.000
Vintage                     -0.0004      0.000     -2.213      0.027      -0.001   -4.15e-05
============================================================================================
The most significant variables by p-value are Gender, Vehicle_Age, Vehicle_Damage, Age, Previously_Insured, and Annual_Premium. Driving_License also shows a strong negative association with customers NOT buying the vehicle insurance: if an individual has a driver's license, their odds of NOT buying the insurance are multiplied by e^(-0.7474) ≈ 0.47, meaning they are more likely to buy vehicle insurance.
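The odds interpretation above comes from exponentiating the logit coefficient. A minimal sketch (the coefficient value is taken from the regression summary above):

```python
import math

# Coefficient on Driving_License from the fitted logit model above
beta_driving_license = -0.7474

# exp(beta) is the multiplicative change in the odds of the outcome
# (Does_Not_Buy = 1) for a one-unit increase in the variable.
odds_ratio = math.exp(beta_driving_license)
print(round(odds_ratio, 4))  # ≈ 0.4736
```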
Confusion Matrix :
[[1157 2511]
[ 705 7655]]
Accuracy: 0.7326238776188893
The accuracy of this model is 0.73, and the total estimated loss from this model is $12,202,423. In the summary above, Age and Annual_Premium have some of the smallest absolute coefficient values despite their low p-values. Below is the ROC curve to further examine how the model improves on the baseline.
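The accuracy printed above can be reproduced directly from the confusion matrix; a quick sketch using the matrix values from this section:

```python
import numpy as np

# Confusion matrix of the logistic regression model above:
# rows = actual class, columns = predicted class
cm = np.array([[1157, 2511],
               [ 705, 7655]])

# Accuracy = correct predictions (diagonal) / all predictions
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.7326238776188893
```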
Finally, I used the OSR^2 score to test the quality of the model on the test set. The OSR^2 score is 0.309, which is noticeably lower than the R^2 score on the training set. This may indicate that, without variable selection, the logistic regression model is overfit.
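The exact OSR^2 formula used here isn't shown; a generic sketch, assuming the common squared-error definition OSR^2 = 1 - SSE_test / SST_test, where SST is computed against the training-set mean (the data below is a hypothetical illustration, not the assignment's dataset):

```python
import numpy as np

def osr2(y_test, y_pred, y_train):
    """Out-of-sample R^2: 1 - SSE/SST, with SST measured against
    the mean of the *training* labels (a common convention)."""
    sse = np.sum((y_test - y_pred) ** 2)
    sst = np.sum((y_test - np.mean(y_train)) ** 2)
    return 1 - sse / sst

# Hypothetical labels and predicted probabilities:
y_train = np.array([0, 0, 1, 1, 1])
y_test  = np.array([0, 1, 1, 0])
y_pred  = np.array([0.2, 0.7, 0.9, 0.4])
print(round(osr2(y_test, y_pred, y_train), 3))  # 0.712
```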
ii) CART
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36083 entries, 0 to 36082
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Gender              36083 non-null  object
 1   Age                 36083 non-null  int64
 2   Driving_License     36083 non-null  int64
 3   Previously_Insured  36083 non-null  int64
 4   Vehicle_Age         36083 non-null  object
 5   Vehicle_Damage      36083 non-null  object
 6   Annual_Premium      36083 non-null  float64
 7   Vintage             36083 non-null  int64
 8   Does_Not_Buy        36083 non-null  int64
dtypes: float64(1), int64(5), object(3)
memory usage: 2.5+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36083 entries, 0 to 36082
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Age                    36083 non-null  int64
 1   Driving_License        36083 non-null  int64
 2   Previously_Insured     36083 non-null  int64
 3   Annual_Premium         36083 non-null  float64
 4   Vintage                36083 non-null  int64
 5   Does_Not_Buy           36083 non-null  int64
 6   Gender_Male            36083 non-null  uint8
 7   Vehicle_Age_< 1 Year   36083 non-null  uint8
 8   Vehicle_Age_> 2 Years  36083 non-null  uint8
 9   Vehicle_Damage_Yes     36083 non-null  uint8
dtypes: float64(1), int64(5), uint8(4)
memory usage: 1.8 MB
Fitting 10 folds for each of 201 candidates, totalling 2010 fits
The first CART model (no penalties reflected) with a ccp_alpha value of 0.0 has an accuracy of 0.804.
Fitting 10 folds for each of 201 candidates, totalling 2010 fits
With the class-weight penalties included, all of the top 20 ccp_alpha values share the same accuracy: 0.6939, which is lower than that of the unweighted model.
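The tuning reflected in the output above can be sketched as a cross-validated grid search over ccp_alpha on a class-weighted tree. This is a sketch on synthetic data; the grid, fold count, and class_weight values are assumptions, not the assignment's exact settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the insurance data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hypothetical asymmetric penalty: errors on class 1 cost more
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 6}, random_state=0)

# 10-fold CV over a grid of cost-complexity pruning strengths
grid = GridSearchCV(tree,
                    param_grid={"ccp_alpha": np.linspace(0.0, 0.02, 21)},
                    cv=10, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 4))
```

Larger ccp_alpha values prune more aggressively, trading accuracy for a smaller, more interpretable tree.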
Node count = 5
Confusion Matrix :
[[ 9779 1264]
[ 8323 16717]]
Accuracy is: 0.7343
TPR is: 0.6676
FPR is: 0.1145
The accuracy of the model with ccp_alpha = 0.0095 is 0.7343, and the model would lose a total of $36,478,652.
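The accuracy, TPR, and FPR above follow directly from the confusion matrix; a sketch using the pruned model's matrix from this section:

```python
import numpy as np

# Confusion matrix for the pruned tree (ccp_alpha = 0.0095):
# rows = actual, columns = predicted; class 1 = Does_Not_Buy
cm = np.array([[ 9779,  1264],
               [ 8323, 16717]])
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()
tpr = tp / (tp + fn)   # sensitivity / recall
fpr = fp / (fp + tn)   # 1 - specificity
print(round(accuracy, 4), round(tpr, 4), round(fpr, 4))  # 0.7343 0.6676 0.1145
```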
Node count = 5827
Confusion Matrix :
[[ 4894 6149]
[ 0 25040]]
Accuracy is: 0.8296
TPR is: 1.0000
FPR is: 0.5568
As the plot above shows, this particular CART model has more than 5,800 nodes. Though this model's accuracy is higher, the sheer number of nodes makes it uninterpretable. Let's compare the total loss of these two versions of the CART model to decide which one is better.
The total loss of this model is slightly lower; therefore, out of the two models, the CART model with penalty weights and ccp_alpha = 0.0 seems to be the better one.
Conclusion: The total loss of the logistic regression model is $12,202,423, while the total loss of the CART model is $27,849,457. Given these values, I'd recommend the logistic regression model for classifying which potential customers to offer the free additional service.
iii) 0.5 <= a <= 0.7 with logistic regression model
As alpha increases, the threshold probability for classifying a customer as NOT buying the vehicle insurance increases.
As the alpha value increases, the sensitivity of the model decreases from close to 1 to around 0.75. This may indicate an increase in the number of customers for whom the model pays for the $500 additional service yet still ends up losing $3,000. My best educated guess for the final choice is to find the inflection point of the graph above, set that as the alpha value, and then proceed with completing the model.
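The threshold sweep described above can be sketched as follows. The probabilities and labels here are synthetic stand-ins for the model's predicted P(Does_Not_Buy); raising the threshold can only shrink the predicted-positive set, so sensitivity is non-increasing in alpha:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: predicted probabilities and noisy true labels
p_hat = rng.uniform(0, 1, 1000)
y = (p_hat + rng.normal(0, 0.3, 1000) > 0.5).astype(int)

for alpha in np.arange(0.50, 0.71, 0.05):
    pred = (p_hat >= alpha).astype(int)          # classify at threshold alpha
    tp = np.sum((pred == 1) & (y == 1))
    fn = np.sum((pred == 0) & (y == 1))
    sensitivity = tp / (tp + fn)                  # TPR at this threshold
    print(f"alpha={alpha:.2f}  sensitivity={sensitivity:.3f}")
```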