IEOR 142 HW 5 Pt. B-C

Part B

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

To keep the kernel from dying, I preprocessed 'stack_stats_final.csv' from Part A outside this notebook and uploaded CSV files that are ready to use in Part B. The steps that produced 'train.csv' and 'test.csv' were: 1) retokenize 'stack_stats_final.csv'; 2) add 'isUseful', a binary column whose value 1 indicates that the question is useful ('Score' greater than or equal to 1); 3) split the rows into 'train.csv' and 'test.csv' based on the ID values of the provided train and test datasets.
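Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration on a made-up table, not the actual preprocessing script: the column 'word_python', the Id values, and the names `stats`, `train_demo`, and `test_demo` are all hypothetical, and the retokenizing in step 1 is not shown.

```python
import pandas as pd

# Hypothetical stand-in for 'stack_stats_final.csv'; the real file has many more columns.
stats = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "Score": [3, 0, 1, -2, 5, 0],
    "word_python": [1, 0, 2, 0, 1, 0],
})

# Step 2: binary label, 1 when Score >= 1 and 0 otherwise.
stats["isUseful"] = (stats["Score"] >= 1).astype(int)

# Step 3: split by the Ids found in the provided train/test sets (made-up Ids here).
train_ids = {1, 2, 3, 4}
test_ids = {5, 6}
train_demo = stats[stats["Id"].isin(train_ids)].reset_index(drop=True)
test_demo = stats[stats["Id"].isin(test_ids)].reset_index(drop=True)

print(train_demo)
print(test_demo)
```

Splitting on Id rather than on row position keeps the partition aligned with the provided datasets even if the rows were reordered during retokenizing.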

test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

#make accuracy function to calculate accuracy
def accuracy(matrix_result):
    tn, fp, fn, tp = matrix_result
    return (tn + tp)/(tn + fp + fn + tp)

#select feature_cols because 'Unnamed: 0', 'Id', and 'Score' should not influence the model
feature_cols = train.columns[3:621]
x_train = train[feature_cols]
x_test = test[feature_cols]
y_train = train.isUseful
y_test = test['isUseful']

Logistic Regression

!pip install statsmodels==0.13.0

import os
import statsmodels.api as sm

#build a logistic model
logreg = sm.Logit(y_train, x_train).fit()
print(logreg.summary())

#predicted probabilities, thresholded at 1/2 to get class labels
y_prob_logreg = logreg.predict(x_test)
y_pred_logreg = pd.Series([1 if p > 1/2 else 0 for p in y_prob_logreg], index=y_prob_logreg.index)

from sklearn.metrics import confusion_matrix
logreg_confusion = confusion_matrix(y_test, y_pred_logreg).ravel()

accuracy(logreg_confusion)

The accuracy of the Logistic Regression model using all features from the dataframe is 0.528. I kept all features in this section because the words in the questions were already filtered in Part A.

LDA

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(x_train, y_train)

y_pred_lda = lda.predict(x_test)

from sklearn.metrics import confusion_matrix
lda_confusion = confusion_matrix(y_test, y_pred_lda).ravel()

accuracy(lda_confusion)

The accuracy of the LDA model from this trial is 0.529, just slightly higher than Logistic Regression.

CART

from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

#cross-validation to get the optimal ccp alpha
#fits reduced from 2010 (demonstrated in lab) to 110 & max_depth from 30 to 20 for sake of time
grid_values = {'ccp_alpha': np.linspace(0, 0.10, 11),
               'min_samples_leaf': [5],
               'min_samples_split': [20],
               'max_depth': [20],
               'random_state': [88]}
dtc = DecisionTreeClassifier()
dtc_cv_acc = GridSearchCV(dtc, param_grid = grid_values, cv = 10, verbose = 1, scoring = 'accuracy')
dtc_cv_acc.fit(x_train, y_train)

#show the first 10 ccp_alpha values with their cross-validation accuracy
acc = dtc_cv_acc.cv_results_['mean_test_score']
ccp = dtc_cv_acc.cv_results_['param_ccp_alpha'].data
pd.DataFrame({'ccp alpha': ccp, 'Validation Accuracy': acc}).head(10)

#since the granularity of the cross-validation above is not optimal,
#I've set the assumed optimal ccp_alpha to 0.0005 (between 0.0 and 0.01 & from example in lab)
#run the model on test set
dtc_test = DecisionTreeClassifier(min_samples_leaf=5, ccp_alpha=0.0005, random_state=88)
dtc_test = dtc_test.fit(x_train, y_train)
y_pred_cart = dtc_test.predict(x_test)
cart_confusion = confusion_matrix(y_test, y_pred_cart).ravel()
accuracy(cart_confusion)

The accuracy of the CART model is 0.527. Note that the chosen ccp_alpha may not be optimal, because the cross-validation grid built with linspace is coarser than the default setting.
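One way to get candidate ccp_alpha values at the right scale without a very fine linspace grid is to take them from the tree's own cost-complexity pruning path. The sketch below is only an illustration on synthetic data: `X_demo`, `y_demo`, the cap of 15 candidates, and the 5-fold setting are assumptions, not values from this assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=88)

# Effective alphas at which nodes of the fully grown tree would be pruned.
base_tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=88)
path = base_tree.cost_complexity_pruning_path(X_demo, y_demo)

# Keep unique, non-negative values and thin them to at most 15 candidates
# so the number of fits stays small.
alphas = np.unique(np.clip(path.ccp_alphas, 0, None))
if len(alphas) > 15:
    alphas = alphas[np.linspace(0, len(alphas) - 1, 15).astype(int)]

search = GridSearchCV(
    DecisionTreeClassifier(min_samples_leaf=5, random_state=88),
    param_grid={"ccp_alpha": alphas},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_["ccp_alpha"], search.best_score_)
```

Because the candidates come from the data, they tend to cluster in the small-alpha region where an evenly spaced grid from 0 to 0.10 has only one or two points.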

Random Forests

x_train.info()

#omit cross-validation for sake of time
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf = rf.fit(x_train, y_train)

y_pred_rf = rf.predict(x_test)
rf_confusion = confusion_matrix(y_test, y_pred_rf).ravel()
accuracy(rf_confusion)

The accuracy of the Random Forest model is 0.500, the lowest so far. However, it is too early to rule this model out, since the models have not yet been validated with bootstrapping.

Boosting

from sklearn.ensemble import GradientBoostingClassifier
gbr = GradientBoostingClassifier(n_estimators=3300, max_leaf_nodes=10)
gbr = gbr.fit(x_train, y_train)

y_pred_boost = gbr.predict(x_test)
boost_confusion = confusion_matrix(y_test, y_pred_boost).ravel()
accuracy(boost_confusion)

The accuracy of the Boosting model in this trial is 0.514.

Bootstrapping for final model

After scoring the models by accuracy, I came to believe that TPR may be the better metric for this scenario, since we aim to maximize the number of useful questions shown at the top. Below is my attempt to apply it. Note that the visualizations are missing because I wasn't able to resolve the error "bootstrap_validation() got multiple values for argument 'metrics_list'" (shown under the Bootstrap for LDA section). The number of bootstrap samples is reduced to 500 to keep the run time manageable. The code for all models is still shown here.

#code from Lab 8b for bootstrapping
import time

def bootstrap_validation(test_data, test_label, model, metrics_list, sample=500, random_state=66):
    tic = time.time()
    n_sample = sample
    n_metrics = len(metrics_list)
    output_array = np.zeros([n_sample, n_metrics])
    output_array[:] = np.nan
    print(output_array.shape)
    for bs_iter in range(n_sample):
        bs_index = np.random.choice(test_data.index, len(test_data.index), replace=True)
        bs_data = test_data.loc[bs_index]
        bs_label = test_label.loc[bs_index]
        bs_predicted = model.predict(bs_data)
        for metrics_iter in range(n_metrics):
            metrics = metrics_list[metrics_iter]
            output_array[bs_iter, metrics_iter] = metrics(bs_predicted, bs_label)
    output_df = pd.DataFrame(output_array)
    return output_df

#create functions for the metrics I am using (TPR and FPR)
def tpr(predictions):
    #.ravel() flattens the 2x2 confusion matrix into tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
    return tp/(tp + fn)

def fpr(predictions):
    tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
    return fp/(fp + tn)
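The error message quoted earlier is consistent with how the function is called in the bootstrap sections: the calls pass four positional arguments (x_test, y_test, y_train, and the model), so the model lands in the metrics_list slot and the keyword metrics_list = [tpr] then supplies that argument a second time. Two other mismatches would probably surface once that is resolved: the loop calls each metric with two arguments while tpr accepts one, and the statsmodels Logit result returns probabilities from predict rather than class labels. The following is a hedged sketch of one possible shape for a working version, run on synthetic data. The names `tpr_metric`, `bootstrap_validation_sketch`, and `lda_demo` are hypothetical, and this is not the Lab 8b code.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix


def tpr_metric(labels, predictions):
    """True positive rate, TP / (TP + FN), for the resampled labels passed in."""
    tn, fp, fn, tp = confusion_matrix(labels, predictions, labels=[0, 1]).ravel()
    return tp / (tp + fn)


def bootstrap_validation_sketch(test_data, test_label, model, metrics_list,
                                sample=500, random_state=66):
    """Resample the test set with replacement and score each resample."""
    rng = np.random.default_rng(random_state)
    output = np.full((sample, len(metrics_list)), np.nan)
    n = len(test_data)
    for i in range(sample):
        idx = rng.integers(0, n, size=n)  # positional row indices, with replacement
        bs_data = test_data.iloc[idx]
        bs_label = test_label.iloc[idx]
        bs_pred = model.predict(bs_data)
        for j, metric in enumerate(metrics_list):
            output[i, j] = metric(bs_label, bs_pred)
    return pd.DataFrame(output)


# Synthetic stand-in for the homework data.
X_arr, y_arr = make_classification(n_samples=400, n_features=10, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
y = pd.Series(y_arr)
X_tr, X_te, y_tr, y_te = X.iloc[:300], X.iloc[300:], y.iloc[:300], y.iloc[300:]

lda_demo = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

# Only three positional arguments: data, labels, model. No training labels are passed.
bs_demo = bootstrap_validation_sketch(X_te, y_te, lda_demo,
                                      metrics_list=[tpr_metric], sample=200)
print(bs_demo.describe())
```

In this sketch the metric receives the resampled labels explicitly instead of reading the global y_test, so each score is computed against the rows that were actually drawn. A model whose predict returns probabilities would need a thin wrapper that thresholds them before it is passed in.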

Bootstrap for Logistic Regression

bs_output_logreg = bootstrap_validation(x_test, y_test, y_train, logreg, metrics_list = [tpr], sample = 500)

#TPR distribution for Logistic Regression
logreg_tpr = tpr(y_pred_logreg)
fig, axs = plt.subplots(ncols=2, figsize=(12,5))
axs[1].set_xlabel('Logreg TPR - Test Set TPR', fontsize = 16)
axs[0].set_ylabel('Count', fontsize = 16)
axs[0].hist(bs_output_logreg.iloc[:, 0], bins = 20, edgecolor = 'green', linewidth = 2, color = 'grey')
axs[0].set_xlim([0.4, 0.7])
axs[1].hist(bs_output_logreg.iloc[:,0] - logreg_tpr, bins = 20, edgecolor = 'green', linewidth = 2, color = 'grey')
axs[1].set_xlim([-0.15, 0.15])

Bootstrap for LDA

bs_output_lda = bootstrap_validation(x_test, y_test, y_train, lda, metrics_list = [tpr], sample = 500)

#TPR distribution for LDA
lda_tpr = tpr(y_pred_lda)
fig, axs = plt.subplots(ncols=2, figsize=(12,5))
axs[1].set_xlabel('LDA TPR - Test Set TPR', fontsize = 16)
axs[0].set_ylabel('Count', fontsize = 16)
axs[0].hist(bs_output_lda.iloc[:, 0], bins = 20, edgecolor = 'green', linewidth = 2, color = 'grey')
axs[0].set_xlim([0.4, 0.7])
axs[1].hist(bs_output_lda.iloc[:,0] - lda_tpr, bins = 20, edgecolor = 'green', linewidth = 2, color = 'grey')
axs[1].set_xlim([-0.15, 0.15])

Bootstrap for CART

bs_output_cart = bootstrap_validation(x_test, y_test, y_train, dtc_test, metrics_list = [tpr], sample = 500)

#TPR distribution for CART
cart_tpr = tpr(y_pred_cart)
fig, axs = plt.subplots(ncols=2, figsize=(12,5))
axs[1].set_xlabel('CART TPR - Test Set TPR', fontsize = 16)
axs[0].set_ylabel('Count', fontsize = 16)
axs[0].hist(bs_output_cart.iloc[:, 0], bins = 20, edgecolor = 'green', linewidth = 2, color = 'grey')
axs[0].set_xlim([0.4, 0.7])
axs[1].hist(bs_output_cart.iloc[:,0] - cart_tpr, bins = 20, edgecolor = 'green', linewidth = 2, color = 'grey')
axs[1].set_xlim([-0.15, 0.15])

Bootstrap for RF

bs_output_rf = bootstrap_validation(x_test, y_test, y_train, rf, metrics_list = [tpr], sample = 500)

#TPR distribution for RF
rf_tpr = tpr(y_pred_rf)
fig, axs = plt.subplots(ncols=2, figsize=(12,5))
axs[1].set_xlabel('Random Forest TPR - Test Set TPR', fontsize = 16)
axs[0].set_ylabel('Count', fontsize = 16)
axs[0].hist(bs_output_rf.iloc[:, 0], bins = 20, edgecolor = 'green', linewidth = 2, color = 'grey')
axs[0].set_xlim([0.4, 0.7])
axs[1].hist(bs_output_rf.iloc[:,0] - rf_tpr, bins = 20, edgecolor = 'green', linewidth = 2, color = 'grey')
axs[1].set_xlim([-0.15, 0.15])

Bootstrap for Boosting

bs_output_boost = bootstrap_validation(x_test, y_test, y_train, gbr, metrics_list = [tpr], sample = 500)

#TPR distribution for Boosting
boosting_tpr = tpr(y_pred_boost)
fig, axs = plt.subplots(ncols=2, figsize=(12,5))
axs[1].set_xlabel('Boosting TPR - Test Set TPR', fontsize = 16)
axs[0].set_ylabel('Count', fontsize = 16)
axs[0].hist(bs_output_boost.iloc[:, 0], bins = 20, edgecolor = 'green', linewidth = 2, color = 'grey')
axs[0].set_xlim([0.4, 0.7])
axs[1].hist(bs_output_boost.iloc[:,0] - boosting_tpr, bins = 20, edgecolor = 'green', linewidth = 2, color = 'grey')
axs[1].set_xlim([-0.15, 0.15])

If the cells above had worked, I would be able to see the TPR distribution of every model considered in this assignment, compare their means and standard deviations, and potentially get an intuition for which model has the highest TPR and thus the highest quality.

Part C: Best Model

In Part C, I have included the code that calculates the 95% confidence intervals of the distributions derived above. This would help determine which models have non-overlapping intervals, in which case it is safe to say that their quality distributions differ from one another. Unfortunately, the calculations are again incomplete, because the code requires values that were to be computed in the previous section. I apologize for the inconvenience this causes for grading.

#95% CI of Logistic Regression
CI_0_logreg = np.quantile(bs_output_logreg.iloc[:,0] - logreg_tpr, np.array([0.025, 0.975]))
CI_logreg = [logreg_tpr - CI_0_logreg[1], logreg_tpr - CI_0_logreg[0]]
print("The 95-percent confidence interval of Logistic Regression TPR is %s" % CI_logreg)

#95% CI of LDA
CI_0_lda = np.quantile(bs_output_lda.iloc[:,0] - lda_tpr, np.array([0.025, 0.975]))
CI_lda = [lda_tpr - CI_0_lda[1], lda_tpr - CI_0_lda[0]]
print("The 95-percent confidence interval of LDA TPR is %s" % CI_lda)

#95% CI of CART
CI_0_cart = np.quantile(bs_output_cart.iloc[:,0] - cart_tpr, np.array([0.025, 0.975]))
CI_cart = [cart_tpr - CI_0_cart[1], cart_tpr - CI_0_cart[0]]
print("The 95-percent confidence interval of CART TPR is %s" % CI_cart)

#95% CI of RF
CI_0_rf = np.quantile(bs_output_rf.iloc[:,0] - rf_tpr, np.array([0.025, 0.975]))
CI_rf = [rf_tpr - CI_0_rf[1], rf_tpr - CI_0_rf[0]]
print("The 95-percent confidence interval of Random Forest TPR is %s" % CI_rf)

#95% CI of Boosting
#the test-set TPR for Boosting was stored as boosting_tpr in the Bootstrap for Boosting section
CI_0_boost = np.quantile(bs_output_boost.iloc[:,0] - boosting_tpr, np.array([0.025, 0.975]))
CI_boost = [boosting_tpr - CI_0_boost[1], boosting_tpr - CI_0_boost[0]]
print("The 95-percent confidence interval of Boosting TPR is %s" % CI_boost)

Though I do not have the actual 95% CI values for the models as I had hoped, I expect it to be very difficult to identify a specific model that best accomplishes the goal, because the accuracies of the models are fairly close to one another (0.500 to 0.529). If the 95% CIs overlap, it is most likely impossible to conclude that one model works best. However, the difference may be more visible when TPR is used as the metric instead of accuracy.
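The five CI cells above repeat the same formula, so the interval and the overlap check could be factored into two small helpers. The sketch below uses made-up bootstrap draws for two hypothetical models, not results from this assignment; `pivotal_ci` and `intervals_overlap` are illustrative names.

```python
import numpy as np


def pivotal_ci(bs_values, point_estimate, level=0.95):
    """Basic (pivotal) bootstrap interval, using the same formula as the CI cells above."""
    lo_q = (1 - level) / 2
    deltas = np.quantile(np.asarray(bs_values) - point_estimate, [lo_q, 1 - lo_q])
    return point_estimate - deltas[1], point_estimate - deltas[0]


def intervals_overlap(ci_a, ci_b):
    """True when two closed intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Made-up bootstrap TPR draws for two hypothetical models.
rng = np.random.default_rng(0)
bs_a = rng.normal(0.55, 0.02, size=500)
bs_b = rng.normal(0.56, 0.02, size=500)

ci_a = pivotal_ci(bs_a, 0.55)
ci_b = pivotal_ci(bs_b, 0.56)
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

Overlapping intervals are only a rough screen. Bootstrapping the difference in TPR between two models on the same resamples would be a more direct comparison, though that goes beyond what the cells above attempt.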

Among the models measured by accuracy, CART has one of the highest scores (0.527). In the cells below, I quantify the quality of this model using TPR after tuning the ccp_alpha value.

#cross-validation to get the optimal ccp alpha
#ccp_alpha range changed to 0 - 0.05 to accommodate for lack of granularity
#fits reduced from 2010 (demonstrated in lab) to 110 & max_depth from 30 to 20 for sake of time
grid_values = {'ccp_alpha': np.linspace(0, 0.05, 11),
               'min_samples_leaf': [5],
               'min_samples_split': [20],
               'max_depth': [20],
               'random_state': [88]}
dtc_c = DecisionTreeClassifier()
dtc_cv_acc_c = GridSearchCV(dtc_c, param_grid = grid_values, cv = 10, verbose = 1, scoring = 'recall')
dtc_cv_acc_c.fit(x_train, y_train)

#show the first 10 ccp_alpha values with their cross-validation TPR (scoring = 'recall')
acc_c = dtc_cv_acc_c.cv_results_['mean_test_score']
ccp_c = dtc_cv_acc_c.cv_results_['param_ccp_alpha'].data
pd.DataFrame({'ccp alpha': ccp_c, 'Validation TPR': acc_c}).head(10)

Unfortunately, cross-validation on CART yields a very low TPR, as shown above. Changing direction, I decided to examine the model with the lowest accuracy in the previous section: random forest. In the cells below, I run a small cross-validation on Random Forest to see how its quality changes.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

grid_values = {'max_features': np.linspace(1, 10, 10, dtype='int32'),
               'min_samples_leaf': [5],
               'n_estimators': [500],
               'random_state': [88]}
rf_c = RandomForestClassifier()
cv_c = KFold(n_splits=5, random_state=333, shuffle=True)
rf_cv_c = GridSearchCV(rf_c, param_grid=grid_values, scoring='recall', cv=cv_c, verbose=2)
rf_cv_c.fit(x_train, y_train)

#select the best number of features
max_features = rf_cv_c.cv_results_['param_max_features'].data
scores = rf_cv_c.cv_results_['mean_test_score']
plt.figure(figsize=(8,6))
plt.xlabel('max features', fontsize = 16)
plt.ylabel('CV TPR', fontsize = 16)
plt.scatter(max_features, scores, s = 30)
plt.grid(True, which = 'both')
plt.xlim([1, 11])
plt.ylim([0.4, 1.0])

scores

To limit the running time, I restricted max_features to the range 1 to 10. This gives only a coarse picture of how the model improves across hyperparameter values. With that said, the highest CV TPR for the Random Forest model is 0.858. The change in TPR is shown above as the result of calling 'scores'. Because the model with the largest max_features value has the highest TPR, we do not know whether this is the highest achievable score; ideally, we would extend the cross-validation to larger values (say, up to 50 features) or until the CV TPR starts to decline. Though I haven't tuned the other models, this random forest model with max_features = 10 may be a strong candidate for the best model.
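A wider but sparser max_features grid is one way to look for the point where CV TPR stops improving without a large increase in the number of fits. The sketch below runs on synthetic data with deliberately small settings; the grid values, the 3 folds, and n_estimators=50 are assumptions chosen for speed, not recommendations for the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for x_train / y_train, with 60 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=60,
                                     n_informative=10, random_state=88)

# Sparse grid that reaches 50 features with only six candidate values.
grid = {"max_features": [1, 5, 10, 20, 35, 50], "min_samples_leaf": [5]}
cv = KFold(n_splits=3, shuffle=True, random_state=333)
rf_search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=88),
    param_grid=grid,
    scoring="recall",
    cv=cv,
)
rf_search.fit(X_demo, y_demo)

# Print CV TPR per candidate to see whether it is still rising at the top of the grid.
for mf, score in zip(rf_search.cv_results_["param_max_features"].data,
                     rf_search.cv_results_["mean_test_score"]):
    print(mf, round(score, 3))
```

If the best score still sits at the largest value in the grid, the grid can be extended again; if it sits in the interior, a finer grid around that value would be the natural next step.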
