Part B
To keep the kernel from dying, I preprocessed 'stack_stats_final.csv' from Part A outside this notebook and uploaded CSV files that are ready to use in Part B. The steps I took to produce the 'train.csv' and 'test.csv' loaded below were: 1) retokenize 'stack_stats_final.csv'; 2) add an 'isUseful' column, a binary column where 1 indicates that the question is useful ('Score' greater than or equal to 1); 3) split the data into 'train.csv' and 'test.csv' based on the ID values of the provided train and test datasets.
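Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration and not the exact preprocessing code that was run: the ID column name 'Id', the helper names, and the tiny inline sample are assumptions, and the retokenizing step is omitted.

```python
import csv
import io

def add_is_useful(rows, score_key="Score"):
    """Return copies of the rows with a binary 'isUseful' column (1 if Score >= 1)."""
    labeled = []
    for row in rows:
        new_row = dict(row)
        new_row["isUseful"] = 1 if float(row[score_key]) >= 1 else 0
        labeled.append(new_row)
    return labeled

def split_by_id(rows, train_ids, test_ids, id_key="Id"):
    """Split rows into train and test lists by membership of their ID."""
    train = [r for r in rows if r[id_key] in train_ids]
    test = [r for r in rows if r[id_key] in test_ids]
    return train, test

# Made-up sample standing in for 'stack_stats_final.csv' (hypothetical values).
sample_csv = "Id,Score\n1,3\n2,0\n3,-1\n4,1\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))

labeled = add_is_useful(rows)
# The ID sets would come from the provided train and test datasets.
train, test = split_by_id(labeled, train_ids={"1", "2", "3"}, test_ids={"4"})
```

Each resulting list could then be written out with `csv.DictWriter` to produce 'train.csv' and 'test.csv'.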
Logistic Regression
The accuracy of Logistic Regression using all features in the dataframe is 0.528. I included all features in this section because the words from the questions were already filtered in Part A.
LDA
The accuracy of the LDA model in this trial is 0.529, slightly higher than that of Logistic Regression (0.528).
CART
The accuracy of the CART model is 0.527. Note that the selected ccp_alpha may not be the true optimum, because the linspace grid used for cross-validation is coarser than the default setting.
Random Forests
The accuracy of the Random Forest model is 0.500, the lowest so far. However, it is too early to rule this model out, since the models have not yet been validated with bootstrapping.
Boosting
The accuracy of the Boosting model in this trial is 0.514.
Bootstrapping for final model
After measuring the quality of the models with accuracy, I came to believe that TPR may be the better metric for this scenario, since we aim to maximize the number of useful questions appearing at the top. Below is my attempt to apply that. Note that the visualizations are not present because I wasn't able to resolve the error "bootstrap_validation() got multiple values for argument 'metrics_list'" (shown under the Bootstrap for LDA section). The sample size is reduced to 500 to keep the notebook from running too long. The code for all models is still shown here.
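For reference, Python raises this kind of TypeError when the same parameter receives a value both positionally and by keyword. The sketch below reproduces the message with a hypothetical stand-in for bootstrap_validation; the real helper's signature is not shown here, so the parameter order and the string metric names are assumptions, and this may not be the actual cause in the notebook.

```python
def bootstrap_validation(X, y, metrics_list, sample=500):
    """Hypothetical stand-in; the notebook's real helper may have a different signature."""
    return {"n_metrics": len(metrics_list), "sample": sample}

X, y = [[0], [1]], [0, 1]

# The third positional argument already binds to metrics_list,
# so also passing metrics_list=... as a keyword collides with it.
try:
    bootstrap_validation(X, y, [0, 1], metrics_list=["tpr"])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# One way to avoid the collision: pass everything after X and y by keyword.
result = bootstrap_validation(X, y, metrics_list=["tpr"], sample=500)
```

If this is the cause, comparing the call's positional arguments against the helper's `def` line usually shows which extra argument has shifted into the `metrics_list` slot.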
Bootstrap for Logistic Regression
Bootstrap for LDA
Bootstrap for CART
Bootstrap for RF
Bootstrap for Boosting
If the cells above worked, I would be able to see the distribution of TPR for each model considered in this assignment, compare their means and standard deviations, and get an intuition for which model may have the highest TPR and thus the highest quality.
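As an illustration of what those cells were meant to produce, here is a minimal bootstrap of TPR written with only the standard library. It is a sketch under stated assumptions, not the assignment's bootstrap_validation helper: the function names, the number of resamples, and the randomly generated labels and predictions are all made up for the example.

```python
import random
import statistics

def tpr(y_true, y_pred):
    """True positive rate: TP / (TP + FN); returns 0.0 if there are no positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def bootstrap_tpr(y_true, y_pred, n_boot=1000, sample_size=500, seed=0):
    """Resample (label, prediction) pairs with replacement and collect the TPR of each resample."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(sample_size)]
        scores.append(tpr([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return scores

# Made-up test labels and predictions (a classifier that is right about 70% of the time).
rng = random.Random(42)
y_true = [rng.randint(0, 1) for _ in range(2000)]
y_pred = [t if rng.random() < 0.7 else 1 - t for t in y_true]

scores = bootstrap_tpr(y_true, y_pred)
mean_tpr = statistics.mean(scores)
sd_tpr = statistics.stdev(scores)
```

Running the same procedure on each model's test-set predictions would give one TPR distribution per model, which can then be compared by mean and standard deviation.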
Part C: Best Model
In Part C, I have included the code that calculates 95% confidence intervals for the distributions derived above. This would help determine which models have non-overlapping intervals, in which case it is reasonable to say that their quality distributions differ from one another. Again, unfortunately, the calculations are not complete because the code requires values that were to be computed in the previous section. I apologize for the inconvenience this causes for grading.
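The interval calculation and the overlap check can be sketched as below, assuming a simple percentile interval over bootstrap scores. The helper names and the evenly spaced example values are illustrative assumptions, not results from this notebook.

```python
def quantile(ordered, q):
    """Linearly interpolated quantile of an already sorted list, with 0 <= q <= 1."""
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

def percentile_ci(values, level=0.95):
    """Percentile confidence interval: cut (1 - level) / 2 from each tail."""
    ordered = sorted(values)
    alpha = (1 - level) / 2
    return quantile(ordered, alpha), quantile(ordered, 1 - alpha)

def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Made-up bootstrap scores for two hypothetical models.
model_a_scores = [0.50 + 0.001 * i for i in range(101)]  # 0.50 to 0.60
model_b_scores = [0.70 + 0.001 * i for i in range(101)]  # 0.70 to 0.80

ci_a = percentile_ci(model_a_scores)
ci_b = percentile_ci(model_b_scores)
distinct = not intervals_overlap(ci_a, ci_b)
```

Non-overlapping intervals are a conservative sign that two models differ; overlapping intervals leave the comparison inconclusive rather than proving the models are equal.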
Though I do not have the actual 95% CI values for the models as I had hoped, I expect it to be very difficult to identify a single model that best accomplishes the goal, because the accuracies of the models are fairly close to one another. If the 95% CIs overlap, it is most likely impossible to conclude that one model works best. However, the differences may be more visible when TPR is used as the metric instead of accuracy.
Among the models measured by accuracy, CART has one of the highest accuracies (0.527). In the cells below, I quantify the quality of the model using TPR after optimizing the ccp_alpha value.
Unfortunately, cross-validation on CART results in a very low TPR, as shown above. Changing direction, I decided to examine the model with the lowest accuracy in the previous section: Random Forest. In the cells below, I attempt cross-validation on a small sample with Random Forest to see how its quality changes.
To limit the running time, I restricted the number of features to the range from 1 to 10. This limits how precisely we can see the improvement across hyperparameter values. With that said, the highest cross-validated TPR for the Random Forest model is 0.858. The change in TPR is shown above in the output of 'scores'. Since the highest TPR occurs at the largest number of features tried (10), we cannot tell whether this is the highest achievable score; ideally, we would run cross-validation with a larger number of features (say, up to 50) or until we observe a decline in the CV TPR. Though I haven't optimized every other model, this Random Forest model with 10 features may be a strong candidate for the best model.
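The concern about the best score sitting at the edge of the search range can be checked mechanically. The sketch below is only an illustration: the helper names are hypothetical, and apart from the 0.858 at 10 features, the TPR values are placeholders, not results from the notebook.

```python
def best_param(scores_by_param):
    """Return the hyperparameter value with the highest cross-validated score."""
    return max(scores_by_param, key=scores_by_param.get)

def best_is_on_upper_edge(scores_by_param):
    """True if the best score occurs at the largest value tried, so the grid may be too narrow."""
    return best_param(scores_by_param) == max(scores_by_param)

# Placeholder CV TPR values keyed by number of features; only 0.858 at 10 comes from the text.
cv_tpr = {1: 0.60, 2: 0.66, 3: 0.70, 4: 0.74, 5: 0.77,
          6: 0.79, 7: 0.81, 8: 0.83, 9: 0.85, 10: 0.858}

chosen = best_param(cv_tpr)
needs_wider_grid = best_is_on_upper_edge(cv_tpr)

# If the best value is on the edge, extend the search, e.g. up to 50 features.
next_grid = list(range(1, 51)) if needs_wider_grid else sorted(cv_tpr)
```

When `needs_wider_grid` is true, the selected value is best read as a lower bound on the useful number of features rather than as the optimum.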