Part B
To keep the kernel from dying, I preprocessed 'stack_stats_final.csv' from Part A outside this notebook and uploaded CSV files that are ready to use in Part B. These are the steps I took to produce the 'train.csv' and 'test.csv' files used below: 1) retokenized 'stack_stats_final.csv'; 2) added 'isUseful', a binary column whose value 1 indicates that the question is useful ('Score' greater than or equal to 1); 3) split the rows into 'train.csv' and 'test.csv' according to the ID values of the provided train and test datasets.
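Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration on toy data, not the actual preprocessing script: the column names 'Id' and 'Score', the toy rows, and the ID lists are assumptions standing in for the real files, and the retokenizing step is omitted.

```python
import pandas as pd

# Toy stand-in for 'stack_stats_final.csv' (column names are assumed).
posts = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "Score": [3, 0, 1, -2, 5, 0],
})

# Step 2: 'isUseful' is 1 when 'Score' >= 1 and 0 otherwise.
posts["isUseful"] = (posts["Score"] >= 1).astype(int)

# Step 3: split by the IDs of the provided train and test datasets
# (hypothetical ID lists here).
train_ids = [1, 2, 3, 4]
test_ids = [5, 6]

train = posts[posts["Id"].isin(train_ids)].reset_index(drop=True)
test = posts[posts["Id"].isin(test_ids)].reset_index(drop=True)

# The real workflow would then write the two files, for example:
# train.to_csv("train.csv", index=False)
# test.to_csv("test.csv", index=False)
```

Splitting by ID membership rather than by row position keeps the split aligned with the provided datasets even if the rows were reordered during retokenizing.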
Logistic Regression
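In the summary below, several tag features report nan or extremely large standard errors, and some pairs (for example monte_tags and carlo_tags) share exactly the same coefficient. That pattern is consistent with feature columns that are identical or perfectly collinear, which can happen when a hyphenated tag is split into tokens that always occur together. The following is a minimal sketch of one way to check for identical columns before fitting; the helper name `identical_column_groups` and the toy data are hypothetical, and this is not part of the original analysis.

```python
import pandas as pd

def identical_column_groups(X):
    """Return lists of column names whose values match in every row."""
    groups = {}
    for name in X.columns:
        # Use the full column contents as the grouping key.
        key = tuple(X[name].tolist())
        groups.setdefault(key, []).append(name)
    return [names for names in groups.values() if len(names) > 1]

# Toy design matrix in which two pairs of tag tokens always co-occur.
X = pd.DataFrame({
    "monte_tags": [1, 0, 0, 1, 0],
    "carlo_tags": [1, 0, 0, 1, 0],
    "python_tags": [0, 1, 0, 0, 1],
    "scikit_tags": [0, 0, 1, 0, 0],
    "learn_tags": [0, 0, 1, 0, 0],
})

groups = identical_column_groups(X)

# Keep the first column of each group and drop the rest.
to_drop = [name for names in groups for name in names[1:]]
X_reduced = X.drop(columns=to_drop)
```

This check only catches exact duplicates. More general linear dependence among columns would need a different test, such as comparing the rank of the design matrix with its number of columns.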
Requirement already satisfied: statsmodels==0.13.0 in /root/venv/lib/python3.7/site-packages (0.13.0)
Optimization terminated successfully.
Current function value: 0.671761
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable:               isUseful   No. Observations:                76990
Model:                          Logit   Df Residuals:                    76380
Method:                           MLE   Df Model:                          609
Date:                Sat, 27 Nov 2021   Pseudo R-squ.:                 0.03077
Time:                        07:25:54   Log-Likelihood:                -51719.
converged:                       True   LL-Null:                       -53361.
Covariance Type:            nonrobust   LLR p-value:                     0.000
===========================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------
abl_body -0.0049 0.030 -0.162 0.871 -0.064 0.054
actual_body -0.0576 0.022 -2.593 0.010 -0.101 -0.014
advanc_body -0.0141 0.038 -0.372 0.710 -0.089 0.060
algorithm_body 0.0135 0.017 0.771 0.441 -0.021 0.048
also_body 0.0384 0.013 3.026 0.002 0.014 0.063
analysi_body -0.0124 0.016 -0.791 0.429 -0.043 0.018
anoth_body -0.0417 0.014 -2.974 0.003 -0.069 -0.014
answer_body 0.0630 0.017 3.815 0.000 0.031 0.095
anyon_body 0.0752 0.031 2.432 0.015 0.015 0.136
appli_body 0.0081 0.019 0.420 0.675 -0.030 0.046
appreci_body 0.0154 0.027 0.567 0.571 -0.038 0.069
approach_body -0.0173 0.013 -1.306 0.192 -0.043 0.009
ask_body -0.0377 0.020 -1.908 0.056 -0.076 0.001
assum_body 0.0535 0.017 3.095 0.002 0.020 0.087
averag_body 0.0363 0.016 2.335 0.020 0.006 0.067
base_body -0.0470 0.018 -2.558 0.011 -0.083 -0.011
best_body 0.0198 0.018 1.100 0.271 -0.015 0.055
better_body 0.0324 0.021 1.575 0.115 -0.008 0.073
calcul_body -0.0197 0.011 -1.720 0.085 -0.042 0.003
cant_body 0.0590 0.030 1.996 0.046 0.001 0.117
case_body 0.0052 0.010 0.514 0.608 -0.015 0.025
chang_body 0.0394 0.014 2.773 0.006 0.012 0.067
class_body -0.0259 0.009 -2.756 0.006 -0.044 -0.007
code_body -0.0130 0.014 -0.904 0.366 -0.041 0.015
coeffici_body 0.0111 0.015 0.764 0.445 -0.017 0.040
come_body -0.0102 0.026 -0.391 0.696 -0.061 0.041
compar_body 0.0138 0.015 0.894 0.372 -0.017 0.044
comput_body 0.0318 0.015 2.088 0.037 0.002 0.062
condit_body 0.0156 0.012 1.281 0.200 -0.008 0.040
confus_body 0.0427 0.027 1.585 0.113 -0.010 0.096
consid_body 0.0464 0.019 2.433 0.015 0.009 0.084
continu_body 0.0064 0.019 0.342 0.732 -0.030 0.043
correct_body 0.0101 0.016 0.614 0.539 -0.022 0.042
correl_body -0.0063 0.012 -0.548 0.584 -0.029 0.016
could_body -0.0071 0.014 -0.495 0.621 -0.035 0.021
covari_body 0.0331 0.015 2.155 0.031 0.003 0.063
creat_body -0.1283 0.017 -7.366 0.000 -0.162 -0.094
current_body 0.0587 0.026 2.294 0.022 0.009 0.109
data_body -0.0034 0.005 -0.677 0.498 -0.013 0.006
dataset_body -0.0410 0.011 -3.682 0.000 -0.063 -0.019
defin_body 0.0156 0.022 0.725 0.468 -0.027 0.058
depend_body -0.0497 0.017 -2.986 0.003 -0.082 -0.017
determin_body 0.0594 0.023 2.561 0.010 0.014 0.105
differ_body -0.0274 0.008 -3.441 0.001 -0.043 -0.012
distribut_body -0.0026 0.006 -0.413 0.680 -0.015 0.010
dont_body -0.0129 0.018 -0.711 0.477 -0.049 0.023
effect_body 0.0230 0.010 2.414 0.016 0.004 0.042
eg_body 0.0229 0.020 1.141 0.254 -0.016 0.062
equal_body 0.0128 0.021 0.616 0.538 -0.028 0.053
error_body 0.0057 0.009 0.618 0.537 -0.012 0.024
estim_body -0.0093 0.008 -1.174 0.240 -0.025 0.006
etc_body -0.0358 0.026 -1.354 0.176 -0.088 0.016
even_body -0.0139 0.017 -0.805 0.421 -0.048 0.020
exampl_body 0.0401 0.012 3.244 0.001 0.016 0.064
expect_body -0.0138 0.017 -0.829 0.407 -0.046 0.019
explain_body -0.0130 0.023 -0.563 0.574 -0.058 0.032
factor_body -0.0134 0.012 -1.148 0.251 -0.036 0.009
far_body 0.0903 0.030 3.050 0.002 0.032 0.148
featur_body -0.0074 0.009 -0.859 0.390 -0.024 0.009
find_body 0.0030 0.014 0.215 0.829 -0.024 0.030
first_body 0.0163 0.016 1.032 0.302 -0.015 0.047
fit_body -0.0023 0.012 -0.193 0.847 -0.026 0.021
fix_body -0.0307 0.019 -1.625 0.104 -0.068 0.006
follow_body -0.0035 0.012 -0.286 0.775 -0.027 0.020
formula_body 0.0537 0.021 2.593 0.010 0.013 0.094
found_body 0.0381 0.023 1.646 0.100 -0.007 0.083
function_body 0.0297 0.009 3.385 0.001 0.013 0.047
gener_body 0.0568 0.013 4.454 0.000 0.032 0.082
get_body -0.0135 0.011 -1.189 0.235 -0.036 0.009
give_body 0.0572 0.019 2.987 0.003 0.020 0.095
given_body 0.0148 0.015 0.996 0.319 -0.014 0.044
go_body 0.0217 0.024 0.901 0.368 -0.026 0.069
good_body -0.0647 0.020 -3.319 0.001 -0.103 -0.027
group_body -0.0040 0.007 -0.595 0.552 -0.017 0.009
help_body -0.0146 0.019 -0.754 0.451 -0.053 0.023
howev_body 0.0581 0.016 3.715 0.000 0.027 0.089
id_body 0.0105 0.018 0.599 0.549 -0.024 0.045
idea_body 0.0116 0.025 0.454 0.650 -0.038 0.061
ie_body 0.0921 0.018 5.076 0.000 0.057 0.128
im_body 0.0031 0.010 0.312 0.755 -0.016 0.022
import_body 0.0057 0.013 0.423 0.672 -0.021 0.032
includ_body -0.0422 0.019 -2.256 0.024 -0.079 -0.006
independ_body 0.0258 0.015 1.665 0.096 -0.005 0.056
individu_body 0.0217 0.014 1.507 0.132 -0.007 0.050
inform_body 0.0492 0.019 2.638 0.008 0.013 0.086
input_body -0.0152 0.017 -0.875 0.381 -0.049 0.019
instead_body 0.0190 0.025 0.773 0.440 -0.029 0.067
interest_body -0.0684 0.020 -3.446 0.001 -0.107 -0.030
interpret_body 0.0226 0.020 1.125 0.261 -0.017 0.062
interv_body 0.0318 0.014 2.260 0.024 0.004 0.059
ive_body 0.0152 0.018 0.840 0.401 -0.020 0.051
know_body -0.0093 0.013 -0.734 0.463 -0.034 0.015
learn_body 0.0913 0.015 6.020 0.000 0.062 0.121
let_body 0.0084 0.016 0.515 0.606 -0.024 0.040
level_body -0.0277 0.012 -2.348 0.019 -0.051 -0.005
like_body -0.0115 0.011 -1.058 0.290 -0.033 0.010
linear_body -0.0012 0.013 -0.093 0.926 -0.027 0.025
look_body -0.0416 0.015 -2.800 0.005 -0.071 -0.012
make_body -0.0003 0.017 -0.015 0.988 -0.033 0.033
mani_body -0.0415 0.020 -2.083 0.037 -0.081 -0.002
matrix_body 0.0138 0.012 1.140 0.254 -0.010 0.038
may_body 0.0907 0.022 4.209 0.000 0.048 0.133
mean_body 0.0072 0.008 0.892 0.373 -0.009 0.023
measur_body -0.0378 0.009 -4.025 0.000 -0.056 -0.019
method_body 0.0283 0.010 2.806 0.005 0.009 0.048
might_body 0.0169 0.024 0.709 0.478 -0.030 0.064
model_body 0.0026 0.005 0.578 0.563 -0.006 0.011
much_body -0.0244 0.022 -1.089 0.276 -0.068 0.020
multipl_body 0.0060 0.020 0.305 0.760 -0.032 0.044
need_body -0.0409 0.016 -2.608 0.009 -0.072 -0.010
new_body -0.0316 0.018 -1.776 0.076 -0.066 0.003
normal_body -1.693e-05 0.009 -0.002 0.999 -0.018 0.018
number_body -0.0036 0.010 -0.370 0.712 -0.022 0.015
observ_body 0.0425 0.011 4.007 0.000 0.022 0.063
obtain_body 0.0059 0.024 0.251 0.802 -0.040 0.052
one_body 0.0128 0.009 1.405 0.160 -0.005 0.031
order_body -0.0018 0.017 -0.105 0.916 -0.034 0.031
output_body 0.0085 0.015 0.582 0.561 -0.020 0.037
packag_body 0.0044 0.020 0.221 0.825 -0.035 0.043
paper_body -0.0048 0.019 -0.256 0.798 -0.042 0.032
paramet_body -0.0159 0.011 -1.490 0.136 -0.037 0.005
perform_body 0.0226 0.014 1.599 0.110 -0.005 0.050
pleas_body 0.0379 0.028 1.342 0.180 -0.017 0.093
plot_body 0.0079 0.013 0.630 0.528 -0.017 0.032
point_body -0.0439 0.010 -4.214 0.000 -0.064 -0.023
posit_body -0.0132 0.016 -0.846 0.398 -0.044 0.017
possibl_body 0.0395 0.018 2.240 0.025 0.005 0.074
predict_body 0.0039 0.009 0.432 0.666 -0.014 0.022
probabl_body -0.0038 0.009 -0.411 0.681 -0.022 0.014
problem_body 0.0058 0.012 0.476 0.634 -0.018 0.030
process_body 0.0020 0.017 0.118 0.906 -0.031 0.034
question_body -0.0070 0.010 -0.676 0.499 -0.027 0.013
random_body -0.0012 0.010 -0.124 0.901 -0.021 0.018
read_body 0.0058 0.020 0.285 0.776 -0.034 0.046
realli_body 0.0094 0.026 0.361 0.718 -0.042 0.061
reason_body 0.0770 0.025 3.070 0.002 0.028 0.126
refer_body -0.0311 0.022 -1.407 0.159 -0.075 0.012
regress_body 0.0193 0.010 1.993 0.046 0.000 0.038
relat_body 0.0127 0.023 0.552 0.581 -0.032 0.058
repres_body 0.0189 0.023 0.821 0.412 -0.026 0.064
respons_body 0.0186 0.015 1.209 0.227 -0.012 0.049
result_body 0.0049 0.011 0.441 0.659 -0.017 0.026
right_body 0.0194 0.013 1.515 0.130 -0.006 0.044
run_body 0.0420 0.015 2.732 0.006 0.012 0.072
sampl_body 0.0039 0.006 0.600 0.548 -0.009 0.017
say_body -0.0103 0.015 -0.690 0.490 -0.039 0.019
score_body 0.0072 0.011 0.680 0.497 -0.013 0.028
second_body -0.0386 0.022 -1.787 0.074 -0.081 0.004
see_body 0.0073 0.016 0.464 0.643 -0.024 0.038
seem_body 0.0383 0.016 2.333 0.020 0.006 0.070
sens_body 0.0065 0.028 0.228 0.820 -0.049 0.062
seri_body -0.0301 0.013 -2.299 0.022 -0.056 -0.004
set_body -0.0178 0.009 -1.933 0.053 -0.036 0.000
show_body 0.0567 0.019 3.004 0.003 0.020 0.094
signific_body -0.0650 0.015 -4.288 0.000 -0.095 -0.035
similar_body -0.0211 0.020 -1.061 0.289 -0.060 0.018
simpl_body 0.0474 0.025 1.923 0.054 -0.001 0.096
sinc_body -0.0082 0.019 -0.435 0.663 -0.045 0.029
size_body 0.0014 0.012 0.111 0.912 -0.023 0.026
someon_body -0.0332 0.030 -1.120 0.263 -0.091 0.025
someth_body 0.0603 0.022 2.757 0.006 0.017 0.103
specif_body 0.0472 0.021 2.197 0.028 0.005 0.089
standard_body 0.0018 0.014 0.131 0.896 -0.025 0.028
start_body 0.0035 0.023 0.154 0.878 -0.041 0.048
statist_body 0.0307 0.013 2.362 0.018 0.005 0.056
still_body -0.0079 0.026 -0.306 0.760 -0.058 0.043
studi_body 0.0606 0.016 3.762 0.000 0.029 0.092
suggest_body 0.0174 0.025 0.688 0.492 -0.032 0.067
suppos_body 0.0808 0.022 3.685 0.000 0.038 0.124
sure_body -0.0040 0.021 -0.191 0.848 -0.045 0.037
take_body -0.0070 0.017 -0.408 0.683 -0.041 0.027
term_body 0.0094 0.015 0.635 0.526 -0.020 0.038
test_body -0.0036 0.007 -0.539 0.590 -0.017 0.010
thank_body -0.0773 0.021 -3.767 0.000 -0.118 -0.037
think_body -0.0043 0.018 -0.236 0.813 -0.040 0.031
thought_body 0.0399 0.028 1.406 0.160 -0.016 0.095
time_body 0.0182 0.005 3.424 0.001 0.008 0.029
total_body -0.0358 0.019 -1.849 0.065 -0.074 0.002
train_body -0.0340 0.010 -3.294 0.001 -0.054 -0.014
tri_body -0.0699 0.013 -5.286 0.000 -0.096 -0.044
true_body 0.0233 0.013 1.842 0.065 -0.001 0.048
two_body -0.0043 0.010 -0.418 0.676 -0.024 0.016
type_body -0.0500 0.015 -3.343 0.001 -0.079 -0.021
understand_body 0.0486 0.014 3.534 0.000 0.022 0.076
use_body -0.0139 0.006 -2.312 0.021 -0.026 -0.002
valid_body -0.0048 0.006 -0.741 0.459 -0.017 0.008
valu_body -0.0123 0.007 -1.871 0.061 -0.025 0.001
variabl_body 0.0013 0.006 0.210 0.834 -0.011 0.013
varianc_body 0.0109 0.011 0.976 0.329 -0.011 0.033
vector_body -0.0105 0.014 -0.769 0.442 -0.037 0.016
want_body -0.0788 0.012 -6.310 0.000 -0.103 -0.054
way_body 0.0193 0.014 1.335 0.182 -0.009 0.048
weight_body -0.0093 0.012 -0.768 0.443 -0.033 0.014
well_body -0.0071 0.024 -0.296 0.767 -0.054 0.040
whether_body -3.55e-05 0.019 -0.002 0.998 -0.037 0.037
without_body 0.0096 0.026 0.372 0.710 -0.041 0.060
wonder_body -0.0131 0.026 -0.512 0.609 -0.063 0.037
work_body -0.0529 0.016 -3.385 0.001 -0.083 -0.022
would_body -0.0076 0.008 -0.910 0.363 -0.024 0.009
wrong_body 0.0669 0.027 2.440 0.015 0.013 0.121
accuraci_title -0.1710 0.083 -2.057 0.040 -0.334 -0.008
adjust_title -0.0851 0.096 -0.889 0.374 -0.273 0.102
algorithm_title -0.1359 0.069 -1.966 0.049 -0.271 -0.000
analysi_title 0.0345 0.046 0.748 0.455 -0.056 0.125
anova_title -0.2073 0.069 -3.011 0.003 -0.342 -0.072
appli_title -0.2125 0.091 -2.335 0.020 -0.391 -0.034
approach_title -0.1082 0.100 -1.079 0.281 -0.305 0.088
appropri_title 0.0395 0.098 0.405 0.685 -0.152 0.231
arima_title 0.3569 0.099 3.615 0.000 0.163 0.550
assumpt_title 0.1042 0.077 1.352 0.177 -0.047 0.255
averag_title -0.1452 0.075 -1.939 0.052 -0.292 0.002
base_title -0.1345 0.075 -1.804 0.071 -0.281 0.012
bay_title 0.2467 0.100 2.460 0.014 0.050 0.443
bayesian_title -0.0329 0.060 -0.551 0.581 -0.150 0.084
best_title -0.1792 0.070 -2.562 0.010 -0.316 -0.042
beta_title -0.0758 0.096 -0.790 0.430 -0.264 0.112
bia_title 0.0151 0.095 0.158 0.874 -0.172 0.202
binari_title 0.1284 0.073 1.761 0.078 -0.014 0.271
binomi_title 0.1722 0.058 2.980 0.003 0.059 0.285
bootstrap_title 0.0209 0.091 0.231 0.817 -0.157 0.198
calcul_title -0.0185 0.041 -0.450 0.653 -0.099 0.062
case_title -0.0588 0.092 -0.638 0.523 -0.239 0.122
categor_title 0.1212 0.067 1.807 0.071 -0.010 0.253
chang_title 0.2765 0.069 4.027 0.000 0.142 0.411
check_title -0.1459 0.107 -1.362 0.173 -0.356 0.064
choos_title 0.1208 0.085 1.423 0.155 -0.046 0.287
class_title -0.1537 0.082 -1.868 0.062 -0.315 0.008
classif_title -0.1835 0.064 -2.867 0.004 -0.309 -0.058
classifi_title 0.0320 0.084 0.379 0.705 -0.134 0.198
cluster_title -0.1309 0.056 -2.330 0.020 -0.241 -0.021
coeffici_title 0.0671 0.051 1.325 0.185 -0.032 0.166
combin_title -0.0208 0.084 -0.246 0.805 -0.186 0.144
compar_title -0.1234 0.052 -2.382 0.017 -0.225 -0.022
comparison_title -0.0619 0.086 -0.723 0.469 -0.230 0.106
compon_title -0.1629 0.094 -1.734 0.083 -0.347 0.021
comput_title -0.0857 0.073 -1.179 0.239 -0.228 0.057
condit_title 0.0265 0.062 0.427 0.669 -0.095 0.148
confid_title 0.0262 0.091 0.288 0.773 -0.152 0.204
continu_title 0.0258 0.068 0.380 0.704 -0.107 0.159
control_title 0.0085 0.091 0.094 0.925 -0.169 0.186
converg_title 0.1880 0.087 2.161 0.031 0.017 0.359
correct_title 0.1176 0.070 1.685 0.092 -0.019 0.254
correl_title -0.0232 0.047 -0.498 0.619 -0.115 0.068
count_title 0.2376 0.093 2.562 0.010 0.056 0.419
covari_title -0.0828 0.065 -1.276 0.202 -0.210 0.044
creat_title -0.3520 0.105 -3.357 0.001 -0.558 -0.146
cross_title 0.3411 0.081 4.226 0.000 0.183 0.499
curv_title 0.0786 0.081 0.965 0.335 -0.081 0.238
data_title 0.0034 0.026 0.131 0.896 -0.048 0.054
dataset_title 0.0495 0.061 0.812 0.417 -0.070 0.169
deal_title -0.0692 0.099 -0.700 0.484 -0.263 0.125
decis_title -0.3220 0.104 -3.104 0.002 -0.525 -0.119
densiti_title 0.1701 0.085 2.013 0.044 0.004 0.336
depend_title -0.2479 0.064 -3.873 0.000 -0.373 -0.122
deriv_title -0.0097 0.079 -0.122 0.903 -0.164 0.145
design_title 0.0355 0.079 0.449 0.653 -0.119 0.190
detect_title -0.1400 0.084 -1.664 0.096 -0.305 0.025
determin_title -0.2792 0.071 -3.941 0.000 -0.418 -0.140
deviat_title 0.0096 0.093 0.104 0.918 -0.173 0.192
differ_title 0.0071 0.029 0.246 0.806 -0.049 0.064
discret_title -0.0063 0.089 -0.071 0.943 -0.181 0.168
distanc_title 0.2677 0.096 2.784 0.005 0.079 0.456
distribut_title -0.0578 0.029 -2.012 0.044 -0.114 -0.001
effect_title -0.0171 0.042 -0.412 0.680 -0.099 0.064
equal_title -0.0721 0.097 -0.747 0.455 -0.261 0.117
equat_title -0.0292 0.093 -0.315 0.753 -0.211 0.152
error_title -0.0177 0.044 -0.400 0.689 -0.105 0.069
estim_title -0.0963 0.035 -2.763 0.006 -0.165 -0.028
evalu_title -0.4491 0.099 -4.553 0.000 -0.642 -0.256
event_title 0.0227 0.087 0.262 0.793 -0.147 0.192
exampl_title 0.2242 0.099 2.268 0.023 0.030 0.418
expect_title 0.1668 0.064 2.587 0.010 0.040 0.293
exponenti_title 0.0771 0.090 0.859 0.390 -0.099 0.253
factor_title -0.0749 0.068 -1.108 0.268 -0.207 0.058
featur_title 0.0049 0.054 0.092 0.927 -0.100 0.110
find_title 0.1023 0.053 1.936 0.053 -0.001 0.206
fit_title -0.0568 0.058 -0.985 0.325 -0.170 0.056
fix_title -0.1792 0.082 -2.177 0.030 -0.341 -0.018
forecast_title -0.0845 0.071 -1.187 0.235 -0.224 0.055
forest_title 0.1868 0.101 1.847 0.065 -0.011 0.385
formula_title 0.0957 0.101 0.946 0.344 -0.103 0.294
function_title -0.0771 0.037 -2.088 0.037 -0.150 -0.005
gaussian_title 0.0277 0.071 0.389 0.697 -0.112 0.167
gener_title -0.0519 0.055 -0.951 0.342 -0.159 0.055
get_title -0.1059 0.078 -1.357 0.175 -0.259 0.047
given_title 0.0558 0.067 0.838 0.402 -0.075 0.186
glm_title 0.1027 0.082 1.251 0.211 -0.058 0.264
good_title -0.0063 0.094 -0.067 0.947 -0.190 0.177
gradient_title -0.0178 0.087 -0.205 0.837 -0.187 0.152
group_title -0.0439 0.054 -0.820 0.412 -0.149 0.061
help_title -0.2296 0.092 -2.505 0.012 -0.409 -0.050
high_title 0.0585 0.088 0.665 0.506 -0.114 0.231
hypothesi_title 0.1800 0.072 2.488 0.013 0.038 0.322
import_title 0.1839 0.099 1.860 0.063 -0.010 0.378
includ_title 0.1591 0.098 1.619 0.105 -0.034 0.352
independ_title 0.0806 0.058 1.394 0.163 -0.033 0.194
infer_title 0.1307 0.091 1.439 0.150 -0.047 0.309
inform_title -0.1084 0.089 -1.218 0.223 -0.283 0.066
input_title -0.2184 0.090 -2.436 0.015 -0.394 -0.043
interact_title 0.2357 0.069 3.423 0.001 0.101 0.371
interpret_title -0.1227 0.046 -2.663 0.008 -0.213 -0.032
interv_title -0.0979 0.086 -1.142 0.254 -0.266 0.070
joint_title 0.2078 0.099 2.092 0.036 0.013 0.402
kernel_title -0.0010 0.094 -0.011 0.991 -0.185 0.183
larg_title -0.0318 0.088 -0.362 0.718 -0.204 0.140
learn_title -0.1547 0.058 -2.662 0.008 -0.269 -0.041
least_title -0.2871 0.120 -2.385 0.017 -0.523 -0.051
level_title -0.2170 0.071 -3.048 0.002 -0.356 -0.077
likelihood_title 0.1303 0.069 1.878 0.060 -0.006 0.266
limit_title -0.0191 0.098 -0.194 0.846 -0.212 0.174
linear_title 0.1157 0.039 2.964 0.003 0.039 0.192
log_title 0.1014 0.087 1.162 0.245 -0.070 0.273
logist_title -0.0048 0.061 -0.078 0.938 -0.125 0.115
loss_title -0.0583 0.073 -0.799 0.424 -0.201 0.085
machin_title 0.0443 0.092 0.484 0.629 -0.135 0.224
make_title 0.1209 0.074 1.641 0.101 -0.024 0.265
mani_title 0.0028 0.097 0.029 0.977 -0.187 0.192
margin_title -0.1041 0.087 -1.193 0.233 -0.275 0.067
matrix_title -0.0457 0.060 -0.764 0.445 -0.163 0.071
maximum_title -0.2415 0.098 -2.458 0.014 -0.434 -0.049
mean_title -0.0365 0.036 -1.011 0.312 -0.107 0.034
measur_title -0.0216 0.057 -0.381 0.703 -0.133 0.089
method_title 0.0366 0.053 0.690 0.490 -0.067 0.140
metric_title 0.0062 0.091 0.069 0.945 -0.172 0.184
miss_title -0.2819 0.106 -2.663 0.008 -0.489 -0.074
mix_title 0.1794 0.063 2.859 0.004 0.056 0.302
model_title -0.0902 0.022 -4.086 0.000 -0.133 -0.047
multipl_title -0.2908 0.045 -6.512 0.000 -0.378 -0.203
multivari_title 0.0409 0.071 0.573 0.567 -0.099 0.181
need_title -0.0821 0.078 -1.047 0.295 -0.236 0.072
neg_title 0.0560 0.076 0.734 0.463 -0.094 0.206
nest_title 0.2614 0.104 2.508 0.012 0.057 0.466
network_title 0.0409 0.092 0.446 0.656 -0.139 0.221
neural_title 0.0381 0.106 0.361 0.718 -0.169 0.245
nonlinear_title 0.3607 0.103 3.515 0.000 0.160 0.562
normal_title 0.0379 0.045 0.839 0.401 -0.051 0.127
number_title 0.2664 0.055 4.835 0.000 0.158 0.374
observ_title -0.0370 0.070 -0.527 0.598 -0.175 0.101
ol_title 0.1776 0.094 1.899 0.058 -0.006 0.361
one_title 0.0403 0.048 0.835 0.404 -0.054 0.135
optim_title -0.0538 0.076 -0.707 0.480 -0.203 0.095
order_title -0.1436 0.087 -1.659 0.097 -0.313 0.026
outcom_title 0.0233 0.086 0.272 0.786 -0.145 0.191
output_title -0.0322 0.069 -0.469 0.639 -0.167 0.102
packag_title -0.1802 0.090 -1.999 0.046 -0.357 -0.004
panel_title -0.0929 0.103 -0.899 0.368 -0.295 0.110
paramet_title -0.0640 0.050 -1.271 0.204 -0.163 0.035
pca_title 0.2845 0.084 3.376 0.001 0.119 0.450
perform_title 0.0515 0.066 0.776 0.438 -0.079 0.181
plot_title -0.1234 0.068 -1.817 0.069 -0.256 0.010
point_title -0.0458 0.074 -0.622 0.534 -0.190 0.098
poisson_title 0.0341 0.078 0.438 0.661 -0.119 0.187
popul_title -0.0072 0.075 -0.096 0.924 -0.154 0.140
posit_title -0.1413 0.095 -1.486 0.137 -0.328 0.045
possibl_title -0.0117 0.078 -0.150 0.880 -0.164 0.141
posterior_title -0.1763 0.093 -1.904 0.057 -0.358 0.005
power_title -0.3977 0.094 -4.237 0.000 -0.582 -0.214
predict_title -0.2108 0.046 -4.606 0.000 -0.301 -0.121
predictor_title -0.0070 0.068 -0.103 0.918 -0.139 0.125
prior_title 0.2924 0.085 3.431 0.001 0.125 0.459
probabl_title 0.0386 0.041 0.952 0.341 -0.041 0.118
problem_title 0.1288 0.063 2.030 0.042 0.004 0.253
process_title -0.0433 0.067 -0.650 0.516 -0.174 0.087
proport_title -0.2158 0.076 -2.842 0.004 -0.365 -0.067
pvalu_title -0.1567 0.080 -1.951 0.051 -0.314 0.001
python_title -0.0197 0.091 -0.217 0.828 -0.198 0.158
question_title -0.1008 0.061 -1.648 0.099 -0.221 0.019
random_title 0.0129 0.038 0.334 0.738 -0.063 0.088
rank_title -0.4795 0.094 -5.118 0.000 -0.663 -0.296
rate_title -0.1636 0.074 -2.225 0.026 -0.308 -0.020
ratio_title -0.3235 0.079 -4.078 0.000 -0.479 -0.168
regress_title -0.0647 0.029 -2.250 0.024 -0.121 -0.008
relat_title -0.1880 0.096 -1.965 0.049 -0.376 -0.000
relationship_title 0.2907 0.086 3.396 0.001 0.123 0.458
repeat_title 0.0819 0.088 0.932 0.351 -0.090 0.254
residu_title 0.0824 0.078 1.056 0.291 -0.071 0.235
respons_title 0.0886 0.094 0.942 0.346 -0.096 0.273
result_title -0.1240 0.053 -2.348 0.019 -0.227 -0.020
run_title 0.0686 0.102 0.675 0.500 -0.131 0.268
sampl_title -0.0799 0.034 -2.318 0.020 -0.147 -0.012
scale_title -0.1037 0.068 -1.518 0.129 -0.238 0.030
score_title -0.2362 0.066 -3.569 0.000 -0.366 -0.107
select_title 0.0577 0.065 0.887 0.375 -0.070 0.185
seri_title -0.0977 0.062 -1.568 0.117 -0.220 0.024
set_title 0.0819 0.053 1.547 0.122 -0.022 0.186
show_title 0.2246 0.096 2.343 0.019 0.037 0.412
signific_title -0.0472 0.057 -0.832 0.405 -0.158 0.064
similar_title -0.1425 0.104 -1.371 0.170 -0.346 0.061
simpl_title -0.2576 0.094 -2.751 0.006 -0.441 -0.074
simul_title -0.0629 0.092 -0.684 0.494 -0.243 0.117
singl_title 0.2319 0.096 2.416 0.016 0.044 0.420
size_title -0.0233 0.059 -0.395 0.693 -0.139 0.092
small_title 0.3649 0.090 4.036 0.000 0.188 0.542
specif_title 0.1101 0.095 1.155 0.248 -0.077 0.297
squar_title 0.0305 0.076 0.402 0.688 -0.118 0.179
standard_title 0.0780 0.061 1.277 0.202 -0.042 0.198
statist_title 0.0136 0.042 0.323 0.747 -0.069 0.096
structur_title 0.1739 0.103 1.688 0.091 -0.028 0.376
studi_title -0.0209 0.096 -0.217 0.829 -0.210 0.168
sum_title 0.2743 0.079 3.492 0.000 0.120 0.428
surviv_title -0.2068 0.101 -2.046 0.041 -0.405 -0.009
term_title -0.1275 0.070 -1.834 0.067 -0.264 0.009
test_title -0.0458 0.028 -1.614 0.107 -0.101 0.010
time_title 0.0176 0.048 0.370 0.711 -0.076 0.111
train_title -0.0193 0.065 -0.297 0.766 -0.146 0.108
transform_title -0.0268 0.066 -0.407 0.684 -0.156 0.102
treatment_title 0.0066 0.101 0.066 0.948 -0.191 0.205
tree_title 0.0280 0.104 0.268 0.788 -0.177 0.233
ttest_title 0.0317 0.081 0.391 0.696 -0.127 0.190
two_title -0.0953 0.039 -2.419 0.016 -0.173 -0.018
type_title -0.3511 0.089 -3.942 0.000 -0.526 -0.177
understand_title 0.2438 0.073 3.357 0.001 0.101 0.386
use_title -0.0667 0.025 -2.696 0.007 -0.115 -0.018
valid_title -0.1490 0.068 -2.208 0.027 -0.281 -0.017
valu_title -0.0493 0.037 -1.324 0.186 -0.122 0.024
variabl_title -0.0187 0.028 -0.672 0.501 -0.073 0.036
varianc_title 0.0715 0.047 1.528 0.126 -0.020 0.163
vector_title -0.2694 0.082 -3.287 0.001 -0.430 -0.109
vs_title 0.0729 0.046 1.599 0.110 -0.016 0.162
way_title 0.0843 0.065 1.290 0.197 -0.044 0.213
weight_title -0.0632 0.065 -0.978 0.328 -0.190 0.063
within_title -0.5571 0.105 -5.317 0.000 -0.763 -0.352
without_title 0.1548 0.086 1.797 0.072 -0.014 0.324
work_title 0.2608 0.102 2.558 0.011 0.061 0.461
would_title -0.1508 0.107 -1.404 0.160 -0.361 0.060
zero_title -0.2083 0.084 -2.489 0.013 -0.372 -0.044
additive_tags -0.0926 nan nan nan nan nan
aic_tags 0.3571 0.111 3.218 0.001 0.140 0.575
analysis_tags -0.0390 0.038 -1.015 0.310 -0.114 0.036
anova_tags 0.1204 0.056 2.135 0.033 0.010 0.231
arima_tags -0.0490 0.068 -0.725 0.468 -0.181 0.083
autocorrelation_tags 0.0627 0.086 0.726 0.468 -0.107 0.232
bayes_tags -0.0583 0.094 -0.619 0.536 -0.243 0.126
bayesian_tags 0.1128 0.039 2.876 0.004 0.036 0.190
bias_tags 0.1038 0.085 1.217 0.224 -0.063 0.271
binary_tags -0.0253 0.099 -0.256 0.798 -0.219 0.168
binomial_tags -0.0429 0.080 -0.533 0.594 -0.200 0.115
biostatistics_tags -0.0870 0.097 -0.892 0.372 -0.278 0.104
boosting_tags 0.1683 0.077 2.188 0.029 0.018 0.319
bootstrap_tags -0.0859 0.078 -1.100 0.272 -0.239 0.067
carlo_tags -0.0181 nan nan nan nan nan
cart_tags 0.0063 0.149 0.042 0.966 -0.286 0.298
categorical_tags 0.1886 0.078 2.421 0.015 0.036 0.341
causality_tags 0.5869 0.073 8.027 0.000 0.444 0.730
central_tags 0.2708 0.191 1.414 0.157 -0.104 0.646
chain_tags 0.1283 nan nan nan nan nan
chi_tags -0.2794 0.138 -2.018 0.044 -0.551 -0.008
classes_tags -0.1665 1.2e+08 -1.38e-09 1.000 -2.36e+08 2.36e+08
classification_tags 0.0372 0.043 0.870 0.384 -0.047 0.121
clustering_tags -0.0702 0.055 -1.274 0.203 -0.178 0.038
coefficients_tags -0.1227 0.081 -1.508 0.132 -0.282 0.037
comparison_tags 0.1843 0.115 1.608 0.108 -0.040 0.409
comparisons_tags 0.2929 0.095 3.084 0.002 0.107 0.479
conditional_tags -0.3160 0.070 -4.491 0.000 -0.454 -0.178
confidence_tags -0.2771 0.148 -1.875 0.061 -0.567 0.013
conv_tags -0.2106 0.207 -1.018 0.309 -0.616 0.195
convergence_tags -0.0554 0.100 -0.555 0.579 -0.251 0.140
correlation_tags -0.0698 0.044 -1.573 0.116 -0.157 0.017
covariance_tags 0.1549 0.056 2.768 0.006 0.045 0.265
cox_tags 0.2254 0.120 1.885 0.059 -0.009 0.460
cross_tags 0.0165 0.082 0.201 0.840 -0.144 0.177
cumulative_tags -0.4416 0.130 -3.389 0.001 -0.697 -0.186
data_tags -0.0330 0.038 -0.875 0.381 -0.107 0.041
dataset_tags -0.0728 0.074 -0.988 0.323 -0.217 0.072
density_tags -0.4337 0.098 -4.425 0.000 -0.626 -0.242
descriptive_tags -0.2126 0.100 -2.130 0.033 -0.408 -0.017
design_tags -0.0184 0.195 -0.095 0.925 -0.401 0.364
detection_tags -0.0715 0.094 -0.757 0.449 -0.257 0.114
deviation_tags -0.0713 0.118 -0.603 0.546 -0.303 0.160
dimensionality_tags -0.0360 7.74e+06 -4.65e-09 1.000 -1.52e+07 1.52e+07
distance_tags 0.0314 0.104 0.301 0.763 -0.173 0.236
distribution_tags 0.0621 0.032 1.925 0.054 -0.001 0.125
distributions_tags 0.1161 0.039 3.005 0.003 0.040 0.192
econometrics_tags 0.3147 0.066 4.744 0.000 0.185 0.445
effect_tags -0.1008 0.090 -1.117 0.264 -0.277 0.076
effects_tags -0.0251 0.105 -0.240 0.810 -0.230 0.180
encoding_tags -0.1794 0.131 -1.370 0.171 -0.436 0.077
engineering_tags -0.2039 0.105 -1.933 0.053 -0.411 0.003
entropy_tags 0.2899 0.088 3.285 0.001 0.117 0.463
error_tags 0.0631 0.056 1.135 0.256 -0.046 0.172
estimation_tags 0.3297 0.060 5.452 0.000 0.211 0.448
estimators_tags 0.0954 0.099 0.961 0.336 -0.099 0.290
evaluation_tags 0.2309 0.043 5.334 0.000 0.146 0.316
expectation_tags 0.2100 0.106 1.977 0.048 0.002 0.418
expected_tags 0.0651 0.093 0.699 0.484 -0.117 0.248
experiment_tags 0.2995 0.217 1.378 0.168 -0.126 0.726
exponential_tags -0.0274 0.091 -0.300 0.764 -0.207 0.152
factor_tags 0.0236 0.086 0.275 0.783 -0.144 0.192
feature_tags -0.2445 0.047 -5.148 0.000 -0.338 -0.151
fitting_tags 0.0643 0.086 0.752 0.452 -0.103 0.232
fixed_tags 0.2297 0.125 1.839 0.066 -0.015 0.474
forecasting_tags -0.2171 0.062 -3.494 0.000 -0.339 -0.095
forest_tags -0.2411 0.117 -2.053 0.040 -0.471 -0.011
function_tags 0.3345 0.087 3.833 0.000 0.163 0.506
functions_tags -0.3952 0.201 -1.964 0.050 -0.790 -0.001
gaussian_tags 0.1123 0.080 1.411 0.158 -0.044 0.268
generalized_tags -0.0151 0.080 -0.189 0.850 -0.172 0.142
glmm_tags -0.1424 0.096 -1.480 0.139 -0.331 0.046
gradient_tags 0.1263 0.090 1.404 0.160 -0.050 0.303
heteroscedasticity_tags -0.2023 0.107 -1.894 0.058 -0.412 0.007
hierarchical_tags 0.0072 0.098 0.074 0.941 -0.184 0.199
hypothesis_tags -0.0333 4.12e+07 -8.09e-10 1.000 -8.07e+07 8.07e+07
in_tags 0.0382 0.102 0.373 0.709 -0.163 0.239
independence_tags 0.0502 0.081 0.618 0.537 -0.109 0.210
inference_tags -0.0921 0.055 -1.667 0.095 -0.200 0.016
inflation_tags 0.0001 0.107 0.001 0.999 -0.210 0.210
information_tags -0.1374 0.081 -1.687 0.092 -0.297 0.022
interaction_tags -0.2351 0.072 -3.252 0.001 -0.377 -0.093
interpretation_tags 0.0944 0.073 1.292 0.196 -0.049 0.238
interval_tags 0.2379 0.132 1.803 0.071 -0.021 0.497
keras_tags 0.1577 0.102 1.551 0.121 -0.042 0.357
kernel_tags 0.0224 0.121 0.185 0.853 -0.214 0.259
language_tags -0.3193 0.255 -1.250 0.211 -0.820 0.181
lasso_tags 0.2713 0.091 2.977 0.003 0.093 0.450
learn_tags -0.0047 nan nan nan nan nan
learning_tags -0.1611 0.059 -2.744 0.006 -0.276 -0.046
least_tags 0.1462 0.311 0.470 0.638 -0.463 0.755
likelihood_tags 0.0076 0.065 0.117 0.907 -0.120 0.135
linear_tags 0.0529 0.043 1.221 0.222 -0.032 0.138
lme4_tags -0.0099 nan nan nan nan nan
logistic_tags 0.0391 0.048 0.809 0.419 -0.056 0.134
loss_tags 0.2454 0.187 1.313 0.189 -0.121 0.612
lstm_tags -0.1643 0.099 -1.656 0.098 -0.359 0.030
machine_tags 0.1701 0.063 2.705 0.007 0.047 0.293
markov_tags -0.1263 0.064 -1.975 0.048 -0.252 -0.001
mathematical_tags -0.2066 0.078 -2.648 0.008 -0.360 -0.054
matrix_tags 0.1747 0.067 2.620 0.009 0.044 0.305
maximum_tags 0.1095 0.090 1.215 0.224 -0.067 0.286
mean_tags -0.0045 0.061 -0.073 0.942 -0.125 0.116
measures_tags 0.0017 5.3e+06 3.29e-10 1.000 -1.04e+07 1.04e+07
meta_tags 0.2759 0.089 3.110 0.002 0.102 0.450
missing_tags 0.1749 0.113 1.544 0.123 -0.047 0.397
mixed_tags 0.7204 0.049 14.843 0.000 0.625 0.816
model_tags -0.1663 0.105 -1.583 0.113 -0.372 0.040
modeling_tags -0.0326 0.066 -0.491 0.623 -0.162 0.097
models_tags 0.1054 0.089 1.188 0.235 -0.068 0.279
monte_tags -0.0181 nan nan nan nan nan
montecarlo_tags 0.1283 nan nan nan nan nan
multicollinearity_tags -0.1553 0.101 -1.540 0.124 -0.353 0.042
multilevel_tags 0.0982 0.086 1.137 0.255 -0.071 0.267
multinomial_tags 0.0980 0.094 1.044 0.296 -0.086 0.282
multiple_tags -0.0693 0.043 -1.607 0.108 -0.154 0.015
multivariate_tags -0.0244 0.063 -0.386 0.699 -0.148 0.100
natural_tags 0.2808 0.286 0.981 0.326 -0.280 0.842
negative_tags 0.3065 0.130 2.350 0.019 0.051 0.562
network_tags 0.1249 0.128 0.978 0.328 -0.125 0.375
networks_tags 0.2331 0.152 1.533 0.125 -0.065 0.531
neural_tags -0.2099 0.154 -1.366 0.172 -0.511 0.091
nlme_tags -0.0099 nan nan nan nan nan
nonlinear_tags -0.2817 0.093 -3.014 0.003 -0.465 -0.099
nonparametric_tags 0.0989 0.077 1.285 0.199 -0.052 0.250
normal_tags -0.0006 0.055 -0.012 0.991 -0.109 0.108
normalization_tags -0.1150 0.083 -1.380 0.168 -0.278 0.048
of_tags -0.0935 0.071 -1.315 0.189 -0.233 0.046
optimization_tags -0.1503 0.057 -2.619 0.009 -0.263 -0.038
ordinal_tags -0.0304 0.112 -0.270 0.787 -0.251 0.190
outliers_tags 0.0182 0.106 0.172 0.864 -0.190 0.226
overfitting_tags 0.2214 0.103 2.159 0.031 0.020 0.422
panel_tags -0.1383 0.080 -1.737 0.082 -0.294 0.018
pca_tags -0.0371 0.069 -0.539 0.590 -0.172 0.098
poisson_tags -0.0338 0.063 -0.534 0.593 -0.158 0.090
posterior_tags -0.1392 0.110 -1.263 0.207 -0.355 0.077
power_tags -0.0875 0.355 -0.246 0.805 -0.784 0.608
prediction_tags 0.0833 0.071 1.172 0.241 -0.056 0.223
predictive_tags -0.2722 0.114 -2.390 0.017 -0.495 -0.049
prior_tags -0.0087 0.090 -0.097 0.923 -0.185 0.168
probability_tags 0.2056 0.032 6.413 0.000 0.143 0.268
process_tags -0.0440 0.093 -0.471 0.637 -0.227 0.139
processes_tags -0.6617 0.246 -2.688 0.007 -1.144 -0.179
python_tags -0.1881 0.042 -4.444 0.000 -0.271 -0.105
random_tags 0.2024 0.062 3.256 0.001 0.081 0.324
ratio_tags -0.0608 0.098 -0.624 0.533 -0.252 0.130
reduction_tags -0.0360 7.74e+06 -4.65e-09 1.000 -1.52e+07 1.52e+07
references_tags 0.2471 0.088 2.795 0.005 0.074 0.420
regression_tags 0.0127 0.021 0.609 0.543 -0.028 0.053
regularization_tags 0.2561 0.087 2.946 0.003 0.086 0.426
reinforcement_tags 0.0386 0.110 0.351 0.725 -0.177 0.254
repeated_tags 0.0017 5.34e+06 3.26e-10 1.000 -1.05e+07 1.05e+07
residuals_tags 0.1781 0.089 2.004 0.045 0.004 0.352
sample_tags 0.0288 0.064 0.449 0.653 -0.097 0.155
sampling_tags 0.0849 0.050 1.692 0.091 -0.013 0.183
scikit_tags -0.0047 nan nan nan nan nan
selection_tags -0.0115 0.035 -0.326 0.744 -0.080 0.058
self_tags -0.0390 nan nan nan nan nan
series_tags -0.2308 0.064 -3.596 0.000 -0.357 -0.105
significance_tags -0.3856 0.382 -1.009 0.313 -1.134 0.363
simulation_tags -0.0422 0.060 -0.707 0.480 -0.159 0.075
size_tags -0.1392 0.094 -1.487 0.137 -0.323 0.044
smoothing_tags -0.0870 0.113 -0.769 0.442 -0.309 0.135
spss_tags 0.0479 0.105 0.458 0.647 -0.157 0.253
squared_tags 0.2704 0.109 2.474 0.013 0.056 0.485
squares_tags -0.1948 0.310 -0.628 0.530 -0.802 0.413
standard_tags -0.1023 0.089 -1.143 0.253 -0.278 0.073
stata_tags 0.0042 0.091 0.046 0.963 -0.174 0.182
stationarity_tags 0.2265 0.093 2.432 0.015 0.044 0.409
statistical_tags 0.3912 0.377 1.039 0.299 -0.347 1.129
statistics_tags 0.1814 0.068 2.664 0.008 0.048 0.315
stochastic_tags 0.6387 0.228 2.802 0.005 0.192 1.085
structural_tags 0.2348 0.116 2.029 0.042 0.008 0.462
study_tags -0.0573 nan nan nan nan nan
survey_tags -0.1683 0.073 -2.300 0.021 -0.312 -0.025
survival_tags -0.2339 0.072 -3.253 0.001 -0.375 -0.093
svm_tags 0.0260 0.085 0.307 0.759 -0.140 0.192
tensorflow_tags -0.4961 0.108 -4.607 0.000 -0.707 -0.285
terminology_tags 0.3247 0.093 3.493 0.000 0.142 0.507
test_tags -0.0951 0.037 -2.605 0.009 -0.167 -0.024
testing_tags -0.0333 4.14e+07 -8.04e-10 1.000 -8.12e+07 8.12e+07
theorem_tags 0.1210 0.191 0.632 0.527 -0.254 0.496
theory_tags 0.2070 0.082 2.521 0.012 0.046 0.368
time_tags 0.2015 0.051 3.954 0.000 0.102 0.301
transformation_tags 0.0141 0.082 0.172 0.864 -0.147 0.175
unbalanced_tags -0.1665 1.2e+08 -1.38e-09 1.000 -2.36e+08 2.36e+08
validation_tags -0.0433 0.088 -0.494 0.622 -0.215 0.129
value_tags -0.1477 0.062 -2.369 0.018 -0.270 -0.025
variable_tags -0.1243 0.068 -1.828 0.068 -0.258 0.009
variance_tags -0.0954 0.049 -1.930 0.054 -0.192 0.001
visualization_tags 0.1339 0.081 1.647 0.100 -0.025 0.293
wilcoxon_tags 0.3231 0.102 3.153 0.002 0.122 0.524
===========================================================================================
Execution error
The Logistic Regression model using all features in the dataframe has an accuracy of 0.528. I included all features in this section because the words that appear in the questions were already filtered in Part A.
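A minimal sketch of how an accuracy figure like the one above is typically computed from a fitted logit model's predicted probabilities. It assumes a 0.5 cutoff, which the notebook does not state; the function name and the toy arrays are hypothetical and are not taken from 'train.csv' or 'test.csv'.

```python
import numpy as np

def accuracy_at_threshold(y_true, y_prob, threshold=0.5):
    """Share of observations whose thresholded probability matches the label."""
    y_true = np.asarray(y_true)
    # Classify as useful (1) when the predicted probability reaches the cutoff.
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return float(np.mean(y_pred == y_true))

# Toy values for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_prob = np.array([0.62, 0.48, 0.55, 0.41, 0.51, 0.30])
acc = accuracy_at_threshold(y_true, y_prob)
```

With a statsmodels Logit result, `y_prob` would be the array returned by `predict` on the test features.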
LDA
The accuracy of the LDA model from this trial is 0.529, just slightly higher than Logistic Regression.
CART
Fitting 10 folds for each of 11 candidates, totalling 110 fits
The accuracy of the CART model is 0.527. Note that the selected ccp_alpha may not be the true optimum, because the cross-validation grid (linspace) is coarser than the default setting.
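The grid search described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual cell: the data is synthetic, and the ccp_alpha range (0 to 0.10) is a guess. Only the 11 candidates and the 10 folds come from the log line above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and the isUseful label.
X, y = make_classification(n_samples=400, n_features=20, random_state=88)

# 11 evenly spaced candidates; the 0 to 0.10 range is an assumption.
param_grid = {"ccp_alpha": np.linspace(0.0, 0.10, 11)}

cart_cv = GridSearchCV(
    DecisionTreeClassifier(random_state=88),
    param_grid,
    scoring="accuracy",
    cv=10,       # 10 folds x 11 candidates = 110 fits
    verbose=1,   # prints the "Fitting 10 folds for each of 11 candidates" line
)
cart_cv.fit(X, y)

best_alpha = cart_cv.best_params_["ccp_alpha"]
mean_scores = cart_cv.cv_results_["mean_test_score"]
```

A coarse grid like this one can miss the best value between two grid points; a second, finer linspace around `best_alpha` is one way to check.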
Random Forests
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76990 entries, 0 to 76989
Columns: 618 entries, abl_body to wilcoxon_tags
dtypes: int64(618)
memory usage: 363.0 MB
The accuracy of the Random Forest model is 0.500, the lowest so far. However, it is too early to judge this model, as the models have not yet been validated with bootstrapping.
Boosting
The accuracy of the Boosting model in this trial is 0.514.
Bootstrapping for final model
After evaluating the models with accuracy, I came to believe that TPR (true positive rate) may be the better metric for this scenario, since we aim to maximize the number of useful questions that appear at the top. Below is my attempt to apply it. Note that the visualizations are missing because I was not able to resolve the error "bootstrap_validation() got multiple values for argument 'metrics_list'" (shown under the Bootstrap for LDA section). The sample size is reduced to 500 to keep the run time manageable. The code for all models is still shown here.
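The sketch below shows one way a TPR bootstrap could look, and how the "multiple values for argument" error can arise in Python. The `bootstrap_validation` signature here is hypothetical, since the notebook's real helper is not shown; `SignModel` and the random data are stand-ins as well. In general, this TypeError appears when the positional arguments already fill a parameter that is then passed again by keyword.

```python
import numpy as np

def tpr(y_true, y_pred):
    """True positive rate: TP / (TP + FN)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    positives = y_true == 1
    return float(np.mean(y_pred[positives] == 1)) if positives.any() else float("nan")

# Hypothetical signature; the notebook's real helper may differ.
def bootstrap_validation(X_test, y_test, model, metrics_list, sample=500, random_state=66):
    rng = np.random.default_rng(random_state)
    n = len(y_test)
    out = np.empty((sample, len(metrics_list)))
    for b in range(sample):
        idx = rng.integers(0, n, size=n)  # resample test rows with replacement
        preds = model.predict(X_test[idx])
        out[b] = [metric(y_test[idx], preds) for metric in metrics_list]
    return out

class SignModel:
    """Stand-in classifier: predicts 1 when the first feature is positive."""
    def predict(self, X):
        return (X[:, 0] > 0).astype(int)

rng = np.random.default_rng(0)
X_test = rng.normal(size=(200, 3))
y_test = (X_test[:, 0] + rng.normal(size=200) > 0).astype(int)
y_train = y_test.copy()  # placeholder for an extra positional argument

# Four positional arguments already fill metrics_list, so the keyword repeats it.
try:
    bootstrap_validation(X_test, y_test, y_train, SignModel(), metrics_list=[tpr])
except TypeError as err:
    error_message = str(err)

# Fix: make the positional arguments match the function's signature.
tpr_samples = bootstrap_validation(X_test, y_test, SignModel(), metrics_list=[tpr])
```

Comparing the call against the helper's `def` line, argument by argument, is usually the quickest way to find which positional value landed in `metrics_list`.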
Bootstrap for Logistic Regression
Bootstrap for LDA
Execution error
Bootstrap for CART
Bootstrap for RF
Bootstrap for Boosting
If the cells above had worked, I would be able to see the distribution of TPR for each model considered in this assignment, compare their means and standard deviations, and potentially get an intuition for which model has the highest TPR and thus the highest quality.
Part C: Best Model
In Part C, I have included the code that calculates 95% confidence intervals for the distributions derived above. This would help determine which models have non-overlapping intervals, in which case it is safe to say that their quality distributions differ. Unfortunately, the calculations are again incomplete because the code requires values that were to be computed in the previous section. I apologize for the inconvenience this causes for grading.
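A minimal sketch of the interval comparison described here, assuming percentile-based bootstrap intervals (the notebook does not say which interval type it uses). The two arrays are made-up stand-ins for bootstrap TPR draws, not results from this assignment, and the helper names are hypothetical.

```python
import numpy as np

def percentile_ci(samples, level=0.95):
    """Percentile bootstrap interval: cut (1 - level) / 2 from each tail."""
    tail = (1.0 - level) / 2.0 * 100.0
    lo, hi = np.percentile(samples, [tail, 100.0 - tail])
    return float(lo), float(hi)

def intervals_overlap(a, b):
    """True when the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Made-up bootstrap draws for two hypothetical models.
rng = np.random.default_rng(1)
tpr_model_a = rng.normal(0.55, 0.02, size=500)
tpr_model_b = rng.normal(0.85, 0.02, size=500)

ci_a = percentile_ci(tpr_model_a)
ci_b = percentile_ci(tpr_model_b)
distinct = not intervals_overlap(ci_a, ci_b)
```

Non-overlapping intervals are a conservative sign that two models differ; overlapping intervals do not by themselves show that the models are equivalent.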
Though I do not have the actual 95% CI values for the models as I had hoped, I predict that it will be very difficult to identify a specific model that best accomplishes the goal, as the accuracies of the models are fairly close to one another (0.500 to 0.529). If the 95% CIs overlap, it is most likely impossible to conclude that one model works best. However, the differences may be more visible when TPR is used as the metric instead of accuracy.
Among the models measured by accuracy, CART has one of the highest accuracies. In the cells below, I quantify the quality of this model using TPR after optimizing the ccp_alpha value.
Fitting 10 folds for each of 11 candidates, totalling 110 fits
Unfortunately, cross-validation on CART results in a very low TPR score, as shown above. Changing direction, I decided to examine the model with the lowest accuracy in the previous section, Random Forest. In the cells below, I attempt a small-sample CV on Random Forest to see how the quality changes.
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END lines (min_samples_leaf=5, n_estimators=500, random_state=88; 5 folds per candidate), total time per fold:
max_features=1: 31.3s, 30.9s, 31.3s, 30.7s, 31.1s
max_features=2: 39.3s, 39.3s, 39.8s, 39.6s, 39.6s
max_features=3: 47.9s, 48.2s, 48.6s, 48.0s, 48.0s
max_features=4: 55.9s, 55.9s, 56.5s, 56.1s, 57.5s
max_features=5: 1.1min, 1.1min, 1.1min, 1.1min, 1.1min
max_features=6: 1.2min, 1.2min, 1.2min, 1.2min, 1.2min
max_features=7: 1.3min, 1.3min, 1.3min, 1.3min, 1.3min
max_features=8: 1.4min, 1.4min, 1.4min, 1.4min, 1.4min
max_features=9: 1.5min, 1.5min, 1.6min, 1.6min, 1.6min
max_features=10: 1.7min, 1.6min, 1.7min, 1.6min, 1.7min
To limit the running time, I restricted max_features to the range 1 to 10. This prevents the search from giving a precise picture of how performance changes across hyperparameter values. That said, the TPR of the Random Forest model is the highest, at 0.858. The change in TPR is shown above as the result of calling 'scores'. Since the model with the largest number of features has the highest TPR, we do not know whether this score is the highest attainable; ideally, we would run cross-validation with a larger number of features (say, up to 50) or until the CV TPR begins to decline. Though I have not optimized every other model, this Random Forest model with 10 features may be a strong candidate for the best model.
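A small sketch of the search described above, with TPR as the CV metric (scikit-learn's "recall" scorer is the true positive rate for the positive class). The data is synthetic and n_estimators is cut to 25 so the example runs quickly; both are assumptions made for illustration. The last line checks whether the best value sits on the upper edge of the grid, which is the situation discussed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real features and the isUseful label.
X, y = make_classification(n_samples=300, n_features=30, n_informative=8, random_state=88)

max_features_grid = list(range(1, 11))  # 1 to 10, as in the log above

rf_cv = GridSearchCV(
    RandomForestClassifier(n_estimators=25, min_samples_leaf=5, random_state=88),
    {"max_features": max_features_grid},
    scoring="recall",  # recall of class 1 is the TPR
    cv=5,
)
rf_cv.fit(X, y)

scores = rf_cv.cv_results_["mean_test_score"]  # mean CV TPR per candidate
best = rf_cv.best_params_["max_features"]

# If the winner is the largest value tried, the grid likely needs to be extended.
at_upper_edge = best == max(max_features_grid)
```

When `at_upper_edge` is true, rerunning with a wider range (for example, up to 50) shows whether the CV TPR keeps rising or starts to decline.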