A) Clean and Manipulate the data.
1. Import all the necessary packages and functions you will need to perform the exercise.
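A typical import block for this exercise might look like the following; the exact set depends on which steps you implement, so treat it as a starting sketch rather than a fixed list.

```python
# Sketch of the imports used across parts A-C of the exercise.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             confusion_matrix, classification_report,
                             roc_auc_score)
```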
2. Compute the percentage of NaN values in each column and drop the columns with more than 50% NaN.
3. Create a list containing the names of all columns with at least one NaN value. (Hint: percentage greater than zero; use .loc[].) Then drop the variables (features) with more than 30% of missing values.
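Points 2-3 can be sketched as follows; the DataFrame here is a toy stand-in for the exercise's dataset.

```python
# Sketch: per-column NaN percentage, then drop columns above a threshold.
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, np.nan, np.nan],   # 75% NaN -> dropped at 50%
                   "b": [1, 2, np.nan, 4],             # 25% NaN -> kept, but listed
                   "c": [1, 2, 3, 4]})                 # 0% NaN

nan_pct = df.isna().mean() * 100                 # percentage of NaN per column
df = df.drop(columns=nan_pct[nan_pct > 50].index)

# Columns with at least one NaN (percentage bigger than zero, via .loc)
nan_pct = df.isna().mean() * 100
cols_with_nan = list(nan_pct.loc[nan_pct > 0].index)
df = df.drop(columns=nan_pct[nan_pct > 30].index)  # 30% threshold from point 3
print(cols_with_nan, list(df.columns))             # -> ['b'] ['b', 'c']
```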
4. Fill the NaN values using the median of each column grouped by “Sector”.
number of missing values: 0
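One way to implement the grouped median fill of point 4 is with groupby plus transform, sketched here on a toy frame.

```python
# Sketch: fill NaN with the median of each numeric column within its "Sector" group.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Sector": ["Tech", "Tech", "Energy", "Energy"],
                   "ROE": [0.10, np.nan, 0.05, 0.07]})

num_cols = df.select_dtypes(include="number").columns
df[num_cols] = (df.groupby("Sector")[num_cols]
                  .transform(lambda s: s.fillna(s.median())))
print("number of missing values:", int(df[num_cols].isna().sum().sum()))  # -> 0
```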
5. Check the type of the variables, dividing them among Categorical, Dummy, and Numerical. (Hint: use select_dtypes(include=[types to include]).) Are there any dummy variables? If so, convert them.
6. Compute the correlation matrix and show the top 20 variables correlated with 2019_PRICE_VAR.
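A sketch of point 6, using a synthetic frame in which one column is built to be strongly correlated with the target:

```python
# Sketch: top-k variables most correlated (in absolute value) with 2019_PRICE_VAR.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("abcde"))
df["2019_PRICE_VAR"] = df["a"] * 2 + rng.normal(scale=0.1, size=100)

corr = df.corr()["2019_PRICE_VAR"].drop("2019_PRICE_VAR")
top20 = corr.abs().sort_values(ascending=False).head(20)
print(top20.index[0])   # -> 'a' (the variable constructed to drive the target)
```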
7. Pick 10 variables from the top 20 in point 6 and plot the scatter plot between each of them and 2019_PRICE_VAR. N.B.: if the target (Class) is among the top 20, DO NOT PICK IT.
8. Drop the variables that, in your opinion, are not needed further. You may decide not to drop any variables. N.B.: drop '2019_PRICE_VAR'.
9. Pick 10 of the top 20 correlated variables from point 6 and compute the nonlinear features, using degree=2 and including the interaction terms but not the bias, and add them to the original data as predictors.
10. Define the matrix of predictors (X) and the target (y), which is Class, from the dataset.
B) Fit and Estimate the models using a proper cross-validation.
1. Split X and y into a train set and a test set, with test size = 0.25.
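The split can be sketched as follows; stratifying on y is an optional choice (not required by the exercise) that preserves the class balance in both sets.

```python
# Sketch: 75/25 train-test split on toy data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
print(len(X_train), len(X_test))   # -> 15 5
```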
2. Fit the logistic regression using X without the nonlinear features computed at point 9.
Intercept: [2.23228142e-17]
3. Evaluate the logistic regression using: accuracy, precision, confusion matrix.
              Predicted 0   Predicted 1
Actual 0              136           875
Actual 1              141          2142
Precision train: 0.7099767981438515
Accuracy train: 0.6915604128718883
Sensitivity train: 0.13452027695351138
Specificity train: 0.938239159001314
Roc Auc train: 0.731856282599682
4. Print the classification report.
              precision    recall  f1-score   support
           0       0.49      0.13      0.21      1011
           1       0.71      0.94      0.81      2283
    accuracy                           0.69      3294
   macro avg       0.60      0.54      0.51      3294
weighted avg       0.64      0.69      0.63      3294
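The metrics in points 3-4 can all be computed with scikit-learn; the sketch below uses toy labels and predictions in place of the model's output.

```python
# Sketch of the evaluation in points 3-4, on toy predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             confusion_matrix, classification_report)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)      # rows = actual, columns = predicted
tn, fp, fn, tp = cm.ravel()
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
print("Accuracy:   ", acc)
print("Precision:  ", prec)
print("Sensitivity:", tp / (tp + fn))      # recall for class 1
print("Specificity:", tn / (tn + fp))      # recall for class 0
print(classification_report(y_true, y_pred))
```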
5. What can you say about the overall performance of the model? Is it good? Is it better at classifying 1 or 0? Why? What is the sensitivity with respect to 0? And to 1? What does this mean?
6. Fit the logistic regression using all the predictors, including the nonlinear terms.
Intercept: [7.318248e-18]
7. Evaluate the logistic regression using: accuracy, precision, confusion matrix.
              Predicted 0   Predicted 1
Actual 0              132           879
Actual 1              142          2141
Precision train: 0.7089403973509933
Accuracy train: 0.6900425015179114
Sensitivity train: 0.13056379821958458
Specificity train: 0.9378011388523873
Roc Auc train: 0.7300899046103896
8. Print the classification report.
              precision    recall  f1-score   support
           0       0.48      0.13      0.21      1011
           1       0.71      0.94      0.81      2283
    accuracy                           0.69      3294
   macro avg       0.60      0.53      0.51      3294
weighted avg       0.64      0.69      0.62      3294
9. What can you say about the overall performance of the model? Has it improved over the logistic regression above? If yes, why?
10. Using the entire dataset from point 10, fit the penalized logistic regression using Ridge penalization. To calibrate the optimal α, use the GridSearchCV function. Use at least 5 folds for the cross-validation (cv=5) and at least 50 values for alpha. As scoring, use neg_log_loss. What is the optimal value of α? HINT: remember to 'scale' the data.
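A sketch of point 10 on synthetic data. Note that scikit-learn's LogisticRegression parameterizes the penalty strength as C = 1/α, so the grid over C plays the role of the alpha grid; the pipeline scales the data as the hint suggests.

```python
# Sketch: ridge-penalized (L2) logistic regression tuned with GridSearchCV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(penalty="l2", max_iter=1000))])
grid = {"clf__C": np.logspace(-4, 4, 50)}   # 50 candidate penalty strengths
search = GridSearchCV(pipe, grid, cv=5, scoring="neg_log_loss")
search.fit(X, y)
best_alpha = 1.0 / search.best_params_["clf__C"]
print("optimal alpha:", best_alpha)
```

For point 13 (LASSO), the same sketch applies with penalty="l1" and a solver that supports it, e.g. solver="liblinear".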
11. Plot the cross-validation test neg_log_loss value versus the value of alpha. How does the graph look? Did you find the optimal α value? If not, what could be the issue?
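The CV curve of point 11 can be read off GridSearchCV's cv_results_; the sketch below fits a small grid on synthetic data and saves the plot to a file (a hypothetical filename).

```python
# Sketch: mean test neg_log_loss versus alpha from GridSearchCV results.
import matplotlib
matplotlib.use("Agg")                      # headless backend for scripting
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
Cs = np.logspace(-3, 3, 20)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": Cs}, cv=5, scoring="neg_log_loss").fit(X, y)

alphas = 1.0 / Cs                          # alpha = 1/C in sklearn's parameterization
scores = search.cv_results_["mean_test_score"]
plt.semilogx(alphas, scores, marker="o")
plt.xlabel("alpha"); plt.ylabel("mean test neg_log_loss")
plt.savefig("cv_curve.png")
print(len(scores))   # -> 20
```

If the curve is still rising (or falling) at the edge of the grid, the optimum lies outside the tested alpha range and the grid should be widened.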
12. Evaluate the Ridge penalized logistic regression using: accuracy, precision, confusion matrix.
13. Using the entire dataset from point 10, fit the penalized logistic regression using LASSO penalization. To calibrate the optimal α, use the GridSearchCV function. Use at least 5 folds for the cross-validation (cv=5) and at least 50 values for alpha. As scoring, use neg_log_loss. What is the optimal value of α?
14. Plot the cross-validation test neg_log_loss value versus the value of alpha. How does the graph look? Did you find the optimal α value? If not, what could be the issue?
15. Evaluate the Lasso penalized logistic regression using: accuracy, precision, confusion matrix.
16. Has the shrinkage (Ridge or Lasso) improved the overall performance with respect to the logistic regression? Why?
17. Using the entire dataset from point 10, fit the Decision Tree Classifier. To calibrate the optimal values of max_depth and min_samples_split, use the GridSearchCV function. Use at least 5 folds for the cross-validation (cv=5) and at least 5 values for max_depth and 5 for min_samples_split. As scoring, use neg_log_loss. What are the optimal values of max_depth and min_samples_split?
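A sketch of point 17 on synthetic data, with a 5x5 grid as the exercise requires:

```python
# Sketch: decision tree tuned over max_depth and min_samples_split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

grid = {"max_depth": [2, 4, 6, 8, 10],
        "min_samples_split": [2, 5, 10, 20, 50]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      grid, cv=5, scoring="neg_log_loss").fit(X, y)
print(search.best_params_)   # optimal max_depth and min_samples_split
```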
18. Re-fit the best model and plot the feature importance bar plot. Plot only the top 20 variables.
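The top-20 importance plot of point 18 can be sketched as below; the refit best model is stood in for by a tree with fixed hyperparameters, and the output filename is hypothetical.

```python
# Sketch: top-20 feature-importance bar plot from the refit best estimator.
import matplotlib
matplotlib.use("Agg")                      # headless backend for scripting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=25, random_state=0)
cols = [f"var_{i}" for i in range(25)]

best = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
imp = (pd.Series(best.feature_importances_, index=cols)
         .sort_values(ascending=False))
imp.head(20).plot.bar()
plt.tight_layout()
plt.savefig("importance.png")
print(len(imp.head(20)))   # -> 20
```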
19. Evaluate the Decision Tree Classifier using: accuracy, precision, confusion matrix.
20. Using the entire dataset from point 10, fit the Random Forest Classifier. To calibrate the optimal values of max_depth and min_samples_split, use the GridSearchCV function. Use at least 5 folds for the cross-validation (cv=5) and at least 5 values for n_estimators, 5 for max_depth, and 5 for min_samples_split. As scoring, use neg_log_loss. What are the optimal values of n_estimators, max_depth, and min_samples_split?
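Point 20 follows the same pattern with an extra n_estimators dimension. The exercise's full 5x5x5 grid is slow to run, so this sketch uses a reduced grid on synthetic data.

```python
# Sketch: random forest tuned over n_estimators, max_depth, min_samples_split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

grid = {"n_estimators": [25, 50],          # reduced grid; use >= 5 values each
        "max_depth": [3, 6],
        "min_samples_split": [2, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      grid, cv=5, scoring="neg_log_loss").fit(X, y)
print(search.best_params_)
```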
21. Re-fit the best model and plot the feature importance bar plot. Plot only the top 20 variables.
22. Evaluate the Random Forest Classifier using: accuracy, precision, confusion matrix.
23. Have the top 20 features changed between points 18 and 21? If yes, why and how?
24. Compare the performance of the Decision Tree Classifier and the Random Forest Classifier. Which one has the best precision and accuracy? Why?
25. Which is the best model overall in terms of performance? Why? Could this be explained by the data structure or by the model structure? Is there room to improve the performance? If so, how?
C) Best Model Money Performance.
1. Using the best model determined in point 25, with its optimal hyperparameters, do the following:
- Use the index of y_test to select the corresponding 2019_PRICE_VAR values.
- Multiply the selected 2019_PRICE_VAR by y_pred, which is a vector of 1s and 0s.
- Sum the resulting vector; this is your portfolio price variation.
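The steps above can be sketched as follows, with hypothetical stand-ins: price_var indexed like the full dataset, y_test carrying the test-sample index, and y_pred the model's 0/1 predictions.

```python
# Sketch: portfolio price variation on the test sample.
import numpy as np
import pandas as pd

price_var = pd.Series([5.0, -2.0, 3.0, -1.0], index=[10, 11, 12, 13],
                      name="2019_PRICE_VAR")
y_test = pd.Series([1, 0, 1], index=[10, 12, 13])   # test-sample index
y_pred = np.array([1, 1, 0])                        # model's buy/no-buy signal

# Select the test rows, zero out the stocks the model did not pick, and sum.
portfolio_var = (price_var.loc[y_test.index] * y_pred).sum()
print(portfolio_var)   # -> 8.0
```

The multiplication by y_pred keeps the price variation only for stocks the model classified as 1, i.e. the stocks the strategy "buys".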
2. Is the portfolio price variation computed using the test sample positive? If so, what does it mean?