From the output of the functions above, we can see that the dataset has no missing values.
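A minimal sketch of such a check (the file name `students.csv` and DataFrame name `df` are placeholders for the actual dataset):

```python
import pandas as pd

# Placeholder path; substitute the actual dataset file.
df = pd.read_csv("students.csv")

# Count missing values per column; all zeros confirms no imputation is needed.
print(df.isnull().sum())
print(df.info())
```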
The above plots are used for bivariate and multivariate analysis. Since we are already given the features to use, there is no need for feature selection or data augmentation.
Decision trees are prone to overfitting. With no depth limit, the model above can grow until every leaf node contains a single sample, which is why it overfits: the training accuracy is 1 for every cross-validation split, while the testing accuracy is noticeably lower, a clear sign of overfitting.
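A quick illustration of this train/test gap, using synthetic data as a stand-in for the actual features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; the notebook uses its own X, y.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# With no depth or leaf-size limit, the tree grows until every leaf is pure.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # typically 1.0
print("test accuracy:", tree.score(X_test, y_test))     # noticeably lower -> overfitting
```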
By changing the criterion to entropy, the model generalizes better. Entropy usually produces a more balanced tree than Gini, though Gini is faster to compute. In addition, setting min_samples_leaf to 5 prevents the tree from creating leaf nodes with only one sample, which reduces overfitting and helps the model generalize.
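A sketch of the constrained tree, reusing the split from the previous snippet:

```python
from sklearn.tree import DecisionTreeClassifier

# Entropy criterion plus a minimum leaf size of 5 stops the tree from
# isolating single samples in their own leaves.
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5,
                              random_state=42)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```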
Logistic regression is used for classification problems, so the predicted probability is thresholded into the labels 0 and 1. Here it classifies the input vector into 0 (not continuing in the graduate program) or 1 (continuing in the graduate program). The model achieves 0.996 accuracy on the test data.
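A minimal sketch of this step, assuming the X_train/X_test split from above stands in for the actual graduate-program features:

```python
from sklearn.linear_model import LogisticRegression

# predict_proba gives the class-1 probability; predict applies the default
# 0.5 threshold to map it onto the labels 0 and 1.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = logreg.predict_proba(X_test)[:, 1]
labels = logreg.predict(X_test)
print("test accuracy:", logreg.score(X_test, y_test))
```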
A RandomForestClassifier is used to predict, from the columns 'c01', 'c02', ..., 'c10', 'academic', 'campus', and 'internship', whether a student with those features will get a placement (1) or not (0). The random forest model achieves 0.9865 accuracy on the test data.
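A sketch of this classifier; the column names follow the report, while the DataFrame name `df` and target column `placement` are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

features = [f"c{i:02d}" for i in range(1, 11)] + ["academic", "campus", "internship"]
X = df[features]
y = df["placement"]  # assumed target column name

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```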
The same random_state is used for every train/test split so that all the models are trained and tested on the same data and can be compared fairly. The SLR model built on the 'academic' feature gives the lowest RMSE and the highest R2 score, so it can be considered the best of these models: the low RMSE means it predicts most accurately, and the high R2 score means it explains more of the variance than the other two models.
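A minimal sketch of the SLR setup and the two metrics; the target column name `salary` is an assumption:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Single-feature regression on 'academic'; the shared random_state keeps
# the split identical across the compared models.
X = df[["academic"]]
y = df["salary"]  # assumed target column name
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

slr = LinearRegression().fit(X_train, y_train)
pred = slr.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R2:", r2_score(y_test, pred))
```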
The same random_state is used for every train/test split so that all the models are trained and tested on the same data and can be compared fairly. The polynomial regression model on the 'academic' feature gives the lowest RMSE and the highest R2 score, so it can be considered the best of these models: the low RMSE means it predicts most accurately, and the high R2 score means it explains more of the variance than the other two models. All the models use ridge regularization.
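A sketch of the polynomial model with ridge regularization, reusing the split from the SLR sketch; the degree and alpha are illustrative assumptions:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Polynomial expansion of 'academic' followed by a ridge-regularized fit.
poly_ridge = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
poly_ridge.fit(X_train, y_train)
pred = poly_ridge.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R2:", r2_score(y_test, pred))
```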
The MLR model gives an RMSE of 17307.67 and an R2 score of 0.86. It performs significantly better than the SLR and polynomial regression models, with a lower RMSE and a higher R2 score.
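A sketch of the multiple-feature variant under the same split; the feature list here is an assumption:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Multiple predictors instead of 'academic' alone, same random_state split.
X = df[["academic", "campus", "internship"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlr = LinearRegression().fit(X_train, y_train)
pred = mlr.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R2:", r2_score(y_test, pred))
```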
According to both the silhouette score and the elbow method, 3 is the optimal number of clusters. The elbow forms at k = 3, and for k = 3 only one cluster contains points with a negative silhouette score, fewer than for the other values of k.
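A minimal sketch of both diagnostics, where `X` stands in for the feature matrix used for clustering:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Elbow method: inertia for each k; silhouette score over the same range.
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, "inertia:", km.inertia_,
          "silhouette:", silhouette_score(X, km.labels_))
```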
Models are trained for different parameter combinations by iterating over 10 epsilon values between 1 and 20 and min_samples values from 3 to 6. The best combination was eps = 1.0 with min_samples = 3, which gave the highest silhouette score of 0.30368. A finer grid of eps and min_samples values could be searched to find even better parameters.
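A sketch of this grid search, again with `X` as the clustering feature matrix:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# 10 eps values between 1 and 20, min_samples from 3 to 6; keep the
# combination with the highest silhouette score.
best = (None, None, -1.0)
for eps in np.linspace(1, 20, 10):
    for min_samples in range(3, 7):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # Silhouette needs at least two distinct labels.
        if len(set(labels)) > 1:
            score = silhouette_score(X, labels)
            if score > best[2]:
                best = (eps, min_samples, score)
print("best eps, min_samples, silhouette:", best)
```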
With n_components=3, the three principal components explain about 47%, 31%, and 12% of the variance, respectively. Equivalently, asking PCA to retain 80% of the variance yields 3 components.
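A sketch of both ways of specifying this, with `X` as the input feature matrix:

```python
from sklearn.decomposition import PCA

# Fixed number of components: inspect the variance explained by each.
pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)  # roughly [0.47, 0.31, 0.12] per the report

# Variance-threshold form: keep enough components for 80% of the variance.
pca80 = PCA(n_components=0.80).fit(X)
print(pca80.n_components_)  # 3 components in this case
```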
Only 2 rules were found with more than 40% support and 90% confidence. The rule ['Data_Mining'] -> ['Machine_Learning'] has a confidence of 98% and a lift of 1.32: if a student takes Data Mining as an elective, it is highly probable that they will also take Machine Learning. The other rule, ['Data_Structures_and_Algorithms'] -> ['Python_for_Data_Science'], has a confidence of 100% and a lift of 1.32: if a student takes Data_Structures_and_Algorithms as an elective, it is highly probable that they will also take Python_for_Data_Science.
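A minimal sketch of mining these rules with mlxtend, assuming `baskets` is a one-hot encoded DataFrame of elective choices (one row per student, one boolean column per elective):

```python
from mlxtend.frequent_patterns import apriori, association_rules

# Frequent itemsets with at least 40% support, then rules with at least
# 90% confidence.
frequent = apriori(baskets, min_support=0.40, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.90)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```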