Rain in Australia
Import libraries
Read the data. The dataset was obtained from Kaggle.
"This dataset contains about 10 years of daily weather observations from many locations across Australia. RainTomorrow is the target variable to predict."
Explore
The data is imbalanced, but adjustment is only necessary when the imbalance is extreme (e.g., 90% for one class and 10% for the other).
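Since the Kaggle CSV isn't bundled here, a minimal sketch of the class-balance check uses a hypothetical stand-in for the RainTomorrow column (the real proportions come from the dataset itself):

```python
import pandas as pd

# Hypothetical stand-in for the RainTomorrow column; the real values
# come from the Kaggle "Rain in Australia" dataset.
rain_tomorrow = pd.Series(["No"] * 77 + ["Yes"] * 23, name="RainTomorrow")

# Normalized value counts reveal the class balance.
class_balance = rain_tomorrow.value_counts(normalize=True)
print(class_balance)  # roughly 77% "No" vs 23% "Yes": imbalanced, but not extreme
```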
Split
We split the DataFrame into X and y, then use train_test_split to obtain X_train, X_vt, y_train, and y_vt. Finally, we apply train_test_split again to divide the holdout into X_val, X_test, y_val, and y_test.
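A sketch of the two-step split on synthetic stand-in data; the 80/10/10 proportions are an assumption, since the notebook's exact split sizes aren't stated:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic feature matrix and target standing in for the weather data.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# First split: 80% train, 20% held out for validation + test (assumed sizes).
X_train, X_vt, y_train, y_vt = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Second split: divide the holdout evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_vt, y_vt, test_size=0.5, random_state=42
)
```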
Model
Baseline
We calculate the baseline accuracy score for our model.
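The baseline is the accuracy of always predicting the majority class. A sketch, using a hypothetical y_train with the class proportions from the exploration step:

```python
import pandas as pd

# Hypothetical training target; in the notebook this is the real y_train.
y_train = pd.Series(["No"] * 77 + ["Yes"] * 23)

# Baseline accuracy: the relative frequency of the majority class.
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 2))
```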
Iterate
We build the model and fit it.
Evaluate
Now we evaluate the model. First, we calculate the training accuracy, then the validation accuracy.
The training accuracy is 1.00, but the validation accuracy is 0.78. This means our model is overfitting and not generalizing well.
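A minimal sketch of this build-fit-evaluate cycle and the overfitting diagnosis, on synthetic stand-in data (noisy labels, so an unconstrained tree memorizes the training set):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data with noisy labels, standing in for the weather features.
rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(200, 5)), rng.normal(size=(50, 5))
y_train, y_val = rng.integers(0, 2, 200), rng.integers(0, 2, 50)

# An unconstrained tree keeps splitting until every leaf is pure.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

acc_train = accuracy_score(y_train, model.predict(X_train))
acc_val = accuracy_score(y_val, model.predict(X_val))
print("Training Accuracy:", acc_train)   # perfect memorization
print("Validation Accuracy:", acc_val)   # much lower: overfitting
```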
We are going to use the get_depth method on the DecisionTreeClassifier in our model to see how deep our tree grew during training.
The tree depth is 41, which gives our model too much flexibility. So we create a range of possible values for the max_depth hyperparameter of our model's DecisionTreeClassifier.
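A sketch of both steps on synthetic stand-in data; the particular range of candidate depths below is an assumption, since the notebook's exact values aren't shown:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 2, 200)

# get_depth reports how deep the unconstrained tree grew during training.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Tree Depth:", model.get_depth())

# Hypothetical grid of candidate values for max_depth.
depth_hyperparams = range(1, 16, 2)
print(list(depth_hyperparams))
```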
We create empty lists for the training and validation accuracy scores, then train a model for every max_depth in depth_hyperparams. Each time a new model is trained, the code also calculates the training and validation accuracy scores and appends them to the training_acc and validation_acc lists, respectively.
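The loop above can be sketched as follows, again with synthetic stand-in data and an assumed depth grid:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-ins for the train and validation splits.
rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(200, 5)), rng.normal(size=(50, 5))
y_train, y_val = rng.integers(0, 2, 200), rng.integers(0, 2, 50)

depth_hyperparams = range(1, 16, 2)  # assumed candidate depths
training_acc, validation_acc = [], []
for d in depth_hyperparams:
    # Train a fresh tree capped at depth d.
    test_model = DecisionTreeClassifier(max_depth=d, random_state=42)
    test_model.fit(X_train, y_train)
    # Record accuracy on both splits for this depth.
    training_acc.append(accuracy_score(y_train, test_model.predict(X_train)))
    validation_acc.append(accuracy_score(y_val, test_model.predict(X_val)))
```

Plotting training_acc and validation_acc against list(depth_hyperparams) (e.g., with matplotlib) then reveals where the validation curve peaks.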
We plot the accuracy scores, training_acc and validation_acc, against max depth.
In the plot we can see that a max depth of 5 gives the highest values for both accuracy scores, training_acc and validation_acc. So we create a model with this max depth and fit it.
Now we calculate the accuracy scores acc_train and acc_val.
Both accuracy scores are 0.84, so the overfitting problem is solved.
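A sketch of refitting with the chosen hyperparameter and re-scoring, on synthetic stand-in data (the 0.84 scores above come from the real weather data, not this sketch):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-ins for the train and validation splits.
rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(200, 5)), rng.normal(size=(50, 5))
y_train, y_val = rng.integers(0, 2, 200), rng.integers(0, 2, 50)

# Refit with the chosen max_depth and compare the two scores.
final_model = DecisionTreeClassifier(max_depth=5, random_state=42)
final_model.fit(X_train, y_train)
acc_train = accuracy_score(y_train, final_model.predict(X_train))
acc_val = accuracy_score(y_val, final_model.predict(X_val))
print("Training Accuracy:", acc_train)
print("Validation Accuracy:", acc_val)
```

Scoring on X_test with the same accuracy_score call gives the final test accuracy.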
Finally, we test the model with the test data.
The test accuracy score is 0.83, close to the validation accuracy, so the model holds up on unseen data.
Communicate
Now we use plot_tree to create a graphic that visualizes the decision logic of our model.
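plot_tree draws the tree with matplotlib; as a dependency-light sketch, sklearn.tree.export_text shows the same decision logic as text. The dataset here is a small synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small stand-in dataset; the notebook uses the weather features instead.
X, y = make_classification(
    n_samples=100, n_features=3, n_informative=2, n_redundant=0, random_state=42
)
model = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# export_text prints the split thresholds and leaf classes as indented rules.
rules = export_text(model, feature_names=["f0", "f1", "f2"])
print(rules)
```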
We obtain the features and their importances from the model.
We plot the Gini importance of each feature of the model.
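A sketch of extracting and sorting the Gini importances; the feature names below are hypothetical stand-ins for the weather columns, and the data is synthetic:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the names below are hypothetical weather columns.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
features = ["MinTemp", "MaxTemp", "Rainfall", "Humidity3pm"]
model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)

# Gini importances sum to 1; sorting makes a horizontal bar chart readable.
feat_imp = pd.Series(model.feature_importances_, index=features).sort_values()
print(feat_imp)
```

Calling feat_imp.plot(kind="barh") then produces the Gini-importance chart.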
We can see that the most important feature is Humidity3pm.