Rain in Australia
Import libraries
Read the data. The data was obtained from Kaggle.
"This dataset contains about 10 years of daily weather observations from many locations across Australia.
RainTomorrow is the target variable to predict."
Explore
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 145460 non-null object
1 Location 145460 non-null object
2 MinTemp 143975 non-null float64
3 MaxTemp 144199 non-null float64
4 Rainfall 142199 non-null float64
5 Evaporation 82670 non-null float64
6 Sunshine 75625 non-null float64
7 WindGustDir 135134 non-null object
8 WindGustSpeed 135197 non-null float64
9 WindDir9am 134894 non-null object
10 WindDir3pm 141232 non-null object
11 WindSpeed9am 143693 non-null float64
12 WindSpeed3pm 142398 non-null float64
13 Humidity9am 142806 non-null float64
14 Humidity3pm 140953 non-null float64
15 Pressure9am 130395 non-null float64
16 Pressure3pm 130432 non-null float64
17 Cloud9am 89572 non-null float64
18 Cloud3pm 86102 non-null float64
19 Temp9am 143693 non-null float64
20 Temp3pm 141851 non-null float64
21 RainToday 142199 non-null object
22 RainTomorrow 142193 non-null object
dtypes: float64(16), object(7)
memory usage: 25.5+ MB
  Location  MinTemp
0   Albury     13.4
1   Albury      7.4
2   Albury     12.9
3   Albury      9.2
4   Albury     17.5
The target classes are imbalanced. However, rebalancing is only necessary when the imbalance is extreme (e.g. 90% for one class and 10% for the other), so no adjustment is made here.
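One way to check the class balance is to look at the relative frequency of each value in the target column. The snippet below is a minimal sketch on a toy data frame: the 78/22 split is made up for illustration and only the column name RainTomorrow comes from the dataset.

```python
import pandas as pd

# Toy stand-in for the target column (illustrative values, not the real data).
df = pd.DataFrame({"RainTomorrow": ["No"] * 78 + ["Yes"] * 22})

# Relative frequency of each class, most frequent first.
class_balance = df["RainTomorrow"].value_counts(normalize=True)
print(class_balance)
```

If the largest share were close to 0.9 or above, resampling or class weights would be worth considering.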
Split
We split the data frame into a feature matrix X and a target vector y, then use train_test_split to obtain X_train, X_vt, y_train and y_vt. Finally, we use train_test_split again on X_vt and y_vt to obtain X_val, X_test, y_val and y_test.
(142193, 17)
(99535, 17)
(28438, 17)
(14220, 17)
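The shapes above are consistent with roughly 70% of the rows going to training, 20% to one held-out set and 10% to the other. A two-step split along those lines could look like the sketch below. The data is synthetic, and the test_size values, the random_state and the assignment of the larger held-out part to validation are assumptions, not taken from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather data frame (illustrative only).
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "MinTemp": rng.normal(12, 6, n),
    "Humidity3pm": rng.uniform(0, 100, n),
    "RainTomorrow": rng.choice(["No", "Yes"], size=n, p=[0.78, 0.22]),
})

target = "RainTomorrow"
X = df.drop(columns=target)
y = df[target]

# First split: 70% training, 30% held out (assumed ratio).
X_train, X_vt, y_train, y_vt = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Second split: one third of the held-out rows for testing, the rest for validation.
X_val, X_test, y_val, y_test = train_test_split(
    X_vt, y_vt, test_size=1 / 3, random_state=42
)

print(X.shape, X_train.shape, X_val.shape, X_test.shape)
```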
Model
Baseline
We calculate the baseline accuracy score for our model.
Baseline Accuracy: 0.78
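A common choice of baseline for classification is the accuracy obtained by always predicting the majority class. Assuming that is what was done here, it can be sketched as follows. The y_train series is a toy example, not the real training target.

```python
import pandas as pd

# Toy training target (illustrative values only).
y_train = pd.Series(["No"] * 78 + ["Yes"] * 22, name="RainTomorrow")

# Share of the most frequent class = accuracy of always predicting it.
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 2))
```

Any trained model should beat this number on the validation set to be considered useful.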
Iterate
We build the model and fit it to the training data.
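Because the features include a categorical column (Location) and columns with missing values, the classifier needs some preprocessing in front of it. The sketch below shows one possible arrangement with a scikit-learn Pipeline. The choice of OrdinalEncoder and mean imputation, the step names and the synthetic data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data with one categorical column and some missing values.
rng = np.random.default_rng(0)
n = 500
min_temp = rng.normal(12, 6, n)
min_temp[::25] = np.nan  # inject a few missing values
X_train = pd.DataFrame({
    "Location": rng.choice(["Albury", "Sydney", "Perth"], size=n),
    "MinTemp": min_temp,
    "Humidity3pm": rng.uniform(0, 100, n),
})
y_train = pd.Series(
    np.where(X_train["Humidity3pm"] + rng.normal(0, 15, n) > 70, "Yes", "No")
)

# Encode the categorical column and impute the numeric ones (assumed choices).
preprocess = ColumnTransformer([
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     ["Location"]),
    ("num", SimpleImputer(strategy="mean"), ["MinTemp", "Humidity3pm"]),
])

# Unrestricted decision tree on top of the preprocessing step.
model = Pipeline([
    ("preprocess", preprocess),
    ("decisiontreeclassifier", DecisionTreeClassifier(random_state=42)),
])
model.fit(X_train, y_train)
```

Keeping the preprocessing inside the pipeline means the same transformations are applied automatically to the validation and test sets.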
Evaluate
Now we evaluate the model. First, we calculate the training accuracy, then the validation accuracy.
Training Accuracy: 1.0
Validation Accuracy: 0.78
The training accuracy is 1.0, but the validation accuracy is only 0.78, which is no better than the baseline. This means our model is overfitting the training data and not generalizing well.
We are going to use the get_depth method on the DecisionTreeClassifier in our model to see how deep our tree grew during training.
Tree Depth: 41
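The evaluation and depth check can be sketched as below. This uses a bare DecisionTreeClassifier on noisy synthetic data, so the numbers will differ from those above. With a pipeline, the tree would first be retrieved from the pipeline (for example through named_steps) before calling get_depth.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data so that an unrestricted tree can overfit.
X, y = make_classification(
    n_samples=2000, n_features=10, flip_y=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# No max_depth: the tree grows until every training leaf is pure.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

acc_train = accuracy_score(y_train, model.predict(X_train))
acc_val = accuracy_score(y_val, model.predict(X_val))
tree_depth = model.get_depth()

print("Training Accuracy:", round(acc_train, 2))
print("Validation Accuracy:", round(acc_val, 2))
print("Tree Depth:", tree_depth)
```

A large gap between the two accuracies together with a deep tree is the typical signature of overfitting.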
The tree depth is 41, which gives our model too much flexibility. So we create a range of possible values for the max_depth hyperparameter of our model's DecisionTreeClassifier.
We create empty lists for the training and validation accuracy scores, then train a model for every max_depth in depth_hyperparams. Every time a new model is trained, the code also calculates the training and validation accuracy scores and appends them to the training_acc and validation_acc lists, respectively.
Training Accuracy Scores: [0.8158738132315266, 0.8292359471542673, 0.8390515898930024]
Validation Accuracy Scores: [0.8144032632393277, 0.8288909205991982, 0.8376468106055278]
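The loop described above might look like the following sketch. The names depth_hyperparams, training_acc and validation_acc come from the text. The range of depths, the synthetic data and the use of a bare DecisionTreeClassifier are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (illustrative only).
X, y = make_classification(
    n_samples=2000, n_features=10, flip_y=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

depth_hyperparams = range(1, 16, 2)  # assumed range of candidate depths
training_acc = []
validation_acc = []

for d in depth_hyperparams:
    # Train one tree per candidate depth.
    test_model = DecisionTreeClassifier(max_depth=d, random_state=42)
    test_model.fit(X_train, y_train)
    # score() returns mean accuracy for classifiers.
    training_acc.append(test_model.score(X_train, y_train))
    validation_acc.append(test_model.score(X_val, y_val))

# Depth with the highest validation accuracy.
best_depth = depth_hyperparams[int(np.argmax(validation_acc))]
print("Training Accuracy Scores:", training_acc[:3])
print("Validation Accuracy Scores:", validation_acc[:3])
print("Best max_depth:", best_depth)
```

Selecting the depth on validation accuracy rather than training accuracy keeps the test set untouched until the final check.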
We plot the accuracy scores, training_acc and validation_acc, against max depth.
In the plot we can see that at a max depth of 5 both training_acc and validation_acc reach their highest values. So we create a model with this max depth and fit it.
Now we calculate the accuracy scores acc_train and acc_val.
Training Accuracy: 0.84
Validation Accuracy: 0.84
The accuracy scores are 0.84 for both, so the overfitting problem is solved.
Finally, we test the model with the test data.
Test Accuracy: 0.83
The test accuracy score is 0.83, in line with the validation accuracy, so the model generalizes well to unseen data.
Communicate
Now we use plot_tree to visualize the decision logic of our model.
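A call to plot_tree could look roughly like this. The sketch fits a small tree on synthetic data with placeholder feature names. The figure size, the class labels and the choice to draw only the top levels are assumptions made for readability.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic data with placeholder feature names (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
feature_names = ["f0", "f1", "f2", "f3"]
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

fig, ax = plt.subplots(figsize=(16, 8))
annotations = plot_tree(
    model,
    feature_names=feature_names,
    class_names=["No", "Yes"],  # labels for classes 0 and 1
    filled=True,                # color nodes by majority class
    rounded=True,
    max_depth=2,                # draw only the top levels
    ax=ax,
)
```

Limiting the drawn depth keeps the figure legible. The fitted tree itself is unchanged.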
We obtain the features and the importances of the model.
Features: Index(['Location', 'MinTemp', 'MaxTemp'], dtype='object')
Importances: [0.0018523 0.00612647 0. ]
We plot the Gini importance of each feature of the model.
We can see that the most important feature is Humidity3pm.
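To rank features by importance, the importances array can be paired with the feature names in a pandas Series and sorted. The following is a minimal sketch on synthetic data. The feature names f0 to f3 are placeholders, and feat_imp is a hypothetical variable name.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder feature names (illustrative only).
X_arr, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
X = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])
model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)

features = X.columns
importances = model.feature_importances_  # Gini importances, normalized to sum to 1

# Pair names with importances and sort so the most important feature comes last.
feat_imp = pd.Series(importances, index=features).sort_values()
print(feat_imp)
```

Sorting in ascending order is convenient for a horizontal bar chart, where the last entry is drawn at the top.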