Rain in Australia

Import libraries

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer

Read the data. The dataset was obtained from Kaggle:

https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package?resource=download

"This dataset contains about 10 years of daily weather observations from many locations across Australia.

RainTomorrow is the target variable to predict."

df=pd.read_csv("/work/weatherAUS.csv")

Explore

df.info()

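# Keep only the rows where the target is known
df = df[df["RainTomorrow"].notnull()]
# Drop the columns that the model will not use
columns_drop = ["Date", "Sunshine", "Evaporation", "Cloud9am", "Cloud3pm"]
df = df.drop(columns=columns_drop)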

df.head()

df.select_dtypes("object").nunique()

# Encode "Yes" as 1 and everything else (including missing values) as 0
df["RainToday"] = df["RainToday"].apply(lambda x: 1 if x == "Yes" else 0)
df["RainTomorrow"] = df["RainTomorrow"].apply(lambda x: 1 if x == "Yes" else 0)

df["RainToday"] = df["RainToday"].astype(int)
df["RainTomorrow"] = df["RainTomorrow"].astype(int)
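One consequence of this encoding is that a missing RainToday value is treated the same as "No". The sketch below, on a made-up toy column (not the real data), contrasts it with `Series.map`, which leaves missing values as NaN so they can be handled explicitly later. The names `rain_today`, `as_lambda` and `as_map` are illustrative only.

```python
import pandas as pd

# Hypothetical toy column standing in for df["RainToday"]
rain_today = pd.Series(["Yes", "No", None, "No"])

# Encoding used in this notebook: anything that is not "Yes" becomes 0
as_lambda = rain_today.apply(lambda x: 1 if x == "Yes" else 0)

# Alternative: an explicit dictionary; values not in it (here, None) become NaN
as_map = rain_today.map({"Yes": 1, "No": 0})

print(as_lambda.tolist())
print(as_map.isna().sum())
```

Which behavior is preferable depends on whether a missing observation should count as "no rain" or be left to the imputer.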

df["RainTomorrow"].value_counts(normalize=True).plot(
    kind="bar",
    xlabel="Class",
    ylabel="Relative Frequency",
    title="Class Balance"
);

The classes are imbalanced, but not severely. Adjustments such as resampling are usually only necessary when the imbalance is extreme (e.g. 90% for one class and 10% for the other), so we leave the data as it is.
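The same check can be done numerically rather than by eye. This is a minimal sketch on made-up labels; the 78/22 split and the 0.9 cutoff are assumptions chosen only to mirror the 90/10 rule of thumb above, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical labels: 78 observations of class 0 and 22 of class 1
y_demo = pd.Series([0] * 78 + [1] * 22)

# Relative frequency of each class
shares = y_demo.value_counts(normalize=True)
majority_share = shares.max()

# Illustrative threshold only: flag the data as extremely imbalanced at 90% or more
is_extreme = majority_share >= 0.9

print(shares.to_dict())
print("Extreme imbalance:", bool(is_extreme))
```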

Split

We split the data frame into a feature matrix X and a target vector y. Then we use train_test_split to obtain X_train, X_vt, y_train and y_vt, where the "vt" part holds the data reserved for validation and testing. Finally, we use train_test_split again on the "vt" part to obtain X_val, X_test, y_val and y_test.

target = "RainTomorrow"
X = df.drop(columns=target)
y = df[target]

X.shape

y.shape

# Hold out 30% of the data for validation and testing
X_train, X_vt, y_train, y_vt = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Split the held-out 30% into validation (two thirds) and test (one third)
X_val, X_test, y_val, y_test = train_test_split(
    X_vt, y_vt, test_size=1/3, random_state=42
)

print(X.shape)
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
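With test_size=0.3 followed by test_size=1/3, the overall proportions come out at roughly 70% training, 20% validation and 10% test. The sketch below checks this on a synthetic array of 1,000 rows; the `_demo` and short variable names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 1,000 rows, one feature, alternating labels
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000) % 2

# First split: 70% train, 30% held out
X_tr, X_vt, y_tr, y_vt = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
# Second split: one third of the held-out part becomes the test set
X_va, X_te, y_va, y_te = train_test_split(
    X_vt, y_vt, test_size=1/3, random_state=42
)

for name, part in [("train", X_tr), ("val", X_va), ("test", X_te)]:
    print(name, len(part), round(len(part) / len(X_demo), 2))
```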

Model

Baseline

We calculate the baseline accuracy score for our model, that is, the accuracy we would get by always predicting the majority class.

acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 2))
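As a cross-check, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` should give the same number as the majority-class share. This is a sketch on made-up labels (78% class 0), not the notebook's data; `X_demo` is a placeholder feature the dummy model ignores.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical labels and a placeholder feature column
y_demo = pd.Series([0] * 78 + [1] * 22)
X_demo = pd.DataFrame({"x": range(len(y_demo))})

# Baseline as computed in the notebook: share of the most frequent class
baseline_share = y_demo.value_counts(normalize=True).max()

# Same baseline expressed as a model that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
dummy_acc = dummy.score(X_demo, y_demo)

print(round(baseline_share, 2), round(dummy_acc, 2))
```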

Iterate

We build the model and fit it to the training data.

# Build model
model = make_pipeline(
    OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
    SimpleImputer(),
    DecisionTreeClassifier(random_state=42)
)
# Fit model to training data
model.fit(X_train, y_train)
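The `handle_unknown='use_encoded_value'` setting matters when the validation or test data contain a category the encoder never saw during fitting. A minimal sketch, using a hypothetical `wind_dir` column with made-up values:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical column; "NW" appears only in the new data
train_demo = pd.DataFrame({"wind_dir": ["N", "S", "E"]})
new_demo = pd.DataFrame({"wind_dir": ["S", "NW"]})

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train_demo)

# Known categories get their learned code; unseen ones get -1 instead of an error
codes = enc.transform(new_demo)
print(enc.categories_[0].tolist())
print(codes.ravel().tolist())
```

Without this setting, the default `handle_unknown='error'` would raise when transforming the unseen category.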

Evaluate

Now we evaluate the model. First, we calculate the training accuracy, then the validation accuracy.

acc_train = accuracy_score(y_train, model.predict(X_train))
acc_val = model.score(X_val, y_val)
print("Training Accuracy:", round(acc_train, 2))
print("Validation Accuracy:", round(acc_val, 2))

The training accuracy is 1, but the validation accuracy is only 0.78. This gap means the model is overfitting the training data and not generalizing well.

We are going to use the get_depth method on the DecisionTreeClassifier in our model to see how deep our tree grew during training.

tree_depth = model.named_steps["decisiontreeclassifier"].get_depth()
print("Tree Depth:", tree_depth)

The tree depth is 41, which gives our model too much flexibility. So we now create a range of candidate values for the max_depth hyperparameter of our model's DecisionTreeClassifier.
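The link between unlimited depth and overfitting can be illustrated on synthetic data. The sketch below is an assumption-laden toy example (noisy labels from `make_classification`, not the weather data): an unrestricted tree memorizes the training set, while a tree capped with `max_depth` cannot.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels flipped at random to add noise
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, flip_y=0.2, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Unrestricted tree versus a tree limited to depth 3
deep = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

for name, tree in [("deep", deep), ("shallow", shallow)]:
    print(
        name,
        "depth:", tree.get_depth(),
        "train:", round(tree.score(X_tr, y_tr), 2),
        "val:", round(tree.score(X_va, y_va), 2),
    )
```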

depth_hyperparams = range(1, 50, 2)

We create empty lists for the training and validation accuracy scores, then train one model for every max_depth in depth_hyperparams. Each time a new model is trained, the code calculates its training and validation accuracy scores and appends them to the training_acc and validation_acc lists, respectively.

training_acc = []
validation_acc = []
for d in depth_hyperparams:
    # Create model with `max_depth` of `d`
    test_model = make_pipeline(
        OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
        SimpleImputer(),
        DecisionTreeClassifier(max_depth=d, random_state=42)
    )
    # Fit model to training data
    test_model.fit(X_train, y_train)
    # Calculate training accuracy score and append to `training_acc`
    training_acc.append(test_model.score(X_train, y_train))
    # Calculate validation accuracy score and append to `validation_acc`
    validation_acc.append(test_model.score(X_val, y_val))

print("Training Accuracy Scores:", training_acc[:3])
print("Validation Accuracy Scores:", validation_acc[:3])

We plot the training and validation accuracy scores against max depth.

# Plot `training_acc` and `validation_acc` against `depth_hyperparams`
plt.plot(depth_hyperparams, training_acc, label="training")
plt.plot(depth_hyperparams, validation_acc, label="validation")
plt.xlabel("Max Depth")
plt.ylabel("Accuracy Score")
plt.legend();

In the plot we can see that the validation accuracy is highest at a max depth of about 5, where the training and validation scores are still close together. Beyond that point the training accuracy keeps rising while the validation accuracy falls. So we build a model with max_depth=5 and fit it.
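The best depth can also be picked programmatically instead of being read off the plot. This is a minimal sketch with made-up validation scores over a shortened depth range; `depths_demo` and `validation_acc_demo` are hypothetical stand-ins for `depth_hyperparams` and `validation_acc`.

```python
# Hypothetical depths and validation scores (illustrative values only)
depths_demo = list(range(1, 12, 2))  # 1, 3, 5, 7, 9, 11
validation_acc_demo = [0.78, 0.83, 0.84, 0.83, 0.82, 0.80]

# Index of the highest validation score; on ties, the shallowest depth wins
best_idx = max(range(len(depths_demo)), key=lambda i: validation_acc_demo[i])
best_depth = depths_demo[best_idx]

print("Best max_depth:", best_depth)
```

Preferring the shallowest depth on ties is a design choice here: a simpler tree with the same validation score is less likely to overfit.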

# Build model
model = make_pipeline(
    OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
    SimpleImputer(),
    DecisionTreeClassifier(max_depth=5, random_state=42)
)
# Fit model to training data
model.fit(X_train, y_train)

Now we calculate the accuracy scores acc_train and acc_val.

acc_train = accuracy_score(y_train, model.predict(X_train))
acc_val = model.score(X_val, y_val)
print("Training Accuracy:", round(acc_train, 2))
print("Validation Accuracy:", round(acc_val, 2))

Both accuracy scores are 0.84. Since the training and validation scores now match, the model is no longer overfitting.

Finally, we test the model with the test data.

test_acc = model.score(X_test, y_test)
print("Test Accuracy:", round(test_acc, 2))

The test accuracy is 0.83, close to the validation accuracy, so the model generalizes well to unseen data.

Communicate

Now we use plot_tree to visualize the decision logic of our model.

# Create larger figure
fig, ax = plt.subplots(figsize=(25, 12))
# Plot tree
plot_tree(
    decision_tree=model.named_steps["decisiontreeclassifier"],
    feature_names=list(X_train.columns),  # A list works across scikit-learn versions
    filled=True,  # Color leaf with class
    rounded=True,  # Round leaf edges
    proportion=True,  # Display proportion of classes in leaf
    max_depth=3,  # Show only the top levels of the tree
    fontsize=12,  # Enlarge font
    ax=ax,  # Place in figure axis
);

We obtain the features and their importances from the model.

features = X_train.columns
importances = model.named_steps["decisiontreeclassifier"].feature_importances_
print("Features:", features[:3])
print("Importances:", importances[:3])

feat_imp = pd.Series(importances, index=features).sort_values()
feat_imp

We plot the Gini importance of each feature.

feat_imp.plot(kind="barh")
plt.xlabel("Gini Importance")
plt.ylabel("Feature");

We can see that the most important feature is Humidity3pm.
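When reading such a ranking, it helps to remember that scikit-learn normalizes a tree's `feature_importances_` so that they sum to 1, which means each value is a share of the total impurity reduction. The sketch below shows this on synthetic data with hypothetical feature names `f0` to `f4`; it does not use the weather dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with five hypothetical features
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# Same construction as feat_imp in the notebook: sorted ascending
feat_imp_demo = pd.Series(
    tree.feature_importances_, index=X_demo.columns
).sort_values()

# After an ascending sort, the most important feature is the last entry
top_feature = feat_imp_demo.index[-1]
print("Sum of importances:", round(feat_imp_demo.sum(), 6))
print("Most important feature:", top_feature)
```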