HW: How to Assess Models

In this homework, we'll be looking at a dataset of the top 500 movies by production budget -- i.e. the 500 most expensive films ever made, as found on the film data website The Numbers. The original Kaggle dataset can be found here.

Set-up

# Importing libraries
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics, preprocessing

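Below, we read the dataset into Pandas, then normalize only the numerical columns. Note that sklearn preprocessing's normalize function rescales each row (sample) to unit norm by default, rather than each column. Check here for the documentation for sklearn preprocessing's normalize function.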

# Read dataset into Pandas
df = pd.read_csv("/work/top-500-movies.csv")

# Drop NaNs
df = df.dropna()

# Normalize only the numerical columns
# (by default, normalize rescales each row to unit L2 norm)
numerical_columns = ['worldwide_gross', 'production_cost', 'domestic_gross',
                     'opening_weekend', 'theaters', 'runtime']
df_normalized = preprocessing.normalize(df[numerical_columns])

# Note: this new DataFrame gets a fresh 0..n-1 index, which may not match
# df's index after dropna(); passing index=df.index would keep rows aligned
df_normalized = pd.DataFrame(df_normalized)
df_normalized.columns = numerical_columns

# Join normalized columns to original dataset for ease of reference
df = df.join(df_normalized, rsuffix='_normalized')

# Drop NaNs again (rows with no matching index label in the join)
df = df.dropna()
df.head()
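Since normalize works row by row by default, it can help to compare it with column-wise scaling on a tiny table. The sketch below is only an illustration: the DataFrame `toy` and its values are made up, not taken from the movie dataset. It also passes `index=` when rebuilding the DataFrame, which is one way (an assumption about what is wanted here) to keep rows aligned in a later join.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data; the index has gaps, as it might after dropna()
toy = pd.DataFrame({"a": [3.0, 6.0, 0.0], "b": [4.0, 8.0, 5.0]}, index=[0, 2, 5])

# Default behaviour (axis=1): each ROW is rescaled to unit L2 norm
row_scaled = pd.DataFrame(
    preprocessing.normalize(toy), columns=toy.columns, index=toy.index
)

# Alternative (axis=0): each COLUMN is rescaled to unit L2 norm
col_scaled = pd.DataFrame(
    preprocessing.normalize(toy, axis=0), columns=toy.columns, index=toy.index
)

# Because the index was preserved, the join lines up row for row
joined = toy.join(row_scaled, rsuffix="_normalized")

print(row_scaled)
print(col_scaled)
print(joined)
```

With row-wise scaling, the rows (3, 4) and (6, 8) end up identical, so differences in overall magnitude between movies would be lost. Whether that is desirable depends on the question being asked.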

Normalization + Splitting into train & test datasets

Split the dataframe df into training and test sets using train_test_split. If you forgot how, check out the documentation! Fill in the blank below.

# FILL IN THE BLANK
training_data, test_data = train_test_split(df, test_size=0.2, random_state=42)
training_data.head()
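To get a feel for what train_test_split returns, here is a minimal sketch on a made-up ten-row DataFrame (the name `toy` and its columns are hypothetical). It checks the 80/20 sizes and shows that fixing `random_state` makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row dataset
toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# Two splits with the same seed
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))          # sizes of the two pieces
print(test_a.index.equals(test_b.index))  # same seed, same rows
print(set(train_a.index) & set(test_a.index))  # rows shared by both pieces
```

Every row lands in exactly one of the two pieces, which is what lets the test set act as unseen data.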

Evaluation of a Regression Model

Here, we're going to train a regression model on the numerical columns of this dataset, to try and predict the Worldwide Gross Earnings of each movie. From there, we'll use evaluation methods for regression models that we learnt in lecture!

Below, we define the predictor and prediction columns in both the train and test datasets. X refers to the predictor dataset, and Y refers to the column we're trying to predict.

# Defining predictor and prediction for training dataset
X_train = training_data[['rank', 'production_cost_normalized', 'domestic_gross_normalized',
                         'opening_weekend_normalized', 'theaters_normalized', 'runtime_normalized']]
Y_train = training_data['worldwide_gross_normalized']

# Defining predictor and prediction for test dataset
X_test = test_data[['rank', 'production_cost_normalized', 'domestic_gross_normalized',
                    'opening_weekend_normalized', 'theaters_normalized', 'runtime_normalized']]
Y_test = test_data['worldwide_gross_normalized']

X_train.head()

Training a linear model

import sklearn.linear_model as lm

linear_model = lm.LinearRegression()

# Fit linear model
linear_model.fit(X = X_train, y = Y_train)
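As a quick sanity check of what fitting does, the sketch below fits LinearRegression on made-up data that follows y = 2x + 1 exactly. The names `X_toy`, `y_toy`, and `toy_model` are hypothetical. On data like this, the fitted slope and intercept should recover the values used to build it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 2*x + 1 with no noise
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy[:, 0] + 1.0

toy_model = LinearRegression().fit(X_toy, y_toy)

# Learned slope(s) and intercept
print(toy_model.coef_, toy_model.intercept_)

# Prediction for an unseen input
new_prediction = toy_model.predict(np.array([[4.0]]))
print(new_prediction)
```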

So we trained a model -- how can we visualize its performance on the test set?

Predict the Y values based on the train and test predictor sets (called X). Fill in the blanks below.

# FILL IN THE BLANKS
# Predict worldwide_gross on the train set
Y_train_pred = linear_model.predict(X_train)

# Predict worldwide_gross on the test set
Y_test_pred = linear_model.predict(X_test)

# Plot predicted vs true earnings
plt.scatter(Y_test, Y_test_pred, alpha=0.5)
plt.xlabel("Worldwide Gross Earnings")
plt.ylabel("Predicted Earnings")
plt.title("True Earnings vs Predicted Earnings");

Evaluation Metrics

Other than visualizing the performance on the test set, we can quantify it. As we explained in class, there are different metrics we could be looking at: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, and R-squared. The first three measure the average size of the errors, while R-squared measures the share of variance in the target that the model explains. We'll focus on rMSE here.
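For reference, all four metrics can be computed with sklearn.metrics and NumPy. The sketch below uses two small made-up arrays (`y_true` and `y_pred` are hypothetical, not model output from this homework), so each number can be checked by hand.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true values and predictions; errors are 1, 0, -2, -1
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

mae = metrics.mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = metrics.mean_squared_error(y_true, y_pred)   # mean of error squared
rmse_value = np.sqrt(mse)                          # back in the units of y
r2 = metrics.r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse_value)
print("R-squared:", r2)
```

Here the absolute errors are 1, 0, 2, 1, so the MAE is 1.0, and the squared errors are 1, 0, 4, 1, so the MSE is 1.5. RMSE penalizes the single error of size 2 more heavily than MAE does.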

Here's a function that calculates the rMSE for you:

def rmse(actual_y, predicted_y):
    """
    Args:
        actual_y: an array of the ground-truth labels
        predicted_y: an array of the predictions from the model
    Returns:
        The root mean squared error between the predictions and the ground truth
    """
    # Average the squared errors first, then take the square root
    mean_sq = np.mean((actual_y - predicted_y) ** 2)
    return np.sqrt(mean_sq)

Now use the function above to calculate the rMSE for the train and test sets. Fill in the blanks below. Use:

Y_train and Y_train_pred

Y_test and Y_test_pred

# FILL IN THE BLANKS
train_error = rmse(Y_train, Y_train_pred)
test_error = rmse(Y_test, Y_test_pred)

print("Training RMSE:", train_error)
print("Test RMSE:", test_error)

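In the original run, the model appeared to do better on the test set than on the train set. Treat that comparison with care: it is only meaningful when both errors are averaged over the number of samples, because a total of squared errors grows with dataset size and the training set here is four times the size of the test set. A test error close to the training error suggests the model generalizes reasonably well.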
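To see why averaging matters when comparing train and test errors, here is a small sketch with made-up data in which every prediction is off by exactly 0.5. The helper names `root_sum_sq` and `root_mean_sq` are hypothetical, and the sizes 400 and 100 just mimic an 80/20 split.

```python
import numpy as np

def root_sum_sq(actual_y, predicted_y):
    """Square root of the TOTAL squared error (grows with sample size)."""
    return np.sqrt(np.sum((actual_y - predicted_y) ** 2))

def root_mean_sq(actual_y, predicted_y):
    """Square root of the MEAN squared error (independent of sample size)."""
    return np.sqrt(np.mean((actual_y - predicted_y) ** 2))

# Hypothetical "train" and "test" sets where every prediction is off by 0.5
big_true, big_pred = np.zeros(400), np.full(400, 0.5)
small_true, small_pred = np.zeros(100), np.full(100, 0.5)

print(root_sum_sq(big_true, big_pred), root_sum_sq(small_true, small_pred))
print(root_mean_sq(big_true, big_pred), root_mean_sq(small_true, small_pred))
```

The model is equally wrong on both sets, yet the sum-based value is twice as large on the bigger set. Only the mean-based value gives a fair comparison.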

Evaluation of a Classification Model

Next, we move on to applying error evaluation to a classification model. Here, we're going to train a classification model on this dataset, to try and predict the genre of each movie.

It looks like 42% of the movies in this dataset are Action movies. Maybe you just watched Top Gun Maverick, and you're looking for another movie in the action genre. Let's see whether we can predict whether a movie is in the action genre using this dataset.

We'll use logistic regression, a statistical model that estimates the probability of an event taking place. Here, the event is the movie in question being in the action genre.
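Under the hood, logistic regression turns a weighted sum of the features into a probability using the sigmoid function, then applies a threshold (0.5 by default in sklearn's predict for a binary problem). The sketch below illustrates that idea only: the weights, intercept, and feature values are made up, not fitted to the movie data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and intercept (a fitted model would learn these)
weights = np.array([0.8, -0.5])
intercept = 0.1

# Two hypothetical movies, each described by two features
features = np.array([[1.0, 2.0],
                     [3.0, 0.5]])

scores = features @ weights + intercept  # linear part
probabilities = sigmoid(scores)          # probability of "Action"
predictions = probabilities >= 0.5       # True means predicted "Action"

print(probabilities)
print(predictions)
```

A score of 0 maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class.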

Training a logistic regression model

What genre are we predicting? Fill in the blanks below.

# FILL IN THE BLANKS
# Defining predictor and prediction for training dataset
X_train2 = training_data[['rank', 'worldwide_gross_normalized', 'production_cost_normalized',
                          'domestic_gross_normalized', 'opening_weekend_normalized',
                          'theaters_normalized', 'runtime_normalized']]
Y_train2 = training_data['genre'] == 'Action'

# Defining predictor and prediction for test dataset
X_test2 = test_data[['rank', 'worldwide_gross_normalized', 'production_cost_normalized',
                     'domestic_gross_normalized', 'opening_weekend_normalized',
                     'theaters_normalized', 'runtime_normalized']]
Y_test2 = test_data['genre'] == 'Action'

X_train2.head()

# Training the logistic regression model
lr = sklearn.linear_model.LogisticRegression(fit_intercept=True, solver = 'lbfgs')
lr.fit(X_train2,Y_train2)
lr.predict(X_test2)

We've gotten an array of predictions: True for action movies; False for non-action movies.

Evaluation Metrics

Accuracy is defined as the number of correct predictions / the number of total predictions.

Check whether the model's predictions on the train and test predictor sets match the original Y values.

# FILL IN THE BLANKS
train_accuracy = np.sum(lr.predict(X_train2) == Y_train2) / len(X_train2)
test_accuracy = np.sum(lr.predict(X_test2) == Y_test2) / len(X_test2)

print(f"Train accuracy: {train_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
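To judge whether an accuracy value is any good, it helps to compare it with a trivial baseline that always predicts the most common class. The sketch below uses sklearn's DummyClassifier on made-up labels where 42 of 100 are positive. That share mirrors the 42% figure mentioned earlier, but the arrays themselves are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 42 "Action" (True) and 58 "not Action" (False)
y = np.array([True] * 42 + [False] * 58)
X = np.zeros((100, 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

A model whose accuracy is not clearly above this baseline has learned little beyond the class balance.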

But accuracy isn't everything. Let's look at a confusion matrix instead. Here's the documentation for the sklearn.metrics function.

cm = sklearn.metrics.confusion_matrix(Y_test2, lr.predict(X_test2))

# Define a function to plot a confusion matrix
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    import itertools
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

class_names = ['False', 'True']

# Plot confusion matrix
plt.figure()
plt.grid(False)
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')

From the confusion matrix above, what is the number of false negatives?

Your answer: 37
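If reading the plot feels error-prone, the four counts can also be pulled straight out of the matrix. In sklearn's confusion_matrix, rows are true labels and columns are predicted labels, in sorted label order (False, then True), so false negatives sit at row 1, column 0. The labels in the sketch below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = np.array([True, True, True, False, False, True])
y_pred = np.array([True, False, False, False, True, True])

toy_cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() flattens the matrix row by row
tn, fp, fn, tp = toy_cm.ravel()

print(toy_cm)
print("False negatives:", fn)
```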

Looks like both our accuracy and the confusion matrix indicate that our model is pretty bad at predicting whether a movie is in the action genre. The confusion matrix, however, indicates that most of that low accuracy is driven by labels that are wrongly predicted as 'False' when they are actually 'True' -- movies that are actually action movies but are not predicted as such.
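One way to put a number on "action movies that are missed" is recall, the share of true positives among all actual positives. Precision is the share of true positives among all predicted positives. The sketch below computes both with sklearn.metrics on made-up labels chosen to contain many false negatives; it is an illustration, not the result for this model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 4 actual positives, only 1 of them is found
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"Precision: {precision:.2f}")
```

In this toy case accuracy is 0.6 while recall is only 0.25, which shows how false negatives can hide behind a moderate accuracy figure.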
