# HW: How to Assess Models

In this homework, we'll look at a dataset of the top 500 movies by production budget -- i.e. the 500 most expensive films ever made, as listed on the film data website The Numbers. The original Kaggle dataset can be found here.

## Set-up

Below, we read the dataset into Pandas, then normalize only the numerical columns. See the documentation for the normalize function in sklearn.preprocessing.

[Output: the first five rows of the dataframe (indices 0 to 4), dated 2019-04-23, 2011-05-20, 2015-04-22, 2015-12-16, and 2018-04-25.]
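As a rough illustration of normalizing only the numerical columns, here is a minimal sketch. The dataframe and its column names (`title`, `budget`, `gross`) are hypothetical stand-ins, not the real dataset, and the choice of `axis=0` is an assumption: `normalize` scales each row by default (`axis=1`), so the axis should be set deliberately.

```python
import pandas as pd
from sklearn.preprocessing import normalize

# Hypothetical stand-in for the movie data; the column names are illustrative only.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [3.0, 4.0, 0.0],
    "gross": [0.0, 3.0, 4.0],
})

# Select only the numerical columns, leaving text columns untouched.
num_cols = df.select_dtypes(include="number").columns

# axis=0 rescales each column to unit L2 norm.
# The default, axis=1, would rescale each row instead.
df[num_cols] = normalize(df[num_cols], axis=0)

print(df)
```

With `axis=0`, every numerical column ends up with a Euclidean length of 1, while the `title` column is left as it was.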

## Normalization + Splitting into train & test datasets

Split the dataframe df into training and test sets using train_test_split. If you forgot how, check out the documentation! Fill in the blank below.

[Output: five rows of the shuffled split, with original indices 414, 116, 470, 263, and 146, dated 2002-11-15, 2017-07-20, 2019-11-14, 2005-07-29, and 2014-07-09.]
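For reference, a minimal sketch of how train_test_split behaves on a dataframe. The toy data, the `test_size=0.2` fraction, and `random_state=0` are assumptions chosen for illustration; the homework may use different values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe with ten rows; the columns are illustrative only.
df = pd.DataFrame({"budget": range(10), "gross": range(10, 20)})

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable.
train, test = train_test_split(df, test_size=0.2, random_state=0)

print(len(train), len(test))
print(train.index.tolist())  # original row indices, now in shuffled order
```

The rows keep their original index labels after the split, which is why a split dataframe shows indices out of order.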

## Evaluation of a Regression Model

Here, we're going to train a regression model on the numerical columns of this dataset, to try and predict the Worldwide Gross Earnings of each movie. From there, we'll use evaluation methods for regression models that we learnt in lecture!

Below, we define the predictor and target columns in both the train and test datasets. X refers to the predictor dataset, and Y refers to the column we're trying to predict.

[Output: five rows, indices 414, 116, 470, 263, and 146, with values 0.3641550787134771, 0.14228890860322524, 0.13272159458211324, 0.5943802139205936, and 0.5957906382445246.]
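Separating predictors from the target could look like the sketch below. The column names (`budget`, `runtime`, `worldwide_gross`) are hypothetical placeholders; substitute the actual numerical columns of the dataset.

```python
import pandas as pd

# Toy training split; the column names are hypothetical placeholders.
train = pd.DataFrame({
    "budget": [0.2, 0.5, 0.9],
    "runtime": [0.4, 0.3, 0.7],
    "worldwide_gross": [0.1, 0.6, 0.8],
})

predictors = ["budget", "runtime"]  # assumed predictor columns
target = "worldwide_gross"          # assumed target column

X_train = train[predictors]  # dataframe of predictors
Y_train = train[target]      # series holding the value to predict

print(X_train.shape, Y_train.shape)
```

The same two selections would be repeated on the test split to get X_test and Y_test.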

### Training a linear model

So we trained a model -- how can we visualize its performance on the test set?

Predict the Y values based on the train and test predictor sets (called X). Fill in the blanks below.
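As a hedged sketch of the fit-then-predict pattern, the example below trains sklearn's LinearRegression on toy data that follows y = 2x + 1 exactly. The data is invented for illustration; only the variable names mirror the ones used in this homework.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on the line y = 2x + 1.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
Y_train = 2 * X_train.ravel() + 1
X_test = np.array([[4.0], [5.0]])

# Fit on the training data only.
model = LinearRegression().fit(X_train, Y_train)

# Predict for both sets so that train and test error can be compared later.
Y_train_pred = model.predict(X_train)
Y_test_pred = model.predict(X_test)

print(Y_test_pred)
```

The model is never fitted on the test set; it only makes predictions for it.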

### Evaluation Metrics

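Besides visualizing the performance on the test set, we can quantify it. As we explained in class, there are several metrics we could look at: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared. The first three are error measures, where lower is better; R-squared instead measures the proportion of variance explained, where higher is better. We'll focus on RMSE here, which is the square root of the average squared difference between the predicted and actual values.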

Here's a function that calculates the RMSE for you:
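One way such a function could look, as a minimal sketch. The name `rmse` and its argument names are assumptions, not necessarily those used in the homework.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Square each error, average the squares, then take the square root.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Every prediction is off by exactly 2, so the RMSE is 2.0.
example = rmse([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0])
print(example)
```

Because it only iterates over the values, this version accepts lists, NumPy arrays, and Pandas series alike.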

Now use the function above to calculate the RMSE for the train and test sets. Fill in the blanks below. Use:

Y_train and Y_train_pred

Y_test and Y_test_pred

```
Training RMSE: 0.8219626331593956
Test RMSE: 0.3947338347287329
```

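Looks like our model did better on the test set than the train set: lower RMSE means smaller errors. That's encouraging, though a test error this far below the training error is unusual and may reflect the particular split rather than the model itself.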

## Evaluation of a Classification Model

Next, we apply error evaluation to a classification model. Here, we're going to train a classification model on this dataset, to try and predict the genre of each movie.

It looks like 42% of the movies in this dataset are Action movies. Maybe you just watched Top Gun Maverick, and you're looking for another movie in the action genre. Let's see whether we can predict whether a movie is in the action genre using this dataset.

We'll use logistic regression, a statistical model that estimates the probability of an event taking place. Here, the event is that the movie in question is in the action genre.

### Training a logistic regression model

What genre are we predicting? Fill in the blanks below.

[Output: five rows, indices 414, 116, 470, 263, and 146, with values 0.8863119988913408, 0.9185581546153849, 0.9109333545099008, 0.7604144454905883, and 0.7241572989888352.]

We've gotten an array of predictions: True for action movies; False for non-action movies.
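A minimal sketch of the whole pattern, from turning a genre label into a True/False target to getting boolean predictions from sklearn's LogisticRegression. The single toy feature and the genre values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one numerical feature and a genre label per movie (illustrative only).
X_train = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
genres = np.array(["Drama", "Comedy", "Drama", "Action", "Action", "Action"])

# The target is a boolean array: True for action movies, False otherwise.
Y_train = genres == "Action"

clf = LogisticRegression().fit(X_train, Y_train)

X_test = np.array([[-2.5], [2.5]])
Y_test_pred = clf.predict(X_test)           # boolean predictions
action_prob = clf.predict_proba(X_test)[:, 1]  # estimated probability of "action"

print(Y_test_pred, action_prob)
```

`predict` thresholds the estimated probability at 0.5, which is how the probabilities become True/False labels.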

### Evaluation Metrics

Accuracy is defined as the number of correct predictions / the number of total predictions.

Check whether the predictions for the X train/test sets match the original Y values.
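The definition above translates directly into code. This is a sketch with a hypothetical helper name, `accuracy`; sklearn also provides `accuracy_score` in sklearn.metrics for the same purpose.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Count the positions where the prediction equals the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Two of the four predictions match the true labels.
example = accuracy([True, False, True, True], [True, True, True, False])
print(example)
```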

```
Train accuracy: 0.5697
Test accuracy: 0.5929
```

But accuracy isn't everything. Let's look at a confusion matrix instead. Here's the documentation for the sklearn.metrics function.

```
Confusion matrix, without normalization
[[54 21]
[25 13]]
```

From the confusion matrix above, what is the number of false negatives?

54

Your answer:

Looks like both our accuracy and the confusion matrix indicate that our model is pretty bad at predicting whether a movie is in the action genre. Assuming sklearn's layout (rows are true labels, columns are predicted labels), the confusion matrix shows that slightly more than half of the errors (25 of 46) are labels wrongly predicted as 'False' when they are actually 'True' -- movies that are actually action movies but are not predicted as such. Put differently, the model catches only 13 of the 38 actual action movies.
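The counts in the confusion matrix can be turned into rates with a few lines of arithmetic. The sketch below assumes sklearn's convention, with rows as true labels and columns as predicted labels, ordered False then True; under a different layout the four cells would be read differently.

```python
# Confusion matrix from the output above, assuming rows = true labels,
# columns = predicted labels, each ordered (False, True).
cm = [[54, 21],
      [25, 13]]

(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tn + tp) / total  # share of all predictions that are correct
recall = tp / (tp + fn)       # share of actual action movies that were found
precision = tp / (tp + fp)    # share of predicted action movies that are right

print(f"total={total} accuracy={accuracy:.4f} "
      f"recall={recall:.4f} precision={precision:.4f}")
```

The accuracy computed this way agrees with the test accuracy reported earlier, and the low recall shows how many action movies the model misses.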