HW: How to Assess Models
In this homework, we'll be looking at a dataset of the top 500 movies by production budget -- i.e. the 500 most expensive films ever made, as found on the film data website The Numbers. Original Kaggle dataset can be found here.
Below, we read the dataset into Pandas, then normalize only the numerical columns. Check here for the documentation for sklearn preprocessing's normalize function.
Normalization + Splitting into train & test datasets
Evaluation of a Regression Model
Here, we're going to train a regression model on the numerical columns of this dataset, to try and predict the Worldwide Gross Earnings of each movie. From there, we'll use evaluation methods for regression models that we learnt in lecture!
Below, we define the predictor and prediction columns in both the train and test datasets. X refers to the predictor dataset, and Y refers to the column we're trying to predict.
Training a linear model
So we trained a model -- how can we visualize its performance on the test set?
Other than visualizing the performance on the test set, we can quantify it. As we explained in class, there are different kinds of mean error we could be looking at: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, and R-squared. We'll focus on rMSE here.
Here's a function that calculates the rMSE for you:
Training RMSE: 0.8219626331593956 Test RMSE: 0.3947338347287329
Looks like our model did better on the test set than the train set! That's great.
Evaluation of a Classification Model
Moving onto the application of error evaluation to a classification model. Here, we're going to train a classification model on this dataset, to try and predict the genre of each movie.
It looks like 42% of the movies in this dataset are Action movies. Maybe you just watched Top Gun Maverick, and you're looking for another movie in the action genre. Let's see whether we can predict whether a movie is in the action genre using this dataset.
We'll conduct logistic regression, which is a statistical model that models the probability of an event taking place. Here, the event would be if the movie in question is in the action genre.
Training a logistic regression model
We've gotten an array of predictions: True for action movies; False for non-action movies.
Accuracy is defined as the number of correct predictions / the number of total predictions.
Train accuracy: 0.5697 Test accuracy: 0.5929
But accuracy isn't everything. Let's look at a confusion matrix instead. Here's the documentation for the sklearn.metrics function.
Confusion matrix, without normalization [[54 21] [25 13]]
Looks like both our accuracy and the confusion matrix indicate that our model is pretty bad at predicting whether a movie is in the action genre. The confusion matrix, however, indicates that most of that low accuracy is driven by labels that are wrongly predicted as 'False' when they are actually 'True' -- movies that are actually action movies but are not predicted as such.