⚠️UNDER CONSTRUCTION⚠️
Data Cleaning
Categorical Columns
To enhance our prediction accuracy, we incorporated three categorical columns: belongs_to_collection, genres, and original_language. Since each movie can belong to at most one collection, we left that column as is for models capable of handling categorical values natively, and used one-hot encoding for the other models. The same technique was employed to encode the genres and language columns.
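A minimal sketch of the one-hot encoding step (the DataFrame name `movies` and the sample rows are illustrative; in the real data, genres can hold multiple values per movie and would need to be split out first):

```python
import pandas as pd

# Illustrative stand-in for the loaded movie data.
movies = pd.DataFrame({
    "belongs_to_collection": ["Toy Story Collection", None, None],
    "genres": ["Animation", "Comedy", "Drama"],
    "original_language": ["en", "en", "fr"],
})

# One-hot encode the categorical columns for models that cannot
# handle categorical values natively.
encoded = pd.get_dummies(
    movies,
    columns=["belongs_to_collection", "genres", "original_language"],
)
print(encoded.head())
```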
Numerical Columns
For the XGBoost model, since we chose gbtree as our booster, normalizing or rescaling the numerical features would not significantly affect the results: tree-based boosters split on feature thresholds, so any monotonic rescaling leaves the splits unchanged. (However, if we had chosen booster=gblinear, we would need to normalize the features to achieve better results.)
When it comes to deep learning models, normalization is a must, especially since we are dealing with both very large values and relatively small ones. For instance, revenue may be in the millions, vote averages run from 0 to 10, and the ratings we are predicting range only from 0 to 1. To normalize the data, I used the MinMaxScaler from the sklearn.preprocessing package.
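A sketch of the scaling step, with placeholder arrays standing in for our numerical features:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder matrices for our numerical features
# (e.g. revenue, vote_average, ...).
X_train = np.array([[1_000_000.0, 7.2], [250_000_000.0, 6.5], [30_000.0, 8.1]])
X_test = np.array([[5_000_000.0, 5.9]])

scaler = MinMaxScaler()
# Fit only on the training data to avoid leaking test statistics.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```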
General Structure Used to Train Models
[Pipeline diagram: data preprocessing for the specific model]
Baseline Model: Stochastic Gradient Descent
Our initial approach was to use Stochastic Gradient Descent as our baseline model. With the help of the sklearn library, implementation is straightforward, and only a few parameters require tuning. We were able to achieve an RMSE of 0.26; however, given the model's simplicity, it seems unlikely that we can improve on this much further.
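A sketch of the baseline setup (the arrays here are random placeholders for our preprocessed features and ratings):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Random placeholders for the preprocessed data.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 5)), rng.random(100)
X_test, y_test = rng.random((20, 5)), rng.random(20)

sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
sgd.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, sgd.predict(X_test)))
print(f"RMSE: {rmse:.2f}")
```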
XGBoost
We imported XGBRegressor from the xgboost.sklearn package to implement the model. There are many parameters we can tune, and I decided to focus on some of the most important ones: min_child_weight, gamma, subsample, colsample_bytree, and max_depth.
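A sketch of how such a tuning run might look with GridSearchCV; the grid values below are illustrative, not the ones we actually settled on:

```python
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBRegressor

# Illustrative search grid over the parameters named above.
param_grid = {
    "min_child_weight": [1, 5, 10],
    "gamma": [0, 0.5, 1.0],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [3, 5, 7],
}

model = XGBRegressor(booster="gbtree", objective="reg:squarederror")
search = GridSearchCV(
    model, param_grid, scoring="neg_root_mean_squared_error", cv=3
)
# search.fit(X_train, y_train)  # arrays from the preprocessing step
# print(search.best_params_)
```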
Neural Networks (MLP, CNN)
Next, I implemented several neural network models using PyTorch.
MLP
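A minimal sketch of an MLP of the kind used here; the layer sizes and feature count are illustrative, not the exact architecture:

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid(),  # the ratings we predict lie in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = MLP(n_features=20)
x = torch.randn(8, 20)   # a dummy batch of 8 samples
print(model(x).shape)    # torch.Size([8, 1])
```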
CNN
I also tried a convolutional NN on our dataset, but I quickly realized it might not be a good choice, since the data is tabular and has only one channel.
Wide & Deep Learning
So far, all the models have been trained on individual users, which means that we haven't explored movie-to-movie or user-to-user relationships. After implementing and testing many basic models, I chose to stand on the shoulders of giants (namely Google engineers hahahah) and found the Wide & Deep Learning model, used for Google Play app recommendations (Cheng et al., 2016).
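A rough sketch of the Wide & Deep idea: a linear ("wide") component over sparse or crossed features is summed with a deep MLP over dense features and the two are trained jointly. All dimensions below are illustrative:

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, n_wide: int, n_deep: int):
        super().__init__()
        # Wide part: a linear model over sparse/cross features.
        self.wide = nn.Linear(n_wide, 1)
        # Deep part: an MLP over dense features.
        self.deep = nn.Sequential(
            nn.Linear(n_deep, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x_wide: torch.Tensor, x_deep: torch.Tensor) -> torch.Tensor:
        # Joint training: sum the wide and deep outputs, then squash
        # through a sigmoid to match our 0-to-1 rating target.
        return torch.sigmoid(self.wide(x_wide) + self.deep(x_deep))

model = WideAndDeep(n_wide=100, n_deep=20)
out = model(torch.randn(8, 100), torch.randn(8, 20))
print(out.shape)  # torch.Size([8, 1])
```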