⚠️UNDER CONSTRUCTION⚠️
Data Cleaning
Categorical Columns
To enhance our prediction accuracy, we incorporated three categorical columns: belongs_to_collection, genres, and original_language. Since each movie can belong to at most one collection, we left that column as is for models capable of handling categorical values natively, and used one-hot encoding for the other models. The same technique was employed to encode the genres and language columns.
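A minimal sketch of the one-hot encoding step (the DataFrame name `movies` and the sample rows are illustrative; in the real data, genres can hold multiple values per movie and would need to be split out first):

```python
import pandas as pd

# Illustrative stand-in for the loaded movie data.
movies = pd.DataFrame({
    "belongs_to_collection": ["Toy Story Collection", None, None],
    "genres": ["Animation", "Comedy", "Drama"],
    "original_language": ["en", "en", "fr"],
})

# One-hot encode the categorical columns for models that cannot
# handle categorical values natively.
encoded = pd.get_dummies(
    movies,
    columns=["belongs_to_collection", "genres", "original_language"],
)
print(encoded.head())
```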
Numerical Columns
For the XGBoost model, since we chose gbtree as our booster, normalizing or rescaling the numerical features would not significantly affect the results: tree-based boosters split on feature thresholds, so any monotonic rescaling leaves the splits unchanged. (However, if we had chosen booster=gblinear, we would need to normalize the features to achieve better results.)
When it comes to deep learning models, normalization is a must, especially since we are dealing with both very large values and relatively small ones. For instance, revenue may be in the millions, vote averages run from 0 to 10, and the ratings we are predicting range only from 0 to 1. To normalize the data, I used the MinMaxScaler from the sklearn.preprocessing package.
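A sketch of the scaling step, with placeholder arrays standing in for our numerical features:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder matrices for our numerical features
# (e.g. revenue, vote_average, ...).
X_train = np.array([[1_000_000.0, 7.2], [250_000_000.0, 6.5], [30_000.0, 8.1]])
X_test = np.array([[5_000_000.0, 5.9]])

scaler = MinMaxScaler()
# Fit only on the training data to avoid leaking test statistics.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```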
General Structure Used to Train Models
[Pipeline diagram: data preprocessing for the specific model]
Baseline Model: Stochastic Gradient Descent
Our initial approach was to use Stochastic Gradient Descent as our baseline model. With the help of the sklearn library, implementation is straightforward, and only a few parameters require tuning. We were able to achieve an RMSE of 0.26; however, given the model's simplicity, it seems unlikely that we can improve on this much further.
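A sketch of the baseline setup (the arrays here are random placeholders for our preprocessed features and ratings):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Random placeholders for the preprocessed data.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 5)), rng.random(100)
X_test, y_test = rng.random((20, 5)), rng.random(20)

sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
sgd.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, sgd.predict(X_test)))
print(f"RMSE: {rmse:.2f}")
```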
XGBoost
We imported XGBRegressor from the xgboost.sklearn package to implement the model. There are many parameters we can tune, and I decided to focus on some of the most important ones: min_child_weight, gamma, subsample, colsample_bytree, and max_depth.
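A sketch of how such a tuning run might look with GridSearchCV; the grid values below are illustrative, not the ones we actually settled on:

```python
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBRegressor

# Illustrative search grid over the parameters named above.
param_grid = {
    "min_child_weight": [1, 5, 10],
    "gamma": [0, 0.5, 1.0],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [3, 5, 7],
}

model = XGBRegressor(booster="gbtree", objective="reg:squarederror")
search = GridSearchCV(
    model, param_grid, scoring="neg_root_mean_squared_error", cv=3
)
# search.fit(X_train, y_train)  # arrays from the preprocessing step
# print(search.best_params_)
```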
Neural Networks (MLP, CNN)
Next, I implemented several neural network models using PyTorch.
MLP
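A minimal sketch of an MLP of the kind used here; the layer sizes and feature count are illustrative, not the exact architecture:

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid(),  # the ratings we predict lie in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = MLP(n_features=20)
x = torch.randn(8, 20)   # a dummy batch of 8 samples
print(model(x).shape)    # torch.Size([8, 1])
```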
CNN
I also tried a convolutional NN on our dataset, but I quickly realized it might not be a good choice, since the data is tabular and has only one channel.
Wide & Deep Learning
So far, all the models have been trained on individual users, which means that we haven't explored movie-to-movie or user-to-user relationships. After implementing and testing many basic models, I chose to stand on the shoulders of giants (namely Google engineers hahahah) and found the Wide & Deep Learning model, used for Google Play app recommendations (Cheng et al., 2016).
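A rough sketch of the Wide & Deep idea: a linear ("wide") component over sparse or crossed features is summed with a deep MLP over dense features and the two are trained jointly. All dimensions below are illustrative:

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, n_wide: int, n_deep: int):
        super().__init__()
        # Wide part: a linear model over sparse/cross features.
        self.wide = nn.Linear(n_wide, 1)
        # Deep part: an MLP over dense features.
        self.deep = nn.Sequential(
            nn.Linear(n_deep, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x_wide: torch.Tensor, x_deep: torch.Tensor) -> torch.Tensor:
        # Joint training: sum the wide and deep outputs, then squash
        # through a sigmoid to match our 0-to-1 rating target.
        return torch.sigmoid(self.wide(x_wide) + self.deep(x_deep))

model = WideAndDeep(n_wide=100, n_deep=20)
out = model(torch.randn(8, 100), torch.randn(8, 20))
print(out.shape)  # torch.Size([8, 1])
```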