Download the regression and classification dataset:
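A minimal sketch of the download step, assuming the two archives are available at direct-download links; the URLs and archive names below are placeholders, not the actual dataset locations.

```python
# Minimal sketch: download both dataset archives into the working directory.
# The URLs and file names are placeholders for the real dataset locations.
import urllib.request

DATASET_URLS = {
    "regression.zip": "https://example.com/path/to/regression.zip",          # placeholder URL
    "classification.zip": "https://example.com/path/to/classification.zip",  # placeholder URL
}

for filename, url in DATASET_URLS.items():
    urllib.request.urlretrieve(url, filename)  # save each archive locally
    print(f"Downloaded {filename}")
```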
Unzip both datasets to target folders:
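A minimal sketch of the extraction step, assuming the archives downloaded above; the target folder names are illustrative.

```python
# Minimal sketch: extract each archive into its own target folder.
import zipfile

ARCHIVES = {
    "regression.zip": "data/regression",          # assumed target folder
    "classification.zip": "data/classification",  # assumed target folder
}

for archive, target in ARCHIVES.items():
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)  # unpack every file in the archive into the target folder
```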
A function to write model predictions to a CSV file that can be uploaded directly to Kaggle for score comparison
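A minimal sketch of such a helper; the column names `Id` and `pop` are assumptions and should be replaced with whatever the competition's sample submission file uses.

```python
# Minimal sketch of a submission helper; column names are assumptions
# and must match the competition's sample_submission.csv.
import pandas as pd

def write_submission(ids, predictions, path="submission.csv"):
    """Write predictions to a CSV file in the format Kaggle expects for upload."""
    submission = pd.DataFrame({"Id": ids, "pop": predictions})
    submission.to_csv(path, index=False)  # Kaggle submissions must not include the index
    return path
```

The resulting file can then be uploaded on the competition page (or with the Kaggle CLI's `kaggle competitions submit` command) to obtain the leaderboard score.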
A regression problem that aims to predict the popularity score of a song
RMSE on Kaggle test data: 8.91431
RMSE on Kaggle test data: 9.00005
RMSE on Kaggle test data: 7.59224 (score when the full training set is used to train the model)
Conclusion
A classification problem that aims to predict the top genre that a song belongs to
Null top genre values were filled with "adult standards" because it was the most frequent genre, as identified in an earlier plot.
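A minimal sketch of this imputation step, assuming the genre column is named `top genre` and the training file lives at the path shown; both are assumptions.

```python
# Minimal sketch: fill missing genres with the most frequent genre found earlier.
# The column name "top genre" and the file path are assumptions.
import pandas as pd

train = pd.read_csv("data/classification/train.csv")
train["top genre"] = train["top genre"].fillna("adult standards")
```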
One-hot encoding the title and artist columns for both the training and test datasets
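A minimal sketch of the encoding step, assuming pandas DataFrames named `train` and `test` with `title` and `artist` columns; aligning afterwards keeps the two column sets consistent.

```python
# Minimal sketch: one-hot encode the categorical columns in both splits,
# then align so the test set has exactly the training set's dummy columns.
import pandas as pd

train_enc = pd.get_dummies(train, columns=["title", "artist"])
test_enc = pd.get_dummies(test, columns=["title", "artist"])

# Categories absent from the test set become all-zero columns.
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)
```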
We scaled the data with RobustScaler() because it performed best of the scalers we tried. StandardScaler standardizes each feature to zero mean and unit variance, following the standard normal distribution. RobustScaler, by contrast, scales features using statistics that are robust to outliers: it removes the median and scales the data by the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile), known as the interquartile range (IQR).
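A minimal sketch of the scaling step, assuming `X_train` and `X_test` are the numeric feature matrices; the scaler is fit on the training data only and reused for the test data.

```python
# Minimal sketch: centre on the median and scale by the IQR, fitting on training data only.
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn median and IQR from training data
X_test_scaled = scaler.transform(X_test)        # apply the same transformation to test data
```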
Accuracy on Kaggle test data: 0.23214
Accuracy on Kaggle test data: 0.50000
Gaussian Naive Bayes is a variant of Naive Bayes that assumes continuous features follow a Gaussian (normal) distribution. Naive Bayes is a family of supervised classification algorithms based on Bayes' theorem; despite being a simple technique, it often performs well in practice.
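A minimal sketch of fitting the classifier, assuming the scaled feature matrices from above and a label vector `y_train`; variable names are assumptions.

```python
# Minimal sketch: fit Gaussian Naive Bayes and predict the top genre for the test songs.
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train_scaled, y_train)          # one Gaussian per feature per class
test_pred = gnb.predict(X_test_scaled)    # predicted top genre for each test song
```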
Computing the accuracy score on the hold-out portion of the training data that we split off earlier
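A minimal sketch of that check, assuming a hold-out split `(X_val, y_val)` was created earlier (e.g. with `train_test_split`); the names are assumptions.

```python
# Minimal sketch: score the fitted model on the hold-out validation split.
from sklearn.metrics import accuracy_score

val_pred = gnb.predict(X_val)
print("Validation accuracy:", accuracy_score(y_val, val_pred))
```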
Accuracy on Kaggle test data: 0.41071 (score when the full training set is used to train the model)