Download the regression and classification datasets:
Unzip both datasets to target folders:
A helper function that writes model predictions to a CSV file, which can then be uploaded to Kaggle to compare scores
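The notebook's own helper is not shown here, so the following is only a minimal sketch of what such a function might look like. The function name save_predictions, its parameters, and the two-column layout (Id plus the target column) are assumptions; the column names Id and pop are taken from the dataset summary in this notebook.

```python
import os
import tempfile

import pandas as pd


def save_predictions(ids, predictions, target_column, path):
    """Write an Id / prediction table to a CSV file (hypothetical helper)."""
    submission = pd.DataFrame({"Id": list(ids), target_column: list(predictions)})
    # Leave out the pandas index so that only the two named columns are written.
    submission.to_csv(path, index=False)
    return submission


# Toy usage with made-up ids and predictions, written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "submission.csv")
    save_predictions([454, 455, 456], [61.5, 48.0, 70.2], "pop", csv_path)
    written = pd.read_csv(csv_path)
```

The same helper could be reused for the classification task by passing "top genre" as the target column, assuming the competition expects that column name.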
A regression problem that aims to predict the popularity score of a song
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 453 entries, 0 to 452
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Id         453 non-null    int64
 1   title      453 non-null    object
 2   artist     453 non-null    object
 3   top genre  438 non-null    object
 4   year       453 non-null    int64
 5   bpm        453 non-null    int64
 6   nrgy       453 non-null    int64
 7   dnce       453 non-null    int64
 8   dB         453 non-null    int64
 9   live       453 non-null    int64
 10  val        453 non-null    int64
 11  dur        453 non-null    int64
 12  acous      453 non-null    int64
 13  spch       453 non-null    int64
 14  pop        453 non-null    int64
dtypes: int64(12), object(3)
memory usage: 53.2+ KB
RMSE on validation data: 12.13173739137971
RMSE on Kaggle test data: 8.91431
RMSE on validation data: 12.905732435706527
RMSE on Kaggle test data: 9.00005
RMSE on validation data: 12.061972785704496
RMSE on Kaggle test data: 7.59224 (model trained on the full training data)
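For reference, the RMSE (root mean squared error) values reported above are the square root of the average squared difference between true and predicted popularity scores. The sketch below shows that computation on made-up numbers; the function name rmse and the toy values are assumptions and do not come from the notebook's models.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy example: errors are -2, 3 and -6, so the squared errors sum to 49.
toy_rmse = rmse([60, 70, 80], [62, 67, 86])
```

Because the errors are squared before averaging, a few badly mispredicted songs raise the RMSE more than many small misses do.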
Conclusion
A classification problem that aims to predict the top genre that a song belongs to
['title', 'artist', 'top genre']
Take Good Care Of My Baby - 1990 Remastered 2
Please Mr. Postman 2
I Was Made For Lovin' You 1
Mr. Soul 1
(I Just) Died In Your Arms 1
..
My Way 1
I Got 5 On It 1
7 Days 1
Rebel Rebel - 2016 Remaster 1
I'll Be Missing You (feat. 112) 1
Name: title, Length: 451, dtype: int64
Elton John 9
Queen 7
ABBA 7
The Beatles 6
Mariah Carey 5
..
Stargazers 1
Laid Back 1
The Tornados 1
John Denver 1
Gabrielle 1
Name: artist, Length: 345, dtype: int64
adult standards 68
album rock 66
dance pop 61
glam rock 16
brill building pop 16
..
blues 1
glam punk 1
british dance band 1
louisiana blues 1
hip pop 1
Name: top genre, Length: 86, dtype: int64
Null values in the top genre column were replaced with "adult standards" because it is the most frequent genre, as identified by an earlier plot.
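A minimal sketch of this kind of imputation is shown below, assuming the most frequent value is looked up with pandas rather than hard-coded. The toy DataFrame is made up for illustration.

```python
import pandas as pd

# Toy data with two missing genres (illustrative values only).
toy = pd.DataFrame(
    {"top genre": ["adult standards", "album rock", None, "adult standards", None]}
)

# mode() ignores missing values by default and returns the most frequent entries.
most_frequent = toy["top genre"].mode()[0]

# Replace every missing genre with that most frequent value.
toy["top genre"] = toy["top genre"].fillna(most_frequent)
```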
15
There are 12 numerical variables
The numerical variables are : ['Id', 'year', 'bpm', 'nrgy', 'dnce', 'dB', 'live', 'val', 'dur', 'acous', 'spch', 'pop']
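One possible way to obtain such a list of numerical columns is pandas' select_dtypes. The sketch below uses a small made-up DataFrame with a few of the same column names; it is not necessarily how the notebook built its list.

```python
import pandas as pd

# Toy frame reusing a few column names from the dataset (values are made up).
toy = pd.DataFrame(
    {
        "Id": [1, 2],
        "title": ["My Way", "7 Days"],
        "bpm": [120, 98],
        "pop": [60, 71],
    }
)

# Columns with a numeric dtype, and everything else.
numerical = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()
```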
One-hot encode the title and artist columns in both the training and test datasets
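The encoding code itself is not shown, so the following is a sketch of one common approach, assuming scikit-learn's OneHotEncoder is fitted on the training data and then reused on the test data. Setting handle_unknown="ignore" is an assumption made here so that artists seen only in the test set do not raise an error. The toy frames are made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training and test frames (illustrative values only).
train = pd.DataFrame({"artist": ["Queen", "ABBA", "Queen"]})
test = pd.DataFrame({"artist": ["ABBA", "Gabrielle"]})

# Fit on the training data only; unseen test categories become all-zero rows.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train[["artist"]]).toarray()
test_encoded = encoder.transform(test[["artist"]]).toarray()
```

Fitting the encoder once and applying it to both datasets keeps the encoded columns identical in number and order, which the models need at prediction time.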
We scaled the data with RobustScaler() because it performed best of the three scalers we compared. StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so the result has mean 0 and unit variance; it does not make the data normally distributed, and both statistics are sensitive to outliers. RobustScaler instead uses statistics that are robust to outliers: it subtracts the median and divides by the interquartile range (IQR), i.e. the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
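The difference between the two scalers can be illustrated on a single made-up feature that contains one outlier. This is only a sketch on toy values, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a clear outlier (illustrative values only).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler: (x - median) / IQR, here median = 3 and IQR = 4 - 2 = 2.
robust = RobustScaler().fit_transform(x)

# StandardScaler: (x - mean) / standard deviation, both pulled up by the outlier.
standard = StandardScaler().fit_transform(x)
```

With RobustScaler the four ordinary values stay evenly spaced between -1 and 0.5, while with StandardScaler they are squeezed together because the outlier inflates the standard deviation.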
Accuracy on validation data: 0.31868131868131866
Accuracy score on Kaggle test data: 0.23214
Accuracy on validation data: 0.46153846153846156
Accuracy on Kaggle test data: 0.50000
Naive Bayes classifiers are a family of supervised machine learning algorithms based on Bayes' theorem, with the "naive" assumption that features are conditionally independent given the class. Gaussian Naive Bayes is the variant that models each feature within each class as normally (Gaussian) distributed, which makes it suitable for continuous data. It is a simple classification technique, but often works well in practice.
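As a minimal sketch of how Gaussian Naive Bayes is typically used in scikit-learn, the example below fits GaussianNB on a tiny made-up dataset with two well-separated classes. The feature values are invented for illustration, and the labels merely reuse two genre names from the data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two continuous features, three toy samples per class (made-up values).
X = np.array(
    [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [5.0, 6.0], [5.1, 5.8], [4.9, 6.2]]
)
y = np.array(["album rock"] * 3 + ["dance pop"] * 3)

# Fitting estimates a per-class mean and variance for every feature.
model = GaussianNB().fit(X, y)

# Each new sample is assigned the class with the highest posterior probability.
predicted = model.predict([[1.1, 2.0], [5.0, 6.1]])
```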
Compute the accuracy score on the validation portion of the training data that we split off earlier
Model accuracy score: 0.2857
                     precision    recall  f1-score   support

adult standards           1.00      0.22      0.36        23
afropop                   0.00      0.00      0.00         1
album rock                1.00      0.60      0.75        10
alternative metal         0.00      0.00      0.00         0
art pop                   0.00      0.00      0.00         0
art rock                  0.00      0.00      0.00         2
atl hip hop               1.00      0.33      0.50         3
barbadian pop             1.00      0.67      0.80         3
bebop                     0.00      0.00      0.00         1
belgian pop               0.00      0.00      0.00         0
blues rock                0.00      0.00      0.00         0
bow pop                   0.00      0.00      0.00         1
boy band                  1.00      1.00      1.00         2
brill building pop        0.00      0.00      0.00         4
brit funk                 0.00      0.00      0.00         1
british blues             0.00      0.00      0.00         1
british folk              0.00      0.00      0.00         1
british invasion          0.00      0.00      0.00         1
british soul              0.00      0.00      0.00         1
britpop                   0.00      0.00      0.00         1
classic rock              0.00      0.00      0.00         0
classic soul              0.00      0.00      0.00         0
classic uk pop            0.00      0.00      0.00         2
country rock              0.00      0.00      0.00         1
dance pop                 1.00      0.21      0.35        14
dance rock                0.00      0.00      0.00         2
deep adult standards      0.00      0.00      0.00         1
disco house               0.00      0.00      0.00         1
doo-wop                   0.00      0.00      0.00         0
east coast hip hop        0.00      0.00      0.00         0
eurodance                 0.14      0.50      0.22         2
europop                   1.00      0.75      0.86         4
g funk                    0.00      0.00      0.00         0
glam punk                 0.00      0.00      0.00         1
glam rock                 1.00      1.00      1.00         2
hip hop                   0.00      0.00      0.00         1
merseybeat                0.00      0.00      0.00         1
new wave pop              0.00      0.00      0.00         1
permanent wave            0.00      0.00      0.00         1
pop                       0.25      1.00      0.40         1
soft rock                 0.00      0.00      0.00         0

accuracy                                      0.29        91
macro avg                 0.20      0.15      0.15        91
weighted avg              0.68      0.29      0.36        91
Accuracy on Kaggle test data: 0.41071 (model trained on the full training data)
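The validation accuracy and the per-genre report above are the kind of output produced by scikit-learn's accuracy_score and classification_report. The sketch below shows both on a handful of made-up labels; passing zero_division=0 is an assumption made here so that genres with no predicted or no true samples are reported as 0.00 without a warning.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up true and predicted genres for four songs.
y_true = ["pop", "pop", "europop", "glam rock"]
y_pred = ["pop", "europop", "europop", "pop"]

# Fraction of songs whose predicted genre matches the true genre (2 of 4 here).
acc = accuracy_score(y_true, y_pred)

# Per-genre precision, recall, f1-score and support as a formatted string.
report = classification_report(y_true, y_pred, zero_division=0)
```

With many genres that have only one or two songs, as in this dataset, the macro average weights every genre equally and is therefore dragged down by the rare genres, while the weighted average follows the larger classes.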