Music Recommendation Exercise (6/17/21) | Mathaus Silva
In preparation for next class's recommender system, we will clean a dataset of other people's song preferences as well as my own. First, we import the appropriate libraries and load the music recommendation data.
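As a minimal sketch of the loading step, the snippet below reads a tab-separated file and skips rows with too many fields. The inline sample data, the column names, and the helper name load_jams are assumptions made for illustration; on the real data the call would be load_jams("jams.tsv"). It also assumes pandas 1.3 or later, where on_bad_lines replaced the older error_bad_lines=False option.

```python
import io

import pandas as pd

# Made-up stand-in for jams.tsv: tab-separated, 7 fields per row.
# The column names are assumptions, not taken from the real file.
SAMPLE_TSV = (
    "jam_id\tuser_id\tartist\ttitle\tcreation_date\tlink\tspotify_uri\n"
    "j1\tu1\tJourney\tDon't Stop Believin'\t2011-08-26\thttp://example.com/1\tspotify:track:1\n"
    # This row has 8 fields, so it is malformed and gets skipped.
    "j2\tu2\tQueen\tBohemian Rhapsody\t2011-08-27\thttp://example.com/2\tspotify:track:2\textra\n"
    "j3\tu1\tQueen\tUnder Pressure\t2011-08-28\thttp://example.com/3\tspotify:track:3\n"
)


def load_jams(source):
    """Read a tab-separated jams file, skipping rows with too many fields."""
    return pd.read_csv(source, sep="\t", on_bad_lines="skip")


jams = load_jams(io.StringIO(SAMPLE_TSV))
print(jams.shape)
```

Passing on_bad_lines="warn" instead also skips the malformed rows but reports each one, which is closer to the messages shown in the output.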
Skipping line 7872: expected 7 fields, saw 8
Skipping line 11730: expected 7 fields, saw 9
Skipping line 14131: expected 7 fields, saw 8
Skipping line 58054: expected 7 fields, saw 8
Skipping line 58754: expected 7 fields, saw 8
Skipping line 847129: expected 7 fields, saw 8
Skipping line 1091153: expected 7 fields, saw 8
Skipping line 1175375: expected 7 fields, saw 8
Skipping line 1225935: expected 7 fields, saw 8
Skipping line 1255357: expected 7 fields, saw 8
Skipping line 1279671: expected 7 fields, saw 8
Skipping line 1330675: expected 7 fields, saw 8
Skipping line 1448033: expected 7 fields, saw 8
Skipping line 1543893: expected 7 fields, saw 8
Skipping line 1579569: expected 7 fields, saw 8
Skipping line 1612448: expected 7 fields, saw 8
Skipping line 1784588: expected 7 fields, saw 8
Then, we load the full set of unique user IDs into a pandas Series, so that no user appears more than once.
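One way this step might look is sketched below. The column name user_id, the helper name unique_user_ids, and the tiny DataFrame are assumptions for illustration.

```python
import pandas as pd

# Made-up jams table in which some users appear more than once.
jams = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u1", "u3", "u2"],
        "title": ["A", "B", "C", "D", "E"],
    }
)


def unique_user_ids(frame, column="user_id"):
    """Return each distinct user ID once, in order of first appearance."""
    return pd.Series(frame[column].unique(), name=column)


users = unique_user_ids(jams)
print(users.tolist())
```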
With the Series of unique user IDs in place, we use the sample() method to select a random subset of users, so that we don’t have to work with the entire jams file.
To avoid long computation times, I limited the random subset to 1,000 users, which should still give a reasonable representation of the full dataset.
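A sketch of the sampling step follows, assuming a Series of unique IDs like the one described above. The generated IDs, the helper name sample_users, and the fixed seed are assumptions; the seed is there only so that the same subset is drawn on every run.

```python
import pandas as pd

# Made-up population of 5,000 unique user IDs.
users = pd.Series([f"user{i}" for i in range(5000)], name="user_id")


def sample_users(user_ids, n=1000, seed=0):
    """Draw n user IDs at random, without replacement."""
    return user_ids.sample(n=n, random_state=seed)


sampled = sample_users(users)
print(len(sampled))
```

Because sample() draws without replacement by default, the subset stays free of duplicates.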
Next, we load every jam (from the jams.tsv DataFrame) belonging to each of the 1,000 sampled users. On average, there are roughly 15 jams per user, so we should end up with around 15,000 songs.
An output of 16,408 rows is close to that estimate, which suggests the code ran as intended; it works out to about 16.4 jams per user.
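The filtering step could be sketched with isin(), as below. The column names, the helper name jams_for_users, and the toy data are assumptions; the last line mirrors the jams-per-user check described above.

```python
import pandas as pd

# Made-up jams table and a made-up sample of two users.
jams = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u1", "u3", "u2", "u4"],
        "artist": ["Journey", "Queen", "Queen", "Journey", "Queen", "Journey"],
        "title": ["A", "B", "C", "D", "E", "F"],
    }
)
sampled_users = pd.Series(["u1", "u3"], name="user_id")


def jams_for_users(frame, user_ids, column="user_id"):
    """Keep only the rows whose user ID is in the sampled set."""
    return frame[frame[column].isin(user_ids)]


jam = jams_for_users(jams, sampled_users)
jams_per_user = len(jam) / sampled_users.nunique()
print(len(jam), jams_per_user)
```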
Next, we only need three columns from 'jam': user ID, artist, and song title. We can discard all the remaining columns.
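Selecting the three columns might look like the sketch below. The column names are assumptions, since the real header of jams.tsv is not shown here.

```python
import pandas as pd

# Made-up 'jam' table with seven columns (names are assumptions).
jam = pd.DataFrame(
    {
        "jam_id": ["j1", "j2"],
        "user_id": ["u1", "u2"],
        "artist": ["Journey", "Queen"],
        "title": ["Don't Stop Believin'", "Under Pressure"],
        "creation_date": ["2011-08-26", "2011-08-27"],
        "link": ["http://example.com/1", "http://example.com/2"],
        "spotify_uri": ["spotify:track:1", "spotify:track:2"],
    }
)

KEEP = ["user_id", "artist", "title"]

# Indexing with a list of names keeps only those columns, in that order.
jam = jam[KEEP]
print(list(jam.columns))
```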
In order to give each song a unique name string, we need to combine the artist and song title into a single column. That is, rather than one column with “Don’t Stop Believin’” as song title and another with “Journey” as artist, we create a new column called “song” that contains text like “Don’t Stop Believin’, by Journey”.
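A minimal sketch of the combination step is shown below. It works on an explicit copy, which is one way to avoid the SettingWithCopyWarning that pandas can raise when a column is assigned on a slice of another DataFrame. The column names and the helper name add_song_column are assumptions.

```python
import pandas as pd

# Made-up table with the three columns kept so far (names are assumptions).
jam = pd.DataFrame(
    {
        "user_id": ["u1", "u2"],
        "artist": ["Journey", "Queen"],
        "title": ["Don't Stop Believin'", "Under Pressure"],
    }
)


def add_song_column(frame):
    """Return a copy of frame with a 'song' column like 'Title, by Artist'."""
    out = frame.copy()  # explicit copy, so the assignment is unambiguous
    out["song"] = out["title"] + ", by " + out["artist"]
    return out


jam = add_song_column(jam)
print(jam["song"].tolist())
```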
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
Now that we have added the 'song' column, we can drop the original title and artist columns so that the final jams DataFrame contains just two columns, user and song.
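Dropping the two source columns could be sketched as follows, again with assumed column names and toy data.

```python
import pandas as pd

# Made-up table that already has the combined 'song' column.
jam = pd.DataFrame(
    {
        "user_id": ["u1", "u2"],
        "artist": ["Journey", "Queen"],
        "title": ["Don't Stop Believin'", "Under Pressure"],
        "song": ["Don't Stop Believin', by Journey", "Under Pressure, by Queen"],
    }
)

# drop() returns a new DataFrame without the listed columns.
jam = jam.drop(columns=["title", "artist"])
print(list(jam.columns))
```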
With 'jam' reduced to just two columns, user and song, we now export the DataFrame to a new CSV file named 'jam-sample.csv', which we will analyze in class.
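The export might look like the sketch below. It writes to a temporary directory so that the example does not touch any existing file, and the column name user_id is an assumption; in the notebook the path would simply be 'jam-sample.csv'.

```python
import os
import tempfile

import pandas as pd

# Made-up two-column table standing in for the final 'jam' DataFrame.
jam = pd.DataFrame(
    {
        "user_id": ["u1", "u2"],
        "song": ["Don't Stop Believin', by Journey", "Under Pressure, by Queen"],
    }
)

out_path = os.path.join(tempfile.mkdtemp(), "jam-sample.csv")

# index=False leaves out the row index, so the file has only user and song.
# Song strings contain commas; to_csv quotes such fields automatically.
jam.to_csv(out_path, index=False)

# Read the file back as a quick check that it round-trips.
check = pd.read_csv(out_path)
print(check.shape)
```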