Music Recommendation Exercise (6/17/21) | Mathaus Silva
In preparation for next class's recommender system, we will clean a dataset of other people's song preferences along with my own. First, we import the appropriate libraries and load the music recommendation data.
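A minimal sketch of the import-and-load step. The real file is jams.tsv; here a tiny in-memory stand-in is used so the example runs on its own, and the column names (jam_id, user_id, artist, title) are assumptions that may not match the actual file.

```python
import io
import pandas as pd

# Hypothetical stand-in for jams.tsv; the column names are assumed.
sample_tsv = (
    "jam_id\tuser_id\tartist\ttitle\n"
    "j1\tu1\tJourney\tDon't Stop Believin'\n"
    "j2\tu2\tQueen\tBohemian Rhapsody\n"
    "j3\tu1\tQueen\tUnder Pressure\n"
)

# A .tsv file is tab-separated, so sep="\t" is passed to read_csv.
# With the real data this would read from the file path instead,
# e.g. pd.read_csv("jams.tsv", sep="\t").
jams = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(jams.shape)
```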
Then, we load the full set of unique user IDs into a pandas Series, so that no user appears more than once.
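One way to build such a Series is sketched below, assuming the user IDs live in a column named user_id (an assumed name) and using a small made-up DataFrame in place of the full data.

```python
import pandas as pd

# Hypothetical miniature of the jams data; "user_id" is an assumed column name.
jams = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3", "u2"],
    "title": ["A", "B", "C", "D", "E"],
})

# unique() returns each ID once, in order of first appearance;
# wrapping the result in pd.Series gives access to Series methods later.
users = pd.Series(jams["user_id"].unique(), name="user_id")
print(len(users), users.is_unique)
```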
With the Series of unique user IDs in place, we then use the sample() method to select a random subset of users to work with, so that we don’t have to deal with the entire jams file.
To avoid long computation times, I limited the random subset to 1,000 users, which should still give a sufficient representation of the full dataset.
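A small sketch of the sampling step, using generated placeholder IDs rather than the real users. The random_state argument is an addition for reproducibility and is not something the exercise specifies.

```python
import pandas as pd

# Hypothetical Series of 5,000 unique user IDs standing in for the real one.
users = pd.Series([f"user_{i}" for i in range(5000)], name="user_id")

# sample() draws without replacement by default, so no user is picked twice.
# random_state is an assumed extra that makes the draw repeatable.
user_sample = users.sample(n=1000, random_state=42)
print(len(user_sample))
```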
Next, we load every jam (from the jams.tsv DataFrame) posted by any of the 1,000 sampled users. On average, there are roughly 15 jams per user, so we should end up with around 15,000 jams.
An output of 16,408 rows is consistent with that estimate and suggests the code ran as intended, averaging around 16 jams per user.
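The filtering step can be sketched with isin(), which keeps only the rows whose user ID appears in the sample. The data and the user_id column name below are made up for illustration; the variable name jam follows the text.

```python
import pandas as pd

# Hypothetical miniature of the full jams DataFrame.
jams = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3", "u2"],
    "artist": ["Journey", "Queen", "Queen", "Toto", "ABBA"],
    "title": ["Don't Stop Believin'", "Bohemian Rhapsody",
              "Under Pressure", "Africa", "Waterloo"],
})

# Hypothetical sample of users (the exercise samples 1,000).
user_sample = pd.Series(["u1", "u3"], name="user_id")

# Boolean mask: True where the row's user is in the sample.
jam = jams[jams["user_id"].isin(user_sample)]

# Sanity check on the result: rows kept and average jams per sampled user.
print(len(jam), len(jam) / jam["user_id"].nunique())
```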
Next, we only need three columns from 'jam': user ID, artist, and song title. We can discard all the remaining columns.
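A sketch of the column selection, assuming the three columns are named user_id, artist, and title. The extra columns in the sample data are invented to show what gets discarded.

```python
import pandas as pd

# Hypothetical sampled jams with some extra columns we do not need.
jam = pd.DataFrame({
    "jam_id": ["j1", "j2"],
    "user_id": ["u1", "u3"],
    "artist": ["Journey", "Toto"],
    "title": ["Don't Stop Believin'", "Africa"],
    "link": ["http://example.com/1", "http://example.com/2"],
})

# Indexing with a list of names keeps only those columns, in that order.
jam = jam[["user_id", "artist", "title"]]
print(list(jam.columns))
```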
To give each song a unique name string, we need to combine the artist and song title into a single column. That is, rather than keeping “Don’t Stop Believin’” as the song title and “Journey” as the artist, we create a new column called “song” that contains text like “Don’t Stop Believin’, by Journey”.
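The combination can be sketched as plain string concatenation on the two columns, assuming they are named title and artist. Note that a row with a missing title or artist would yield a missing song value under this approach.

```python
import pandas as pd

# Hypothetical three-column jams sample; column names are assumed.
jam = pd.DataFrame({
    "user_id": ["u1", "u3"],
    "artist": ["Journey", "Toto"],
    "title": ["Don't Stop Believin'", "Africa"],
})

# Element-wise concatenation builds "<title>, by <artist>" for every row.
jam["song"] = jam["title"] + ", by " + jam["artist"]
print(jam["song"].iloc[0])
```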
Now that we have added the 'song' column, we can drop the original title and artist columns so that the final jams DataFrame contains just two columns, user and song.
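Dropping the two source columns might look like the sketch below, again with made-up rows and assumed column names.

```python
import pandas as pd

# Hypothetical jams sample after the "song" column has been added.
jam = pd.DataFrame({
    "user_id": ["u1", "u3"],
    "artist": ["Journey", "Toto"],
    "title": ["Don't Stop Believin'", "Africa"],
    "song": ["Don't Stop Believin', by Journey", "Africa, by Toto"],
})

# drop(columns=...) returns a new DataFrame without the named columns.
jam = jam.drop(columns=["title", "artist"])
print(list(jam.columns))
```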
Having reduced 'jam' to just two columns, user and song, we now export the DataFrame to a new CSV file named 'jam-sample.csv', which we will analyze in class.
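A sketch of the export, followed by a quick read-back check. It writes into a temporary directory so the example leaves nothing behind; in the notebook the file would presumably go straight to the working directory as 'jam-sample.csv'. Passing index=False is an assumption that the row index is not wanted in the file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical final two-column DataFrame.
jam = pd.DataFrame({
    "user_id": ["u1", "u3"],
    "song": ["Don't Stop Believin', by Journey", "Africa, by Toto"],
})

# Temporary directory used only to keep this sketch self-contained.
out_path = Path(tempfile.mkdtemp()) / "jam-sample.csv"

# index=False (assumed choice) omits the row index from the output file.
jam.to_csv(out_path, index=False)

# Read the file back to confirm the columns and rows survived the round trip.
check = pd.read_csv(out_path)
print(list(check.columns), len(check))
```

Because the song strings contain commas, to_csv quotes those fields automatically, and read_csv parses them back as single values.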