Importing the data into a pandas data frame. A Series of all unique user IDs is then created from the data frame.
import pandas as pd
df = pd.read_csv('jams.tsv', delimiter='\t', error_bad_lines=False)
user_id_array = df['user_id'].unique()
user_id_df = pd.Series(user_id_array)
Skipping line 7872: expected 7 fields, saw 8
Skipping line 11730: expected 7 fields, saw 9
Skipping line 14131: expected 7 fields, saw 8
Skipping line 58054: expected 7 fields, saw 8
Skipping line 58754: expected 7 fields, saw 8
Skipping line 847129: expected 7 fields, saw 8
Skipping line 1091153: expected 7 fields, saw 8
Skipping line 1175375: expected 7 fields, saw 8
Skipping line 1225935: expected 7 fields, saw 8
Skipping line 1255357: expected 7 fields, saw 8
Skipping line 1279671: expected 7 fields, saw 8
Skipping line 1330675: expected 7 fields, saw 8
Skipping line 1448033: expected 7 fields, saw 8
Skipping line 1543893: expected 7 fields, saw 8
Skipping line 1579569: expected 7 fields, saw 8
Skipping line 1612448: expected 7 fields, saw 8
Skipping line 1784588: expected 7 fields, saw 8
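The error_bad_lines argument used above was deprecated in pandas 1.3 and removed in pandas 2.0, where on_bad_lines takes its place. The sketch below assumes pandas 1.3 or later and uses a small in-memory TSV with made-up rows (not taken from jams.tsv) to show the same skip-and-continue behaviour; the names raw, toy_df and toy_user_ids are illustrative only.

```python
import io
import pandas as pd

# Toy tab-separated data (hypothetical, standing in for jams.tsv).
# The third data row has one field too many and is therefore malformed.
raw = (
    "user_id\tartist\ttitle\n"
    "u1\tDeerhoof\tPanda Panda Panda\n"
    "u2\tStevie Wonder\tAnother Star\n"
    "u3\tJeff Buckley\tLast Goodbye\textra\n"
    "u1\tLeftfield\tNot Forgotten\n"
)

# on_bad_lines='skip' drops malformed rows silently;
# on_bad_lines='warn' drops them and reports each skipped line.
toy_df = pd.read_csv(io.StringIO(raw), sep="\t", on_bad_lines="skip")

# Same follow-up step as in the notebook: one entry per distinct user.
toy_user_ids = pd.Series(toy_df["user_id"].unique())

print(len(toy_df))
print(toy_user_ids.tolist())
```

Note that skipped rows are lost entirely, so a user who appears only on malformed lines (u3 in this toy data) never reaches the list of unique IDs.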
Drawing a random sample of users from the dataset because of its large size. A sample of 1500 users should still allow us to draw valuable conclusions from the data while also allowing the code to run much faster. The 'random_state' argument fixes the random seed, so rerunning the code below always draws the same sample.
sample_users = user_id_df.sample(n=1500, random_state=1)
sample_users.name = 'User IDs'
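As a quick check of the reproducibility claim, the sketch below draws twice from a made-up Series of user IDs with the same seed. The IDs and the names toy_ids, first_draw and second_draw are hypothetical stand-ins, not part of the original analysis.

```python
import pandas as pd

# Hypothetical user IDs standing in for the unique IDs from the data frame.
toy_ids = pd.Series([f"user_{i}" for i in range(100)])

# Two draws with the same random_state select the same entries in the same order.
first_draw = toy_ids.sample(n=10, random_state=1)
second_draw = toy_ids.sample(n=10, random_state=1)

print(first_draw.equals(second_draw))

# sample() draws without replacement by default, so no ID is picked twice.
print(first_draw.is_unique)
```

Changing random_state to another integer gives a different, but equally repeatable, sample.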
In the code below, we first filter the data so that it only includes rows belonging to the users randomly selected in the step above. We then keep only the 3 columns of the data frame that we are interested in.
sample_df = df[df['user_id'].isin(sample_users)]
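To make the filtering step concrete, here is a minimal sketch on made-up data. isin builds a boolean mask that is True wherever the row's user ID appears among the selected users, and indexing with that mask keeps only those rows. The frame toy_df and the Series chosen_users are illustrative assumptions.

```python
import pandas as pd

# Toy listening log (hypothetical): several rows per user.
toy_df = pd.DataFrame({
    "user_id": ["a", "b", "a", "c", "b"],
    "title": ["t1", "t2", "t3", "t4", "t5"],
})

# Stand-in for the randomly sampled users.
chosen_users = pd.Series(["a", "c"])

# True for every row whose user_id is one of the chosen users.
mask = toy_df["user_id"].isin(chosen_users)

# Boolean indexing keeps all rows of the chosen users, not one row per user.
toy_sample = toy_df[mask]
print(toy_sample)
```

Because every row of a selected user is kept, the filtered frame can hold many more rows than there are sampled users.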
Keeping only the three columns of interest: the user ID, the artist of the song listened to, and the title of the song.
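# .copy() makes the selection an independent frame, so adding a column later does not trigger SettingWithCopyWarning
sample_df = sample_df[['user_id', 'artist', 'title']].copy()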
Creating a new column in our sample data that combines the title and artist columns into one.
sample_df['song'] = sample_df['title'] + ', by ' + sample_df['artist']
Dropping the artist and title columns now that they have been merged into the song column.
sample_df = sample_df.drop(['artist', 'title'], axis=1)
print(sample_df)
                                  user_id  \
819      beb66dbb2dfb84641aca09ec0a5cc02a
1425     58b7da09ae4259b00ad2c1a9a66c2f5d
1501     58b7da09ae4259b00ad2c1a9a66c2f5d
1554     58b7da09ae4259b00ad2c1a9a66c2f5d
1630     58b7da09ae4259b00ad2c1a9a66c2f5d
...                                   ...
2089807  422d10535044b9ecb49475f8eed21c79
2089866  e217c169981ef97a61ade22e31fb7bca
2089918  2e354cb4b77916289fcab1875983eec5
2089939  3bae1c3329b4da610f85969adbbe18b7
2089946  02b904da69c493bff4d6f9506b06d6c2

                                                      song
819                                Gemini, by Marek Hemann
1425                                 Lucky Saddle, by Ffwd
1501        Ketamine Entity, by Higher Intelligence Agency
1554                        Panda Panda Panda, by Deerhoof
1630            Not Forgotten (Original Mix), by Leftfield
...                                                    ...
2089807         Santa Monica Dream, by Angus & Julia Stone
2089866                        Valse Adieu, by Céline Dion
2089918               Last Goodbye (Live), by Jeff Buckley
2089939  gimme some lovin (stereo), by The Spencer Davi...
2089946                     Another Star, by Stevie Wonder

[23078 rows x 2 columns]
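The three column steps (select, combine, drop) can also be written as a single chain. The sketch below is an equivalent alternative on made-up data, not the code used in this notebook; toy_df, its jam_id column and toy_songs are hypothetical names.

```python
import pandas as pd

# Toy data (hypothetical); jam_id stands in for the columns we do not need.
toy_df = pd.DataFrame({
    "user_id": ["a", "b"],
    "artist": ["Deerhoof", "Stevie Wonder"],
    "title": ["Panda Panda Panda", "Another Star"],
    "jam_id": [1, 2],
})

toy_songs = (
    toy_df[["user_id", "artist", "title"]]                      # keep the columns of interest
    .assign(song=lambda d: d["title"] + ", by " + d["artist"])  # build the combined column
    .drop(columns=["artist", "title"])                          # drop the now redundant parts
)
print(toy_songs)
```

One thing to keep in mind with either form: if the title or the artist is missing in a row, the string concatenation yields a missing value for song in that row.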
Exporting our manipulated sample data frame to a new CSV file.