Import pandas, which we use to read in the dataset and to create a pandas Series.
Read in the jams TSV file. Rows with the wrong number of fields are skipped, as the messages below report.
b'Skipping line 7872: expected 7 fields, saw 8\nSkipping line 11730: expected 7 fields, saw 9\nSkipping line 14131: expected 7 fields, saw 8\nSkipping line 58054: expected 7 fields, saw 8\nSkipping line 58754: expected 7 fields, saw 8\n'
b'Skipping line 847129: expected 7 fields, saw 8\n'
b'Skipping line 1091153: expected 7 fields, saw 8\nSkipping line 1175375: expected 7 fields, saw 8\n'
b'Skipping line 1225935: expected 7 fields, saw 8\nSkipping line 1255357: expected 7 fields, saw 8\nSkipping line 1279671: expected 7 fields, saw 8\n'
b'Skipping line 1330675: expected 7 fields, saw 8\n'
b'Skipping line 1448033: expected 7 fields, saw 8\nSkipping line 1543893: expected 7 fields, saw 8\n'
b'Skipping line 1579569: expected 7 fields, saw 8\nSkipping line 1612448: expected 7 fields, saw 8\n'
b'Skipping line 1784588: expected 7 fields, saw 8\n'
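The "Skipping line" messages above are what pandas reports when a row has more fields than the header and the reader is told to skip it rather than raise. A minimal sketch of that behavior follows, using a small in-memory sample rather than the real file. The column names and values are hypothetical. The `on_bad_lines="skip"` argument assumes pandas 1.3 or newer; older versions used `error_bad_lines=False` for the same purpose.

```python
import io
import pandas as pd

# Hypothetical three-column sample; the real jams file has seven fields per row.
raw = (
    "jam_id\tuser_id\tartist\n"
    "j1\tu1\tRadiohead\n"
    "j2\tu2\tBjork\textra\n"  # one field too many, so this row is malformed
    "j3\tu3\tBeck\n"
)

# sep="\t" reads tab-separated values; on_bad_lines="skip" drops malformed rows.
jams_demo = pd.read_csv(io.StringIO(raw), sep="\t", on_bad_lines="skip")
print(jams_demo)
```

Only a handful of rows out of well over a million were skipped in the run above, so dropping them should have little effect on a random sample.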
The jams TSV file contains data on different users' favorite songs from 2011 to 2015.
Create a pandas Series that displays the full set of user_ids in the dataset.
The Series output lists every user_id in the TSV file.
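One way to build such a Series is sketched below. It assumes that "the full set" means one entry per distinct user, which the notebook does not state explicitly. The `jams` frame here is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the jams data; the real values will differ.
jams = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "artist": ["Radiohead", "Bjork", "Beck", "Blur"],
    "title": ["Creep", "Hyperballad", "Loser", "Song 2"],
})

# unique() keeps one entry per distinct user, in order of first appearance.
user_ids = pd.Series(jams["user_id"].unique(), name="user_id")
print(user_ids)
```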
Working with every row in the dataset would take too long, so we use the sample function to select a random subset of users to work with.
In the output, the out-of-order index values show that the subset was drawn at random.
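A small sketch of the sampling step, assuming a Series of user ids like the one described above. The sample size and the `random_state` seed are illustrative choices, not values taken from the notebook.

```python
import pandas as pd

# Hypothetical user ids standing in for the full Series.
user_ids = pd.Series(["u1", "u2", "u3", "u4", "u5", "u6"], name="user_id")

# sample() draws without replacement by default; random_state makes the draw repeatable.
sampled_ids = user_ids.sample(n=3, random_state=42)
print(sampled_ids)
```

The sampled Series keeps the original index labels, which is why they appear out of order.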
Now we can create a DataFrame containing only the rows whose user_id is in our random sample.
The output shows each sampled user_id along with its related columns.
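Filtering rows by membership in the sample can be done with `isin`. This is a sketch under the assumption that the sampled ids are held in a Series; the frame and the names `jams`, `sampled_ids`, and `sample_jams` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the jams data.
jams = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "artist": ["Radiohead", "Bjork", "Beck", "Blur"],
    "title": ["Creep", "Hyperballad", "Loser", "Song 2"],
})
sampled_ids = pd.Series(["u3", "u1"], index=[2, 0])

# Keep every row whose user_id appears among the sampled ids.
sample_jams = jams[jams["user_id"].isin(sampled_ids)]
print(sample_jams)
```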
From here on we only need the user_id, artist, and title columns from the TSV file.
The output shows that the other columns have been dropped.
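Selecting the three needed columns by name is one straightforward way to do this. The extra columns in the sketch (`jam_id`, `link`) are hypothetical placeholders for whatever else the file contains.

```python
import pandas as pd

# Hypothetical frame with two extra columns that are not needed.
sample_jams = pd.DataFrame({
    "jam_id": ["j1", "j2"],
    "user_id": ["u1", "u3"],
    "artist": ["Radiohead", "Blur"],
    "title": ["Creep", "Song 2"],
    "link": ["a", "b"],
})

# A list of column names keeps only those columns; copy() makes an independent frame.
trimmed = sample_jams[["user_id", "artist", "title"]].copy()
print(trimmed)
```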
Next we create a new column called final_jam, which combines the song title and the artist for each row in our sample.
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
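The SettingWithCopyWarning above appears when a column is assigned on a frame that pandas considers a slice of another frame. One common way to avoid it, sketched here, is to take an explicit `.copy()` of the subset before adding the column. The `" - "` separator and all variable names are assumptions; the notebook does not show how the title and artist were joined.

```python
import pandas as pd

# Hypothetical stand-in for the jams data.
jams = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "artist": ["Radiohead", "Bjork", "Beck", "Blur"],
    "title": ["Creep", "Hyperballad", "Loser", "Song 2"],
})

# Select rows and columns in one .loc call, then copy so later writes
# go to an independent frame rather than a slice of jams.
keep = jams["user_id"].isin(["u1", "u3"])
trimmed = jams.loc[keep, ["user_id", "artist", "title"]].copy()

# String columns concatenate element-wise with +.
trimmed["final_jam"] = trimmed["title"] + " - " + trimmed["artist"]
print(trimmed)
```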
Now that the final_jam column contains both the song title and the artist, we no longer need either of those columns.
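Dropping the two source columns might look like the following sketch; the frame is a hypothetical example.

```python
import pandas as pd

# Hypothetical frame that already has the combined column.
trimmed = pd.DataFrame({
    "user_id": ["u1", "u3"],
    "artist": ["Radiohead", "Blur"],
    "title": ["Creep", "Song 2"],
    "final_jam": ["Creep - Radiohead", "Song 2 - Blur"],
})

# drop(columns=...) returns a new frame without the named columns.
final = trimmed.drop(columns=["artist", "title"])
print(final)
```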
Finally, we can export this DataFrame to a CSV file for analysis in class.
Read in the new CSV file to make sure everything looks right. The index was set to False on export, so the old row labels are not written and the sample is renumbered from 0 when read back.
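The effect of `index=False` can be checked with a quick round trip. This sketch writes to an in-memory buffer instead of a file on disk so that it runs anywhere; the frame and its out-of-order index are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample with the out-of-order index left over from sampling.
final = pd.DataFrame(
    {"user_id": ["u3", "u1"], "final_jam": ["Song 2 - Blur", "Creep - Radiohead"]},
    index=[3, 0],
)

# index=False leaves the old row labels out of the CSV.
buf = io.StringIO()
final.to_csv(buf, index=False)

# Reading the CSV back assigns a fresh index starting at 0.
buf.seek(0)
check = pd.read_csv(buf)
print(check)
```

Without `index=False`, the old labels would be written as an extra unnamed first column and would show up again on reading.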