What libraries can we import that we will need for later usage?
In this section, all necessary packages are imported into the notebook. Even though the Jupyter environment may already provide some of this functionality, importing the packages explicitly is best for consistency: it guarantees that the packages are present and, more importantly, that each one is bound to the abbreviation the coder prefers.
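As a minimal sketch, assuming pandas and NumPy are the packages in question (the notebook's actual list may be longer):

```python
# Standard aliases used throughout the rest of the walkthrough.
import numpy as np
import pandas as pd
```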
What data will be used for this project?
Here the given CSV file is imported into the project for later use. Because the file only lists data on every other line, the logic() function is used to skip the blank rows so that the resulting DataFrame contains relevant data in every row. Additionally, for performance and practicality, a sample of 1,000 rows is taken from the data. This yields a compact, usable DataFrame from a massive dataset that otherwise would not work with the methods used later on.
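A sketch of that loading step, with an assumed file name (playlists.csv), an assumed parity for the blank lines, and a sample size matching the description:

```python
import pandas as pd

def logic(row_index):
    # Skip every other line; this assumes the blank rows fall on the
    # odd-numbered lines (the notebook's actual parity may differ).
    return row_index % 2 == 1

raw = pd.read_csv("playlists.csv", skiprows=logic)

# Sample 1,000 rows so the later matrix operations stay tractable.
playlists = raw.sample(n=1000, random_state=0)
```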
Are all columns named correctly?
The given dataset also did not have well-formatted column headers, so that had to be fixed. Each song-URI column in a playlist needed to be numbered. This was done by creating a list from 1 to 500 and assigning each number to a column, which enumerates and orders the songs within the playlist: the first song in the playlist is listed under column 1, the second song under column 2, and so on.
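A rough sketch of the renaming, assuming the first column holds the playlist id (pid) and the remaining 500 columns hold song URIs:

```python
# Number the song columns 1 through 500 so column 1 holds the first song,
# column 2 the second, and so on.
song_columns = list(range(1, 501))
playlists.columns = ["pid"] + song_columns
```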
Are all URIs isolated from their labelling?
Another issue with the dataset was that each URI was recorded with the prefix " spotify:track:". To analyze the data effectively, the URIs needed to be isolated so they could be compared independently of that prefix. This was done by iterating over the dataset's columns and taking a substring of each column's contents that excludes the characters occupied by the prefix, effectively discarding the common prefix before each URI.
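A sketch of the prefix removal, reusing the column names assumed above:

```python
# Drop the leading " spotify:track:" from every song column, keeping only
# the bare URI.
prefix = " spotify:track:"
for col in song_columns:
    playlists[col] = playlists[col].str[len(prefix):]
```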
How can the data be converted to long format?
To start off, the already condensed dataset was shrunk further in order to verify that the shift from wide to long format works as intended. With 500 songs per playlist and 1,000 playlists in the condensed set, it would be difficult to verifiably observe any material change in the data. By cutting the data down to just three playlists, it is easy to confirm that the DataFrame melting process worked and that the result matches the image shown at the outset of this coding block. Long-form data is the better option here because it makes it far easier to analyze how often a song occurs: the song column can simply be filtered by the frequency of a song's URI while keeping track of the playlist ID. The DataFrame was converted by constructing a list of column names, specifying which group of columns should be condensed, and using the melt() function to condense them.
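A minimal sketch of the melt on a three-playlist slice; apart from song_id, the variable and column names here are illustrative:

```python
# Work on just three playlists so the reshaping is easy to verify by eye.
small = playlists.head(3)

long_form = small.melt(
    id_vars=["pid"],           # keep the playlist id on every row
    value_vars=song_columns,   # the numbered song columns to unpivot
    var_name="position",       # the song's slot within the playlist
    value_name="song_id",      # the bare Spotify URI
)
```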
How can song presence be evaluated within this dataset?
Using the melted dataset, the entries in the 'song_id' column are evaluated based upon their occurrence. An edge column labelled 'Connected' is appended to the melted dataset, and the pivot() call that follows assigns a 1 wherever a song URI from the 'song_id' column is present within a given playlist. Wherever it is not, the fillna() call immediately following the pivot() assigns a 0. This way, every URI in the dataset is checked against every playlist, and the corresponding cell is marked when the song is present in that playlist.
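A sketch of that pivot, using the 'Connected' and 'song_id' names from the description:

```python
# Mark every (playlist, song) pair that actually occurs, then pivot so each
# playlist becomes a row and each song URI a column; gaps become 0.
long_form["Connected"] = 1

presence = (
    long_form.pivot(index="pid", columns="song_id", values="Connected")
             .fillna(0)
)
```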
Does this transformation map to the original dataset?
After the concept is proven on the far smaller dataset, the transformation to demonstrate song presence within the dataset is mapped to the original dataset of 1000 playlists. The code here is identical to the two previous code blocks, but the variables were altered in order to accommodate the original data set. As the preview demonstrates, the results align with the expected outcomes and a boolean matrix to assess the frequency of song appearance has been created.
Does the data account for outliers?
The next step after creating the boolean matrix of song presence is to normalize the data with NumPy. This process equalizes the length of each vector (i.e. each row), which keeps outliers with disproportionately uncommon taste from dominating the analysis. The norm of each row is computed, and the preference matrix is divided by these norms to produce a normalized dataset.
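A sketch of the normalization, dividing each row of the presence matrix by its own length:

```python
import numpy as np

# Length (L2 norm) of each playlist row; dividing by it puts every playlist
# on the same scale so unusually long or idiosyncratic playlists do not
# dominate the later similarity calculations.
row_norms = np.linalg.norm(presence.values, axis=1, keepdims=True)

normalized = pd.DataFrame(
    presence.values / row_norms,
    index=presence.index,
    columns=presence.columns,
)
```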
How can the matrix be simplified for calculation?
The first line of code in this section does a great deal of work: it decomposes the newly normalized dataset into three factors, the most important of which is the array of singular values, denoted here as sigma (Σ). The sigma array then allows the rho (ρ) function to calculate the error introduced by dropping a given number of singular values. A sample value of 211 was entered here for convenience.
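A sketch of the decomposition and the error estimate; the rho() helper here is an assumed illustration of the idea rather than the notebook's exact definition:

```python
# Decompose the normalized matrix into U, sigma, and V (transposed).
U, sigma, Vt = np.linalg.svd(normalized.values, full_matrices=False)

def rho(num_dropped):
    # Fraction of the matrix "energy" lost by discarding the smallest
    # num_dropped singular values.
    kept = sigma[: len(sigma) - num_dropped]
    return np.sqrt(1.0 - (kept ** 2).sum() / (sigma ** 2).sum())

print(rho(211))  # sample value from the write-up
```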
How do we get the new "blurred" data set?
In order to reconstruct the new dataset based upon the truncation determined in the previous coding block, the sigma array must be copied and the least relevant values discarded. Given that 211 was the deduced number for elimination, the last 211 entries (that is, the 211 smallest singular values) are discarded from the copied sigma array. Then, a matrix is constructed in which the discarded positions are assigned a value of 0. From there, the approximated sigma matrix is combined with the U and V matrices to construct an approximated preferences dataset.
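Continuing the sketch, the reconstruction might look like this:

```python
# Zero out the 211 smallest singular values, then rebuild the preference
# matrix from the truncated factors.
sigma_approx = sigma.copy()
sigma_approx[-211:] = 0

approx_preferences = U @ np.diag(sigma_approx) @ Vt
```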
How could an inflow of data be interpreted using these values?
Using the newly constructed preferences dataset, a few example song URLs were used to demonstrate what this tool offers: a playlist of a certain length built from tracks that the user inputs. These URLs were parsed so that each song URI could be extracted, and those URIs were located within the dataset via a search vector. The vector assigns each input URI a 1, while every other URI is assigned a 0. Once this series marking URI presence was built, the input series was normalized to account for potential outliers in the data.
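A hedged sketch of the input step; the URL below is a placeholder, and the parsing assumes the track id is the final path segment of an open.spotify.com link:

```python
# Placeholder input; real URLs would come from the user.
input_urls = [
    "https://open.spotify.com/track/PLACEHOLDER_TRACK_ID",
]

# Extract the bare URI (the last path segment, minus any query string).
input_uris = [url.rsplit("/", 1)[-1].split("?")[0] for url in input_urls]

# 1 for every input URI that exists in the data, 0 everywhere else.
search_vector = pd.Series(0.0, index=presence.columns)
for uri in input_uris:
    if uri in search_vector.index:
        search_vector[uri] = 1.0

# Normalize the input the same way the playlist rows were normalized.
norm = np.linalg.norm(search_vector.values)
if norm > 0:
    search_vector = search_vector / norm
```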
Note:
The Spotify song URLs are static and were chosen arbitrarily from the pids and song URIs present in the sample data. Unfortunately, because of the sampling and data distillation in the second code section, selecting random index values was not an option, as there was no guarantee that a given index value would exist within the sampled data. Thus, URLs were input manually to achieve similar functionality within the parameters of the project.
How does this input data match with the preferred matrix?
Though the input data has been constructed and normalized, this input preference still needs to be contextualized against the overall data. This is best achieved by applying the approximated preferences matrix to the input vector, which yields an array rating each playlist's similarity to the input data.
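A one-line sketch of that projection:

```python
# Multiply the approximated preference matrix by the search vector; each
# entry scores one playlist's similarity to the input.
similarity_scores = approx_preferences @ search_vector.values
```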
Which playlists are most similar to the input data?
The ensuing code takes the similarity vector, places it into a pandas Series, and reshapes the data into a long format for ease of reading. In essence, this code visually shows the cutoff needed to determine which playlists are similar to the input data and which playlists can be disregarded.
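A sketch of that step; sorting is used here simply to make the cutoff easy to see:

```python
# Index the scores by playlist id and sort them so the similarity cutoff
# is easy to spot when reading the output.
similarities = pd.Series(similarity_scores, index=presence.index)
print(similarities.sort_values(ascending=False).head(20))
```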
Which songs from similar playlists are similar to the input?
This code builds on the last section: it filters the similarities series for all "relevant" playlists (in this case, any playlist with more than 30% similarity). It then gathers those playlists' preferences from the normalized dataset, compiles the most frequent songs among them, and displays the top 500 songs for the creation of a playlist based upon the user's input.
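A sketch of the final step under the names assumed above, with the 30% cutoff from the text:

```python
# Playlists whose similarity exceeds the 30% cutoff.
relevant = similarities[similarities > 0.30].index

# Sum the normalized preferences of those playlists and keep the 500 songs
# that score highest across them.
song_scores = normalized.loc[relevant].sum(axis=0)
recommended = song_scores.sort_values(ascending=False).head(500)
print(recommended)
```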