If you plan to re-run cells, please pip install (or otherwise install) the necessary packages locally:
The cell below shows an example:
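A minimal sketch of what that setup cell might look like; the package list is an assumption based on the libraries used later in this notebook (the Spotify/Genius client libraries in particular are guesses):

```python
# Package names are assumptions based on the libraries used in this notebook.
!pip install pandas seaborn nltk vaderSentiment spotipy lyricsgenius
```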
Christopher Edwin Breaux, better known by his moniker Frank Ocean, is an award-winning American singer-songwriter who has released multiple projects, with Blonde being his most recent release. He has advanced the genre of R&B with his unconventional production and his experimental approach of infusing genres like jazz, funk, soul, psychedelic rock, and even hip-hop into his music.
I find it fascinating that we have access to an artist's whole discography -- it lets us fans compare songs across different phases of the artist's career. I want to capture whether there are recognizable patterns in Frank Ocean's discography, namely, by analyzing his songwriting under a linguistic lens. By running sentiment analysis on his lyrics, I hope to confirm my prior bias that his albums get more negative over time (with Nostalgia, Ultra. being his "most positive" album and, with time, Blonde being his "most negative"). And if possible, I want to compare my findings with the attributes that Spotify provides (which include musical valence) to see if the instrumentation matches the lyrics.
Since my time is rather limited and I am P/NP'ing this class, I won't be able to do as much as I want for this project. If it seems rather sparse, I plan to continue the project in my free time so I can share it online.
The following is the custom script to extract the data from both Spotify's and Genius's APIs. I wanted to make it as general as possible so I can create a dataset for any artist I want (for future analysis, and potentially for others to use). It first gathers what is necessary for the Spotify attributes and exports it into a JSON file -- which is then read into a pandas dataframe. The lyrics from Genius are much easier to get, as we just export everything into a dataframe through a .csv file that is read much later.
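As a rough sketch of how the exported files are read back in (the JSON file name here is an assumption; the actual extraction script is in the cell below):

```python
import json
import pandas as pd

# The extraction script dumps the Spotify attributes to a JSON file
# (file name is an assumption), which we flatten into a dataframe
# and save as songs.csv for the analysis below.
with open("spotify_attributes.json") as f:
    spotify_data = json.load(f)
songs = pd.json_normalize(spotify_data)
songs.to_csv("songs.csv", index=False)

# The Genius lyrics are written straight to a .csv that is read much later.
lyrics = pd.read_csv("lyrics.csv")
```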
songs.csv is a .csv file that contains multiple attributes that Spotify provides for each song uploaded to their platform. More information about how these audio features are recorded by Spotify can be found here: https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis/
What is interesting, though, is that some of these features have parallels in linguistics; namely, speechiness and loudness can be attributed to how the singer sings the song. Given that I do not really have much of a linguistics background, I plan to focus on valence, which describes the musical positiveness conveyed by a track. As noted by Spotify's API:
Tracks with high valence sound more positive (happy, cheerful, euphoric), while tracks with low valence sound more negative (sad, depressed, angry).
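For example, once songs.csv is loaded we can peek at the valence column (the column names here are assumed to mirror Spotify's audio-feature field names):

```python
import pandas as pd

songs = pd.read_csv("songs.csv")
# Track name and valence per song (column names are assumptions).
print(songs[["name", "valence"]].head())
```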
Note that we have 421 songs, which is quite a lot -- many of them are features or songs merely attributed to Frank Ocean. We plan to prune these data points so we can focus only on the albums Frank has produced himself.
We can see this is true by looking at what albums are included:
As we see below, there is a lot of data cleaning that needs to take place. Some album names are missing, and a fair amount of regex must be used to remove the whitespace/escape characters. Additionally, Genius includes segment blocks to indicate the verses and choruses, so removing that unnecessary markup is mandatory for doing sentiment analysis. Lastly, some songs are instrumental, so we want to manually remove those data points as well.
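As a rough illustration of that first cleanup pass (the 'album' and 'lyrics' column names are assumptions about the Genius export):

```python
import pandas as pd

lyrics = pd.read_csv("lyrics.csv")

# Drop rows with missing album names (column names here are assumptions).
lyrics = lyrics.dropna(subset=["album"])

# Collapse escape characters/newlines and repeated whitespace in the raw lyrics.
lyrics["lyrics"] = lyrics["lyrics"].str.replace(r"\s+", " ", regex=True).str.strip()
```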
Let's first restrict the songs to the specific albums and singles found below:
Unfortunately, nostalgia, ULTRA. is not on Spotify, so only its singles, Swim Good and Novacane, will be analyzed.
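A sketch of this filtering step, continuing from the songs dataframe loaded earlier (the album names and column names here are assumptions; the actual lists are in the cell below):

```python
# Albums and singles to keep. nostalgia, ULTRA. is not on Spotify,
# so only its singles Swim Good and Novacane are included.
keep_albums = ["channel ORANGE", "Blonde"]          # assumed album list
keep_singles = ["Swim Good", "Novacane"]

songs = songs[songs["album"].isin(keep_albums) | songs["name"].isin(keep_singles)]
```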
Now there are only 46 songs to analyze:
There's redundancy in some singles that we have to remove (Novacane Edited and Nonexplicit RAF):
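Roughly, that removal could look like this (the exact track-name strings in the data may differ from the ones noted above):

```python
# Drop the redundant single versions (track-name strings are assumptions).
songs = songs[~songs["name"].isin(["Novacane Edited", "Nonexplicit RAF"])]
```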
Now let's do the same data cleaning to filter lyrics.csv down to the albums/singles we want:
The most complicated part is filtering out which information we actually need. The following describes what the block below does:
- nltk.download('stopwords') is run so you have access to the list of stopwords used. We want to count how many stopwords are removed to see how many words are cut from the original lyrics.
- The regex (Verse [0-9]|Pre-Chorus [0-9]|Hook [0-9]|Chorus|Outro|Verse|Refrain|Hook|Bridge|Intro|Instrumental|Skit) is used to match Genius's segment blocks so they can be stripped before sentiment analysis. A rough sketch of this step follows the list.
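A minimal sketch of that cleaning step, assuming the lyrics live in a 'lyrics' column and that Genius wraps its section headers in square brackets (both assumptions):

```python
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Genius section headers such as [Verse 1] or [Chorus]; the bracket wrapping
# is an assumption about how the headers appear in the scraped lyrics.
segment_re = re.compile(
    r"\[(Verse [0-9]|Pre-Chorus [0-9]|Hook [0-9]|Chorus|Outro|Verse|"
    r"Refrain|Hook|Bridge|Intro|Instrumental|Skit)[^\]]*\]"
)

def clean_lyrics(text):
    """Strip segment headers and stopwords; return (cleaned text, # stopwords removed)."""
    text = segment_re.sub(" ", text)
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in stop_words]
    return " ".join(kept), len(tokens) - len(kept)

lyrics = pd.read_csv("lyrics.csv")
lyrics["clean_lyrics"], lyrics["stopwords_removed"] = zip(
    *lyrics["lyrics"].apply(clean_lyrics)
)
```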
As expected, longer songs tend to have more stopwords, while songs with just instrumentals were not altered much.
I was told to cite this paper if I plan to use vaderSentiment, a lexicon- and rule-based sentiment analysis tool that is mainly used for sentiments expressed in social media. I thought that using VADER was appropriate since a lot of Frank's lyrics contain slang that can be found on social media and that is not captured by many other lexicon systems.
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
We plan to apply the given functions to calculate polarity scores for the positive, neutral, and negative sentiments of the lyrics -- rather than the single final sentiment score that most lexicon systems use. VADER also provides a compound score, computed by "summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive)."
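For reference, this is the basic vaderSentiment call that returns those four scores (the example sentence is just a placeholder, not a lyric from the dataset):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# polarity_scores returns the pos/neu/neg proportions plus the compound score.
scores = analyzer.polarity_scores("The sun is shining and I feel great")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```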
The normalization used for the compound score is found below:
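To the best of my understanding, this is the squashing function used in the vaderSentiment source (with alpha = 15); a minimal sketch:

```python
import math

def normalize(score, alpha=15):
    # Maps the summed word valences into (-1, 1); alpha approximates the
    # maximum expected value. This mirrors the helper in the vaderSentiment source.
    return score / math.sqrt(score * score + alpha)
```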
We calculate valenceDiff -- which is just the Spotify valence minus the compound score calculated by VADER.
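A toy illustration of that subtraction, with made-up numbers and assumed column names:

```python
import pandas as pd

# Made-up values purely to illustrate valenceDiff = valence - compound.
df = pd.DataFrame({
    "name": ["Song A", "Song B"],
    "valence": [0.62, 0.18],      # Spotify audio-feature valence
    "compound": [-0.40, 0.35],    # VADER compound score of the lyrics
})
df["valenceDiff"] = df["valence"] - df["compound"]
print(df)
```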
Let's answer some questions:
Some things to note: the scores are highly biased for songs with few lyrics (like Fertilizer, as shown below).
The following describes why VADER does not work well with lyrics (this will be discussed further in the Conclusion).
This is what happens when we "melt" a dataframe, which is used for plotting with seaborn's pointplot:
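A sketch of the melt-then-plot step, using a hypothetical wide-format dataframe with made-up values and assumed column names:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical wide-format data: one row per song with its sentiment scores.
wide = pd.DataFrame({
    "album": ["Blonde", "Blonde", "channel ORANGE"],
    "pos": [0.10, 0.20, 0.25],
    "neg": [0.30, 0.15, 0.05],
    "compound": [-0.6, 0.1, 0.7],
})

# melt turns the score columns into (album, score_type, score) rows,
# which is the long format that seaborn's pointplot expects.
long = wide.melt(id_vars="album", var_name="score_type", value_name="score")

sns.pointplot(data=long, x="album", y="score", hue="score_type")
plt.show()
```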