** PLEASE NOTE **
If you plan to re-run cells, please install the necessary packages locally (with pip or by other means):
- spotipy, nltk, textblob, vaderSentiment, lyricsgenius
The cell below shows an example:
Who is Frank Ocean?
Christopher Edwin Breaux, better known by his moniker Frank Ocean, is an award-winning American singer-songwriter who has released multiple projects, with Blonde being his most recent release. He has advanced the genre of R&B with his unconventional production and his experimental approach of infusing multiple genres like jazz, funk, soul, psychedelic rock, and even hip-hop into his music.
What is this project about?
I find it fascinating that we have access to an artist's whole discography -- it lets us fans compare songs across different phases of the artist's career. I want to see whether there are recognizable patterns in Frank Ocean's discography, namely by analyzing his songwriting under a linguistic lens. By running sentiment analysis on his lyrics, I hope to confirm my prior bias that his albums get more negative over time (with Nostalgia, Ultra. being his "most positive" album and Blonde being his "most negative"). And if possible, I want to compare my findings with the attributes that Spotify provides (which include musical valence) to see if the instrumentation matches the lyrics.
Since my time is rather limited and I am taking this class P/NP, I won't be able to do as much as I would like for this project. If it seems rather sparse, I plan to continue it in my free time so I can share it online.
The following is the custom script to extract the data from both the Spotify and Genius APIs. I wanted to make it as general as possible so I can create a dataset for any artist I want (for future analysis, and potentially for others to use). It first gathers what is necessary for the Spotify attributes and exports it into a json file, which is then read into a pandas dataframe. The lyrics from Genius are a lot easier to get, as we just export everything through a .csv file that is read into a dataframe much later.
songs.csv is a .csv file that contains multiple attributes that Spotify provides for each song uploaded to their platform. More information about how these audio features are recorded by Spotify can be found here:
What is interesting, though, is that some of these features have counterparts in linguistics; for example, loudness can be attributed to how the singer sings the song. Given that I do not really have much of a linguistics background, I plan to focus on valence, which describes the musical positiveness conveyed by a track. As noted by Spotify's API:
Tracks with high valence sound more positive (happy, cheerful, euphoric), while tracks with low valence sound more negative (sad, depressed, angry).
Note that we have 421 songs, which is quite a lot -- many of them are features or songs merely attributed to Frank Ocean. We plan to prune these data points so we can focus only on the albums Frank has produced himself.
We can see this is true by looking at what albums are included:
As we see below, a lot of data cleaning needs to take place. Some album names are missing, and a lot of regex must be used to remove the whitespace/escape characters. Additionally, Genius includes segment labels to indicate verses and choruses, so removing this unnecessary material is mandatory for sentiment analysis. Lastly, some songs are instrumental, so we want to manually remove those points as well.
Now onto some Data Cleaning:
Let's first restrict the songs to the specified albums and singles found below:
nostalgia, ULTRA. is not on Spotify, so only its singles, Swim Good and Novacane, will be analyzed.
Now there are only 46 songs to analyze:
There's redundancy in some singles that we have to remove (Novacane Edited and Nonexplicit RAF):
Now let's do the same data cleaning to filter lyrics.csv to the albums/singles we want:
The most complicated part is filtering down to the information we actually need. The following describes what the block below does:
- Make the dataframe more readable by dropping album name as a column (as we are merging with songs later by Title).
- Remove stopwords from lyrics; we are using nltk (please make sure nltk.download('stopwords') has been run so you have access to the list of stopwords used). We also count how many stopwords are removed to see how many words are removed from the original lyrics.
- Remove unnecessary segments:
(Verse [0-9]|Pre-Chorus [0-9]|Hook [0-9]|Chorus|Outro|Verse|Refrain|Hook|Bridge|Intro|Instrumental|Skit)
- Remove punctuation from the words.
- Remove the Japanese lyrics from one of the songs in Blonde.
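The cleaning steps listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the notebook's actual block: the name `clean_lyrics` is hypothetical, the small `STOPWORDS` set stands in for nltk's English list, the section labels are assumed to appear in square brackets as Genius usually writes them, and the steps run in a slightly different order than listed (labels and punctuation are stripped before stopwords are counted).

```python
import re
import string

# Tiny stand-in for nltk.corpus.stopwords.words("english") (assumption),
# so the sketch runs without downloading the NLTK corpus.
STOPWORDS = {"i", "me", "my", "you", "the", "a", "an", "and", "to", "of", "in", "is", "it", "that"}

# Section labels from the list above, assumed to be wrapped in square brackets,
# optionally followed by extra text such as ": Frank Ocean".
SECTION_TAG = re.compile(
    r"\[(?:Verse [0-9]|Pre-Chorus [0-9]|Hook [0-9]|Chorus|Outro|Verse|Refrain"
    r"|Hook|Bridge|Intro|Instrumental|Skit)[^\]]*\]"
)

# Hiragana, katakana, and common CJK ideograph ranges.
JAPANESE = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]+")

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_lyrics(raw):
    """Return (cleaned_text, number_of_stopwords_removed)."""
    text = SECTION_TAG.sub(" ", raw)            # drop [Verse 1], [Chorus], ...
    text = JAPANESE.sub(" ", text)              # drop Japanese characters
    text = text.translate(PUNCT_TABLE).lower()  # drop punctuation, normalize case
    words = text.split()
    kept = [w for w in words if w not in STOPWORDS]
    return " ".join(kept), len(words) - len(kept)


cleaned, removed = clean_lyrics("[Chorus]\nI swim to the moon!")
```

Note that stripping punctuation also strips apostrophes, so contractions such as "don't" become "dont" and may no longer match entries in a stopword list.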
As expected, longer songs tend to have more stopwords, while songs with just instrumentals were not altered much.
I was told to cite this paper if I plan to use vaderSentiment, a lexicon- and rule-based sentiment analysis tool that is mainly used for sentiments expressed in social media. I thought that using VADER was appropriate since a lot of Frank's lyrics contain lingo found on social media that is not captured by many other lexicon systems.
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
We plan to apply the given functions to calculate polarity scores for the positive, neutral, and negative sentiments of the lyrics -- rather than only a final sentiment score, as most lexicon systems use. VADER then provides a compound score, computed by "summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive)."
The normalizing factor is found below for the compound score:
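As far as I understand the vaderSentiment reference implementation, the summed valence x is normalized as x / sqrt(x^2 + alpha). A minimal sketch of that formula follows, treating the default alpha = 15 as an assumption to check against the installed version:

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded summed valence score into the interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)
```

Because of the constant alpha, a song with only a few scored words stays close to 0, while long lyrics with many same-signed words are pushed toward -1 or +1.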
We calculate valenceDiff, which is just the Spotify valence minus the compound score calculated by VADER.
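A small sketch of that subtraction, using made-up numbers and plain dictionaries instead of the notebook's dataframe (the keys `valence` and `vaderCombined` and the helper `valence_diff` are assumed names for illustration):

```python
def valence_diff(spotify_valence, vader_compound):
    """Spotify valence (0 to 1) minus VADER compound score (-1 to 1)."""
    return spotify_valence - vader_compound


# Hypothetical rows; the numbers are invented for illustration only.
songs = [
    {"title": "Song A", "valence": 0.40, "vaderCombined": 0.90},
    {"title": "Song B", "valence": 0.25, "vaderCombined": -0.50},
]

for song in songs:
    song["valenceDiff"] = valence_diff(song["valence"], song["vaderCombined"])
```

Since Spotify reports valence on a 0 to 1 scale and the VADER compound score lies between -1 and 1, valenceDiff can range from -1 to 2, and a value of 0 only means the two raw numbers happen to coincide.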
Let's answer some questions:
- Most Positive Song (highest vaderPos)
- Most Negative Song (highest vaderNeg)
- Most Positive Song (highest vaderCombined)
- Most Negative Song (lowest vaderCombined)
One thing to note: the results are highly biased toward songs with few lyrics (like Fertilizer, as shown below).
The following illustrates why VADER does not work well with lyrics (this is discussed later in the Conclusion):
This is what happens when we "melt" a dataframe, which is used for plotting through seaborn's pointplot:
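For readers without the output at hand, here is a minimal sketch of melting with pandas on a made-up table. The column names (`album`, `title`, `valence`, `vaderCombined`, `valenceType`, `score`) are assumptions for illustration, not necessarily the ones used in this notebook.

```python
import pandas as pd

# Hypothetical wide table: one row per song, one column per valence measure.
wide = pd.DataFrame({
    "album": ["Album X", "Album X", "Album Y"],
    "title": ["Song A", "Song B", "Song C"],
    "valence": [0.40, 0.25, 0.60],
    "vaderCombined": [0.90, -0.50, 0.10],
})

# Long format: one row per (song, valence measure) pair.
long = pd.melt(
    wide,
    id_vars=["album", "title"],
    value_vars=["valence", "vaderCombined"],
    var_name="valenceType",
    value_name="score",
)
```

The long table has one row per song and measure, so it can be passed to seaborn as, for example, `sns.pointplot(data=long, x="album", y="score", hue="valenceType")` to draw one line per valence type.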
Let's take the sum of these valenceTypes to see which album is technically more "positive" both lyrically and musically.
In summary, my hypothesis that Frank Ocean's discography became lyrically more negative over time holds up (the difference between Blonde and Channel Orange is found above). However, musically, Spotify describes Blonde as 'happier' than Channel Orange. The opposite is true for the singles that were released, as shown by figure 3 above.
F test or T-test(s) on VADER sentiments or Spotify attributes?
I thought about doing an F test or t-tests on songs within an album to see how much they differ based on the musical attributes provided by Spotify, but it doesn't seem closely related to the current analysis.
I thought about doing a t-test on the computed valences by VADER, but it does not seem applicable to this case since each song is different from each other. I don't mind losing points for not including everything from the rubric.
Any tips or recommendations on what statistical analysis I can do would help me a lot!
I thought about building a classification model that uses the valences and Spotify attributes as features to classify any song as positive or negative (but it would be heavily biased toward Frank's discography and might not generalize well).
Sentiment analysis using VADER does a poor job of understanding the nuance of songwriting. The context is lost when we remove stop words and analyze each word individually for its sentiment (rather than analyzing the sentiment of words together with their surrounding phrases). We can see this when it scores "Super Rich Kids" as the most positive song in his discography, while an analysis found on Genius depicts a rather negative theme:
The song elaborates on how a life full of material worth can never fulfill someone like love and happiness can. In the final verse, the character Frank is embodying falls (or jumps) from the roof he started his day on.
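The loss of context from removing stop words can be illustrated with a toy word-level scorer. Everything here is hypothetical: the lexicon scores are invented, the one-word sign flip is only a crude stand-in for VADER's negation rule, and the stopword set is a tiny sample (NLTK's English list does include "not", which is what makes this matter).

```python
# Toy lexicon with made-up scores; this is not VADER's actual lexicon.
LEXICON = {"happy": 2.0, "love": 3.0, "lonely": -2.0}

# Tiny hypothetical stopword list that, like NLTK's, contains a negation.
STOPWORDS = {"i", "am", "not", "the"}


def word_level_score(text, drop_stopwords):
    """Sum per-word scores, flipping the sign of the word right after 'not'."""
    words = text.lower().split()
    if drop_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    total = 0.0
    negate = False
    for w in words:
        if w == "not":
            negate = True  # the next word gets its sign flipped
            continue
        score = LEXICON.get(w, 0.0)
        total += -score if negate else score
        negate = False
    return total


with_context = word_level_score("I am not happy", drop_stopwords=False)
without_context = word_level_score("I am not happy", drop_stopwords=True)
```

With the negation kept, the line scores as negative; once "not" is removed as a stop word, the same line scores as positive.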
As shown by all three figures, none of the valenceDiff values were 0, meaning that the instrumentation described by Spotify does not match the VADER valence computed from the lyrics. I think that by choosing Frank Ocean, this conclusion was to be expected; his production is seen as experimental, and thus does not follow what's "normal" for songwriting.
In the future, I hope to find a better lexicon that can rate phrases rather than individual words so the context is maintained when analyzing lyrics. Additionally, I hope I can learn more NLP methods so I can apply them to any artist's discography (luckily the script I wrote generalizes to any artist).