Shazam for ES 156 (Spring 2021).
Acknowledgments: several of the provided functions are taken from https://github.com/worldveil/dejavudataset; all credit goes to that repo. We also acknowledge Prof. Paul Cuff (formerly at Princeton), whose lab is the basis for this lab's structure.
How does Shazam work?
Shazam is a popular app for recognizing a song from a short audio sample. Shazam started its operations in the year 2000, well before smartphones came out. Back then, you would have called a number on your (dumb) mobile phone, and song identification would be done directly from the phone's microphone. The name of the song would then be texted back to you.
The goal of this programming exercise is for you to implement a simplified version of Shazam. The algorithm behind Shazam is essentially a sequence of Fourier transforms taken over different time windows to form a "spectrogram" of an audio snippet. Only the largest values of the spectrogram are kept. This is done to speed up the matching with a song in a database and to provide robustness against noise. A special type of hashing algorithm is then used to match the spectrogram peaks of the audio recorded with your phone against examples in a database.
Before starting the programming exercise, give this paper a quick look: https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
This is the original "Shazam" paper, where the full algorithm is described in significant detail. We won't implement the whole thing (in particular, we will skip the hashing part), but hopefully by the end of the notebook you will have a solid idea of how Shazam works.
There are a lot of moving parts in this code, but don't be intimidated! If you end up getting stuck or hung up on something, let the teaching staff know. As usual, we are excited to help you. We also appreciate your feedback and suggestions. Finally, if you find a way of improving the algorithm below, please let us know!
Preliminaries: Importing packages and data cleanup
First we start by importing a few packages we will use. You may have to install some of these packages in your anaconda python setup. Please contact the teaching staff if you run into issues.
We now define a few parameters that will be used throughout the lab. These values define different aspects of how the (real-valued) audio signal will be sampled and transformed into a discrete-time signal. The main components are the number of audio channels (2 -- left and right), the sampling rate (44.1 kHz), the downsampling factor, and the window size of the FFT (recall that the FFT is just an algorithm that implements the DFT). These values are fairly standard in audio signal processing.
Parameters to pay attention to:
Fan Value: The way Shazam's hashing function works is by hashing pairs of peaks in the audio signal. This means the song's fingerprints are determined not just by individual peaks, but also by the differences between peak pairs. (More on this in Part 6.) The fan value is the maximum number of pairs each peak can be part of; it therefore limits the number of pairs (and hence fingerprints) we have for the signal.
Overlap Ratio: When we take the spectrogram of our signal, we want there to be an overlap between our blocks of FFTs. The overlap ratio is the ratio of the number of samples that overlap to the number of samples in each FFT block. In other words, # of overlapping samples = # of samples per block * overlap ratio
Tuning these values will (likely) change the system's performance. You can play around with the parameters but, for your first pass, leave the values below unchanged.
Be aware that there is often a tradeoff between system accuracy and system efficiency if/when you attempt to tune these values.
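For concreteness, a parameter cell along these lines is what the lab has in mind. The specific numbers below are illustrative defaults, not necessarily the values provided in your notebook:

```python
# Illustrative global parameters -- check your notebook's actual values.
CHANNELS = 2            # stereo: left and right
SAMPLE_RATE = 44100     # Hz; standard CD-quality sampling rate
DOWNSAMPLE_FACTOR = 4   # keep every 4th sample after combining channels
WINDOW_SIZE = 4096      # samples per FFT block
OVERLAP_RATIO = 0.5     # fraction of each block shared with the next
FAN_VALUE = 15          # max number of pairs each peak can be part of

# Per the formula above: # of overlapping samples = # per block * overlap ratio
NOVERLAP = int(WINDOW_SIZE * OVERLAP_RATIO)
```

With these values, adjacent FFT blocks share 2048 samples.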
Part 1: Data import and cleanup
Next, we will construct an initial library of songs by doing the following: 1. Open each song in our raw mp3s folder; 2. Extract the data from each of the 2 channels.
The recordings are all sampled at 44.1 KHz. (We will downsample in the next section)
SongDb will store the audio data, so SongDb[s] returns 2 arrays that contain the data for the first and second channel (i.e., the left and right channel) of the song
Part 2: Preprocessing the Signals
Before we create fingerprints for the songs, we must preprocess the data as follows:
- Combine the two data channels by:
- Taking their mean
- Subtracting the mean to eliminate the cluster of peaks at frequency f = 0
- Downsampling the signal so that we don't have an excess of data
The function Preprocess() should take as input a song's channels, and return the processed signal x above.
ProcessedDb will store the processed audio data. So
ProcessedDb[s] returns a single array of the combined, processed data for the song
We do the pre-processing for you, so you can focus on the signal processing aspects of the problem. Nevertheless, please make sure you understand the code below.
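The three preprocessing steps above can be sketched as follows. This is not the provided Preprocess() function, just a minimal illustration of the same idea (the function name and default downsampling factor here are assumptions):

```python
import numpy as np

def preprocess(channel1, channel2, downsample_factor=4):
    # Average the two channels into a single mono signal.
    x = (np.asarray(channel1, dtype=float) + np.asarray(channel2, dtype=float)) / 2.0
    # Remove the DC component responsible for the peak cluster at f = 0.
    x = x - x.mean()
    # Naive downsampling by decimation: keep every downsample_factor-th sample.
    return x[::downsample_factor]
```

Note that a production pipeline would low-pass filter before decimating to avoid aliasing; the provided code may differ in such details.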
Part 3: construct spectrograms
- Now we want to construct the spectrogram of this signal. A (very) useful tool for this is the matplotlib.mlab spectrogram function. We want to specify the following parameters:
- noverlap: the number of samples that will overlap between adjacent chunks
- nfft: the length of the FFT you would like to take (can be the same as window)
- fs: the sampling rate of the signal
You can get the full documentation for the matplotlib.mlab spectrogram function here: https://matplotlib.org/api/mlab_api.html#matplotlib.mlab.specgram
- After getting the spectrogram:
- it is a good idea to use the log of its magnitudes (scaled by a constant factor of ~10) instead of just the raw magnitudes
- it is also a good idea to set values returned as ±∞ (from divide-by-zeros) to 0
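One possible sketch of these steps (not the official solution -- try Problem 1 yourself first; the default fs, nfft, and noverlap values below are illustrative, not the lab's globals):

```python
import numpy as np
from matplotlib import mlab

def get_spectrogram(signal, fs=11025, nfft=4096, noverlap=2048):
    # mlab.specgram returns (spectrum, frequencies, times).
    spectrum, freqs, times = mlab.specgram(
        signal, NFFT=nfft, Fs=fs, noverlap=noverlap)
    # Log magnitudes scaled by a factor of 10; log(0) yields -inf,
    # so suppress the warning and zero those entries out afterwards.
    with np.errstate(divide="ignore"):
        log_spec = 10 * np.log10(spectrum)
    log_spec[np.isinf(log_spec)] = 0
    return log_spec
```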
Problem 1: implement a function called getSpectrogram() that takes a signal as input and returns an array with the log of the magnitudes as defined above.
Problem 2: create a dictionary called Spectrograms that stores the spectrogram of each song, and plot the spectrogram of each song.
Your plots should look something like spectrogram_examples.png, included in this folder. Of course, use whichever colors and fonts you like most for each plot.
- The dictionary Spectrograms will store the spectrogram data, so Spectrograms[s] returns an array representing the spectrogram data for the song
Part 4: spectrogram local peaks
The algorithm implemented by Shazam does not keep the entire spectrogram of the data. It would be too computationally intensive to match a song with a reference spectrogram by searching across all time-frequency pairs. Moreover, the signal collected by the microphone on your phone is very noisy, so a lot of the information in the spectrogram is useless. Instead, we only keep the largest (peak) values of the spectrogram. Those are the largest frequency components that, hopefully, can be clearly distinguished from the noise.
Next, we will implement a function that gets the local peaks of the spectrogram and plot them.
- We use our get_2D_peaks() function defined in utils.py.
- This function will take a spectrogram as input.
- Recall that, by now, your spectrograms are saved in a dictionary.
- The output is a triple of:
- the array of frequency indices of each peak
- the array of time indices of each peak
- the array of peaks as (frequency value, time value)
- The dictionary Peaks will store the peak data, so Peaks[s] returns an array of the local peaks for song s, where each peak is a tuple of (frequency value, time value)
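The core idea behind utils.get_2D_peaks can be sketched as follows. This is not the actual utils.py code -- the names, neighborhood size, and amplitude threshold below are illustrative:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def local_peaks(spec, neighborhood=20, amp_min=10):
    # A point is a local peak if it equals the maximum over its
    # neighborhood AND exceeds an amplitude threshold (to reject
    # flat low-level regions of the spectrogram).
    is_local_max = maximum_filter(spec, size=neighborhood) == spec
    detected = is_local_max & (spec > amp_min)
    freq_idx, time_idx = np.nonzero(detected)
    peaks = list(zip(freq_idx.tolist(), time_idx.tolist()))
    return freq_idx, time_idx, peaks
```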
Problem 3: using the function utils.get_2D_peaks, plot the peaks of the spectrogram on a time-frequency plot for each spectrogram in the dictionary Spectrograms. Store the corresponding peak values in the dictionary Peaks.
Your plots should look something like 2dpeaks_examples.png, included in this folder (the colors, fonts, etc. are, of course, up to you).
The Peaks dictionary will play the role of the database that Shazam has. Later on, you will try to match the spectrogram peaks in a recording with the peaks in this dictionary.
Part 5: thresholding
We want to use only the largest spectrogram peaks for each song, and specifically, the largest peaks in each time segment.
Next, you will simplify the dictionary Peaks in order to keep only the 20 largest peaks. You can adjust this value to keep more or fewer peaks -- more peaks make the matching more accurate, but make the overall matching process much slower. In our case we are only using 5 songs, so it's not a big deal. If you were matching recordings against millions of songs (for thousands of users simultaneously), the number of peaks could be an issue.
Problem 4: simplify the dictionary Peaks in order to keep only the 20 largest spectrogram peaks per song.
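One possible approach (not the only one) is to rank peaks by their spectrogram magnitude, which assumes you still have the spectrogram on hand; the function name below is illustrative and indexing is [frequency index][time index]:

```python
def top_peaks(spec, peaks, keep=20):
    # Rank (frequency index, time index) tuples by the spectrogram
    # magnitude at that point, largest first, and keep the top `keep`.
    ranked = sorted(peaks, key=lambda p: spec[p[0]][p[1]], reverse=True)
    return ranked[:keep]
```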
Part 6: peak pairing and table construction
How can the peaks in our Peaks dictionary be matched against a recording? One of the main challenges is that our peaks are timestamped, and we don't know exactly which instant of the song our audio recording corresponds to. Finding this offset can be computationally intensive. Shazam deals with this time-offset issue in a very clever way: instead of matching individual peaks, it looks at the time differences between spectrogram peaks.
The final stage in creating our song fingerprints is to find pairs of peaks, and record a) their respective frequencies and b) the time difference between them. Peak pairs should meet the following constraints:
- The second peak must occur within a certain time interval after the first peak
- Each peak can only be part of a certain number of pairs (this pair limit is defined in our globals as FAN_VALUE)
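The two constraints above can be sketched as follows. The name and value of max_dt (the pairing time window) are illustrative assumptions, since the exact window is left to you; each pair records the two frequencies and the time difference between the peaks:

```python
def get_pairs(peaks, fan_value=15, max_dt=200):
    # `peaks` is a list of (frequency index, time index) tuples.
    peaks = sorted(peaks, key=lambda p: p[1])  # sort by time index
    pairs = []
    for i, (f1, t1) in enumerate(peaks):
        fanned = 0
        for f2, t2 in peaks[i + 1:]:
            if t2 - t1 > max_dt:      # second peak must be close in time
                break
            pairs.append((f1, f2, t2 - t1))
            fanned += 1
            if fanned == fan_value:   # each peak joins at most fan_value pairs
                break
    return pairs
```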
Problem 5: Create a function getPairs() that takes the peaks of a given song and returns the peak pairs defined above. This corresponds to the processing done when you record audio with your phone.
Problem 6: Create a dictionary LookUpTable that stores the peak pairs and their corresponding songs, so LookUpTable[p] returns the title of the song containing peak pair p. This dictionary emulates the database search done by Shazam.
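A minimal sketch of building such a table, assuming the pairs for each song are already computed (the input layout below is an assumption):

```python
def build_lookup_table(pairs_by_song):
    # `pairs_by_song` maps song title -> list of (f1, f2, dt) pairs.
    # If the same pair appears in several songs, later songs overwrite
    # earlier ones here; a real system would store a list of candidates.
    table = {}
    for song, pairs in pairs_by_song.items():
        for pair in pairs:
            table[pair] = song
    return table
```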
Part 7: test from files
Problem 7: Now we can test your Shazam!
- You can access mp3s pulled from YouTube video recordings of these songs; they are stored in the folder test_mp3s
- Take a short snippet of these recordings and run the same fingerprinting process to get their peak pairs
- Match the pairs against our LookUpTable of peak pairs and see if you match the correct song!
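The final matching step can be sketched as a simple vote count: look up each of the snippet's pairs in the table and let the song with the most hits win. The function name below is an illustrative assumption:

```python
from collections import Counter

def match_snippet(snippet_pairs, lookup_table):
    # Count one vote per snippet pair found in the table; pairs that
    # match no song are simply ignored.
    votes = Counter(lookup_table[p] for p in snippet_pairs if p in lookup_table)
    # Return the most-voted song, or None if nothing matched.
    return votes.most_common(1)[0][0] if votes else None
```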