Voice recognition for ES 156 (Spring 2021)
Acknowledgments: the dataset was downloaded from https://github.com/Jakobovski/free-spoken-digit-dataset; all credit goes to that repository. We also acknowledge Prof. Alejandro Ribeiro (UPenn), whose lecture notes inspired this exercise.
Data cleanup
We start by importing a few packages.
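A minimal set of imports for this exercise might look like the following (assuming `numpy` for the FFT, `scipy` for reading the `.wav` files, and `matplotlib` for plotting):

```python
import os
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile  # for reading the .wav recordings
```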
Next we load the data. By changing the digits in the list `digit`, you can import different spoken digits from the dataset. The dataset consists of `.wav` files -- you can download it and give it a listen!
The recordings are all sampled at 8 kHz. The recordings, however, have different lengths due to different recording durations. The maximum vector length determines the length of the FFT (shorter vectors will be padded with zeros); this value is computed in the code below.
The dictionary `signals` will store the recorded spoken digits, so `signals[d]` returns a list of vectors containing the recordings for digit `d`.
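As a rough sketch of this loading step (the file names in the free-spoken-digit-dataset start with the spoken digit, e.g. `7_jackson_0.wav`; the folder name `recordings` and the variable names below are assumptions):

```python
digit = [1, 2]            # digits to load; change this list to import other spoken digits
data_dir = 'recordings'   # folder containing the .wav files (assumed path)

signals = {d: [] for d in digit}
for fname in os.listdir(data_dir):
    d = int(fname.split('_')[0])          # file names start with the spoken digit
    if d in digit:
        fs, x = wavfile.read(os.path.join(data_dir, fname))  # fs = 8000 Hz for this dataset
        signals[d].append(x.astype(float))

# the longest recording sets the FFT length; shorter ones will be zero-padded
N = max(len(x) for d in digit for x in signals[d])
```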
Classification
In order to perform classification, the dataset has to be split into a "training" set and a "testing" set. The training set will be used to compute the average spectra used in the classification task below. The test set will be used to evaluate the performance of our method. We create a random 80/20 split.
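One way to create this split (the dictionary names `signals_train` and `signals_test`, and the fixed random seed, are illustrative choices):

```python
rng = np.random.default_rng(0)  # fixed seed so the split is reproducible

signals_train, signals_test = {}, {}
for d, recs in signals.items():
    idx = rng.permutation(len(recs))   # random ordering of the recordings
    n_train = int(0.8 * len(recs))     # 80% for training, 20% for testing
    signals_train[d] = [recs[i] for i in idx[:n_train]]
    signals_test[d]  = [recs[i] for i in idx[n_train:]]
```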
Next we compute the average magnitude of the DFT of each digit using the FFT algorithm.
Let $x_{i,d}[n]$ be the discrete signal corresponding to the $i$th instance of digit $d$ in the dataset. For example, each of $x_{1,1}[n], x_{2,1}[n], \dots, x_{m,1}[n]$ is the signal corresponding to one of the $m$ instances of digit 1.
We can compute the transform (via the FFT) of each of these signals: $$x_{i,d}[n] \leftrightarrow X_{i,d}[k].$$
We define the average spectral magnitude of digit $d$ as the average over all instances of that digit in the dataset: $$\bar{X}_{d}[k] \triangleq \sum_{i\in \mathrm{instances}}\frac{\left|X_{i,d}[k]\right|}{\mathrm{number~of~instances}}.$$
We normalize this quantity to create the normalized average spectral magnitude of digit $d$:
$$\tilde{X}_{d}[k]\triangleq\frac{\bar{X}_{d}[k]}{\sqrt{\sum_{k'}\left|\bar{X}_{d}[k']\right|^2}}.$$
Later we will use this as a "signature" for each digit. A new instance of a spoken digit will be classified as digit $d$ if the normalized spectral magnitude of that new signal is similar to the normalized average spectral magnitude $\tilde{X}_{d}[k]$ of the $d$th digit.
Question 1:
Compute $\tilde{X}_{d}[k]$ for all digits and store them in the dictionary `mean_transforms` below.
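A possible sketch of this computation, following the definitions above (it assumes the `signals_train` dictionary and the FFT length `N` from the earlier sketches; this is only one way to fill in the answer):

```python
mean_transforms = {}
for d in digit:
    # average |FFT| over all training instances of digit d, zero-padded to length N
    X_bar = np.zeros(N)
    for x in signals_train[d]:
        X_bar += np.abs(np.fft.fft(x, n=N))
    X_bar /= len(signals_train[d])
    # normalize so the average spectral magnitude has unit energy
    mean_transforms[d] = X_bar / np.sqrt(np.sum(X_bar ** 2))
```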
Question 2: Plot the average spectral magnitude of each digit
You should create a plot of the average of the magnitudes of the FFTs of all signals (i.e., recordings) of a given digit.
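A plotting sketch, assuming `mean_transforms` from Question 1 and the 8 kHz sampling rate for the frequency axis:

```python
freqs = np.arange(N) * 8000 / N   # DFT bin frequencies in Hz (fs = 8 kHz)
for d in digit:
    # plot the positive-frequency half of each digit's signature
    plt.plot(freqs[:N // 2], mean_transforms[d][:N // 2], label=f'digit {d}')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Normalized average spectral magnitude')
plt.legend()
plt.show()
```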
If you selected digits 1 and 2, you can see that there is a very clear distinction between the two digits. Next we create the mean classifier for distinguishing between spoken digits. For a given signal $x[n]\leftrightarrow X[k]$, the mean classifier computes the inner product
$$p\left(X,\tilde{X}_d\right)\triangleq \sum_{k=1}^N \left|X[k]\right|\left|\tilde{X}_d[k]\right|,$$
where $\left|\tilde{X}_d[k]\right|$ is the mean spectral magnitude for digit $d$. We refer to this quantity as the *similarity* between $X$ and $\tilde{X}_d$. It quantifies how similar the magnitude of the spectrum of $X$ is to the average spectrum of digit $d$. The mean classifier then outputs as the class of $x[n]$ the digit with the highest similarity.
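A sketch of the mean classifier under the same assumptions (the helper name `mean_classifier` and the evaluation loop are illustrative):

```python
def mean_classifier(x, mean_transforms, N):
    """Classify a recording x as the digit whose signature is most similar."""
    X_mag = np.abs(np.fft.fft(x, n=N))                 # |X[k]| of the new signal
    # inner product between |X[k]| and each digit's normalized signature
    similarities = {d: np.sum(X_mag * Xd) for d, Xd in mean_transforms.items()}
    return max(similarities, key=similarities.get)     # digit with highest similarity

# evaluate on the test split
correct = sum(mean_classifier(x, mean_transforms, N) == d
              for d in signals_test for x in signals_test[d])
total = sum(len(signals_test[d]) for d in signals_test)
print(f'Test accuracy: {correct / total:.2%}')
```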