Memory in Speech Processing

This very short post is just to jot down an epiphany I had about a connection between two different approaches used in speech processing.

One popular kind of feature extracted from raw speech data is the MFCC (mel-frequency cepstral coefficient). I don’t want to dwell too much on what exactly this is, but it basically involves taking two Fourier-type transforms, the second applied to the log of the spectrum produced by the first (so it’s roughly the spectrum of a spectrum). This works well, but is completely mystifying. The first Fourier transform makes sense, since sound waves are a periodic phenomenon, but the second one makes much less sense.

Another group of approaches growing in popularity is recurrent neural networks, which are basically networks that have some notion of “memory” built into them. This means that inputs seen at earlier times can be used when making a prediction for the present input, although details vary a lot depending on the particular implementation.

The punchline is that, on some level, these are doing similar things. Or rather, the MFCC is also relying on the idea that data from previous time points are influencing the current data point.

To see this, suppose our data are ${x_1, x_2, \ldots}$ and that they follow a model of the form

$\displaystyle x_t=s_t+As_{t-D},$

for some other stationary series ${s_t}$ and constants ${A, D}$. The recurrent neural network would try to learn that ${x_{t-D}}$ carries information about ${x_t}$ (since it is designed to have access to ${x_{t-D}}$). What about the MFCC?

It’s helpful to directly compute the spectrum, starting with the auto-covariance. If ${\gamma_s(h)=\text{cov}(s_{t+h}, s_t)}$ is the auto-covariance of ${s_t}$, then the auto-covariance of ${x_t}$ is

$\displaystyle \gamma_x(h)=\text{cov}(x_{t+h}, x_t)=\text{cov}(s_{t+h}+As_{t-D+h}, s_t+As_{t-D})$

which, because the covariance is bilinear, becomes

$\displaystyle (1+A^2)\gamma_s(h)+A\gamma_s(h+D)+A\gamma_s(h-D).$
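As a quick sanity check of this formula, here is a minimal NumPy sketch. It assumes (this is not part of the argument above) that ${s_t}$ is white noise with unit variance, so that ${\gamma_s(0)=1}$ and ${\gamma_s(h)=0}$ otherwise; the formula then predicts ${\gamma_x(0)=1+A^2}$, ${\gamma_x(\pm D)=A}$, and zero everywhere else. The values ${A=0.5}$, ${D=25}$ and the helper `sample_autocov` are illustrative choices only.

```python
import numpy as np

def sample_autocov(x, h):
    """Biased sample auto-covariance of x at lag h >= 0."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    n = len(xc)
    return float(np.dot(xc[h:], xc[: n - h]) / n)

# Simulate x_t = s_t + A * s_{t-D}, with s_t white noise (an assumption).
rng = np.random.default_rng(0)
n, A, D = 200_000, 0.5, 25
s = rng.standard_normal(n + D)  # D extra samples so every x_t has a delayed partner
x = s[D:] + A * s[:-D]          # s[D:] plays the role of s_t, s[:-D] of s_{t-D}

gamma_0 = sample_autocov(x, 0)          # should be close to 1 + A**2
gamma_D = sample_autocov(x, D)          # should be close to A
gamma_other = sample_autocov(x, D + 7)  # should be close to 0
```

Up to sampling noise, the three estimates should match the predicted values, and any lag other than ${0}$ and ${\pm D}$ should give something near zero.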

Then the theoretical spectrum is

$\displaystyle f_x(\omega)=\sum_{h=-\infty}^{\infty} \gamma_x(h)e^{-2\pi i \omega h},$

which because of our earlier computation expands to

$\displaystyle \sum_{h=-\infty}^{\infty}((1+A^2)\gamma_s(h)+A\gamma_s(h+D)+A\gamma_s(h-D)) e^{-2\pi i \omega h}.$

This can be rewritten as

$\displaystyle (1+A^2)f_s(\omega)+Ae^{2\pi i \omega D}\sum_{h=-\infty}^{\infty} \gamma_s(h+D)e^{-2\pi i \omega (h+D)}+Ae^{-2\pi i \omega D}\sum_{h=-\infty}^{\infty} \gamma_s(h-D)e^{-2\pi i \omega (h-D)},$

and then simplified to

$\displaystyle (1+A^2)f_s(\omega)+Af_s(\omega)(e^{2\pi i \omega D}+e^{-2\pi i \omega D})=(1+A^2+2A\cos (2\pi \omega D))f_s(\omega).$
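Another way to read this result: ${x_t}$ is ${s_t}$ passed through a linear filter with frequency response ${1+Ae^{-2\pi i \omega D}}$, and the factor multiplying ${f_s(\omega)}$ is the squared magnitude of that response. The sketch below checks the identity numerically on a grid of frequencies; ${A=0.5}$ and ${D=25}$ are arbitrary illustrative values.

```python
import numpy as np

A, D = 0.5, 25
omega = np.linspace(-0.5, 0.5, 1001)  # frequencies in cycles per sample

# Frequency response of the filter s_t -> s_t + A * s_{t-D}.
H = 1 + A * np.exp(-2j * np.pi * omega * D)
gain_from_filter = np.abs(H) ** 2

# Closed form obtained from the auto-covariance calculation.
gain_from_formula = 1 + A**2 + 2 * A * np.cos(2 * np.pi * omega * D)

# The same closed form evaluated one period (1/D) further along in frequency.
gain_shifted = 1 + A**2 + 2 * A * np.cos(2 * np.pi * (omega + 1 / D) * D)

identity_holds = np.allclose(gain_from_filter, gain_from_formula)
is_periodic = np.allclose(gain_from_formula, gain_shifted)
```

Both checks should come out true: the two expressions for the gain agree, and the gain repeats every ${1/D}$ in frequency.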

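So the spectrum of ${x_t}$ is the spectrum of ${s_t}$ multiplied by a factor that is periodic in ${\omega}$ with period ${1/D}$. Taking a logarithm, as the MFCC does, turns that product into a sum, so the periodic factor becomes an additive ripple on top of ${\log f_s(\omega)}$, and taking a second Fourier transform means sussing out this periodicity.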
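To see the second transform pick out the delay, here is a hedged sketch on simulated data. It computes a plain cepstrum (inverse FFT of the log periodogram), which is a simplification of the MFCC: there is no mel filterbank, and an FFT stands in for the discrete cosine transform. It again assumes ${s_t}$ is white noise, with ${A=0.5}$ and ${D=25}$ as arbitrary illustrative values.

```python
import numpy as np

# Simulate x_t = s_t + A * s_{t-D}, with s_t white noise (an assumption).
rng = np.random.default_rng(0)
n, A, D = 2**14, 0.5, 25
s = rng.standard_normal(n + D)
x = s[D:] + A * s[:-D]

# First transform: periodogram, a raw estimate of the spectrum of x.
power = np.abs(np.fft.fft(x)) ** 2 / n

# Second transform: inverse FFT of the log spectrum (a plain cepstrum).
cepstrum = np.fft.ifft(np.log(power)).real

# Skip lag 0 (overall level) and the mirrored upper half, then find the peak.
peak_lag = 1 + int(np.argmax(np.abs(cepstrum[1 : n // 2])))
```

With these settings the largest cepstral coefficient should sit at lag ${D}$, with smaller echoes at multiples of ${D}$: the second transform has recovered how far back in time the "memory" reaches.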

This means, then, that MFCCs and recurrent neural networks are actually making similar assumptions about the data, or at least trying to exploit the same kind of structure. I like this similarity, especially because it suggests that thinking about the effect of previously seen data points is the “right” way to think about speech data.