Mercer’s Theorem and SVMs

In a funny coincidence, this post has the same basic structure as my previous one: proving some technical result, and then looking at an application to machine learning. This time it’s Mercer’s theorem from functional analysis, and the kernel trick for SVMs. The proof of Mercer’s theorem mostly follows Lax’s Functional Analysis.

1. Mercer’s Theorem

Consider a real-valued function ${K(s,t)}$, and the corresponding integral operator ${\mathbf{K}: L^2[0,1]\rightarrow L^2[0,1]}$ given by

$\displaystyle (\mathbf{K} u)(s)=\int_0^1 K(s,t) u(t)\, dt.$

We begin with two facts connecting the properties of ${K}$ to the properties of ${\mathbf{K} }$.

Proposition 1. If ${K}$ is continuous, then ${\mathbf{K} }$ is compact.

Proof: Consider a bounded sequence ${\{f_n\}_{n=1}^{\infty} \subset L^2[0,1]}$. We wish to show that the image of this sequence, ${\{\mathbf{K} f_n\}_{n=1}^{\infty}}$, has a convergent subsequence. We show that ${\{\mathbf{K} f_n\}}$ is equicontinuous, and Arzela-Ascoli then gives a uniformly convergent subsequence, which in turns gives that ${\mathbf{K} }$ is compact.

Since ${K(x,y)}$ is continuous, for every ${\epsilon>0}$, there exists ${\delta>0}$ such that

$\displaystyle |x-\tilde{x}|<\delta \implies |K(x,y)-K(\tilde{x}, y)|\leq \epsilon.$

Then, if ${|x-\tilde{x}|<\delta}$, we have that

$\displaystyle |\mathbf{K} f_n(x)- \mathbf{K} f_n(\tilde{x})| = \left| \int_0^1 K(x, t) f_n(t)\, dt \right| -\left| \int_0^1 K(\tilde{x}, t) f_n(t)\, dt \right|$

which is bounded by

$\displaystyle \int_0^1 |K(x,t)-K(\tilde{x},t)| |f_n(t)| \, dt \leq \epsilon \int_0^1 |f_n(t)|\, dt\leq \epsilon M,$

where ${M}$ is an upper bound on the norm of the ${f_n}$. The equicontinuity of the ${\mathbf{K} f_n}$, and thus the compactness of ${\mathbf{K} }$, follows. $\Box$

Proposition 2. If ${K}$ is symmetric, then ${\mathbf{K} }$ is self-adjoint.

Proof: We directly compute that

$\displaystyle \langle \mathbf{K} u, v\rangle = \int_0^1\int_0^1 K(s,t)\, u(s) v(t)\, ds\, dt$

and

$\displaystyle \langle u, \mathbf{K} v\rangle = \int_0^1\int_0^1 K(t,s)\, u(s) v(t)\, ds\, dt,$

which will be equal if ${K(s,t)=K(t,s)}$, so ${\mathbf{K} }$ is self-adjoint in this case. $\Box$

Thus, when ${K}$ is continuous and symmetric, the operator ${\mathbf{K} }$ is a compact self-adjoint operator, and so we can apply the spectral theorem to find a set of orthonormal eigenfunctions ${e_j}$ with real eigenvalues ${\kappa_j}$ that form a basis for ${L^2[0,1]}$. The following theorem connects ${K}$ to these eigenfunctions and eigenvalues in the case where ${\mathbf{K} }$ is positive.

Theorem 3 (Mercer). If ${K}$ is symmetric and continuous, and the associated operator ${\mathbf{K} }$ is positive i.e. ${\langle \mathbf{K} u, u\rangle \geq 0}$ for all ${u}$, then we can write

$\displaystyle K(s,t)=\sum_{j=1}^{\infty} \kappa_j e_j(s)e_j(t),$

where the ${\kappa_j}$ and ${e_j}$ are the eigenfunctions and eigenvalues of ${\mathbf{K} }$.

To prove this, we’ll need a lemma.

Lemma 4. If ${\mathbf{K} }$ is positive, then ${K}$ is non-negative on the diagonal.

Proof: Suppose for the sake of contradiction that ${K(r,r)<0}$ for some ${r}$. Choose ${s,t}$ so close to ${r}$ that ${K(s,t)}$ is negative, and choose ${u}$ to be a bump function that is zero except in a small neighborhood around ${(r,r)}$. Then,

$\displaystyle \langle \mathbf{K} u, u\rangle = \int_0^1\int_0^1 K(s,t) u(t) u(s)\, ds \, dt$

will be negative, contradicting the positivity of ${\mathbf{K} }$. $\Box$

Proof of Mercer’s Theorem: Define a sequence of functions

$\displaystyle K_N(s,t)=\sum_{j=1}^N \kappa_j e_j(s) e_j(t),$

and let ${\mathbf{K} _N}$ be the associated sequence of integral operators. We first show that the ${K_N}$ converge uniformly.

The operator ${\mathbf{K} -\mathbf{K} _N}$ has eigenfunctions ${e_j}$ and eigenvalues ${\kappa_j}$ for ${j>N}$, so it is a positive operator. But then ${K-K_N}$ must be non-negative on the diagonal by our lemma, and so

$\displaystyle K(s,s)-\sum_{j=1}^N \kappa_j e_j^2(s)\geq 0\implies K(s,s)\geq \sum_{j=1}^N \kappa_j e_j^2(s)$

for all ${N}$. Since the sum on the right is always non-negative, it must converge for every ${s}$. This gives point-wise monotone convergence of the sequence of partial sums, which means they converge uniformly in ${s}$.

But now by Cauchy-Schwarz,

$\displaystyle \left|\sum_{j=1}^N \kappa_j e_j(s) e_j(t)\right| \leq \left(\sum_{j=1}^N \kappa_j e_j^2(s)\right)^{1/2}\left(\sum_{j=1}^N \kappa_j e_j^2(t)\right)^{1/2},$

and since both of the series on the right converge uniformly, the series on the left does as well.

Let ${K_{\infty}}$ be the limit of the ${K_N}$. Then, by definition, ${\mathbf{K} }$ and ${\mathbf{K} _{\infty}}$ agree on all of the ${e_j}$ and their linear combinations, and since the ${e_j}$ span the space, this means ${\mathbf{K} }$ and ${\mathbf{K} _{\infty}}$ agree everywhere, and so they must be the same operator. But this means they have the same kernel, so ${K= K_{\infty}}$, completing the proof. $\Box$

2. The Kernel Trick

The basic idea of the kernel trick is this: let’s say you have some data points ${\mathbf{x}_1, \mathbf{x}_2,\cdots, \mathbf{x}_N}$. If you want to transform these data points by some map ${\phi}$, you have to apply ${\phi}$ to all of them, and if you want to run an SVM, you have to compute all ${N^2}$ dot products ${\phi(\mathbf{x}_i)\cdot \phi(\mathbf{x}_j)}$. For large ${N}$, this can be a huge pain, so the dream is to choose ${\phi}$ so that you can compute these dot products without actually having to transform the data i.e. find some ${K}$ such that

$\displaystyle \phi(\mathbf{x}_i)\cdot \phi(\mathbf{x}_j)= K(\mathbf{x}_i, \mathbf{x}_j).$

Mercer’s theorem allows us to do exactly this, but in the other direction – if we choose ${K}$ so that the associated integral operator ${\mathbf{K} }$ is positive definite, then we can write

$\displaystyle K(\mathbf{x}_i, \mathbf{x}_j)=\sum_{j=1}^{\infty} \kappa_j e_j(s) e_j(t)= \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle,$

where

$\displaystyle \phi: \left[\begin{array}{ccc} x_1& x_2&\cdots \end{array}\right]\mapsto \left[\begin{array}{ccc} \sqrt{\kappa_1} e_j(x_1)& \sqrt{\kappa_1} e_j(x_2)&\cdots \end{array}\right].$

So instead of trying to choose ${\phi}$ so that ${K}$ exists, we just choose an appropriate ${K}$, and don’t even need to think about ${\phi}$. In particular, we can work with the data in another feature space without running directly into computational barriers.