In a funny coincidence, this post has the same basic structure as my previous one: proving some technical result, and then looking at an application to machine learning. This time it’s Mercer’s theorem from functional analysis, and the kernel trick for SVMs. The proof of Mercer’s theorem mostly follows Lax’s Functional Analysis.
1. Mercer’s Theorem
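Consider a real-valued function $K(s,t)$ on $[0,1] \times [0,1]$, and the corresponding integral operator $\mathbf{K}: L^2[0,1] \to L^2[0,1]$ given by
$$(\mathbf{K}u)(s) = \int_0^1 K(s,t)\, u(t)\, dt.$$
We begin with two facts connecting the properties of $K$ to the properties of $\mathbf{K}$.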
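Proposition 1. If $K$ is continuous, then $\mathbf{K}$ is compact.
Proof: Consider a bounded sequence $\{f_n\} \subset L^2[0,1]$. We wish to show that the image of this sequence, $\{\mathbf{K}f_n\}$, has a convergent subsequence. We show that $\{\mathbf{K}f_n\}$ is equicontinuous (it is also uniformly bounded, by the same Cauchy-Schwarz estimate used below), and Arzela-Ascoli then gives a uniformly convergent subsequence. Uniform convergence on $[0,1]$ implies convergence in $L^2[0,1]$, which in turn gives that $\mathbf{K}$ is compact.
Since $K$ is continuous on the compact square $[0,1]^2$, it is uniformly continuous, so for every $\epsilon > 0$ there exists $\delta > 0$ such that
$$|s_1 - s_2| < \delta \implies |K(s_1,t) - K(s_2,t)| < \epsilon \quad \text{for all } t \in [0,1].$$
Then, if $|s_1 - s_2| < \delta$, we have that
$$|(\mathbf{K}f_n)(s_1) - (\mathbf{K}f_n)(s_2)| = \left| \int_0^1 \big(K(s_1,t) - K(s_2,t)\big) f_n(t)\, dt \right|,$$
which is bounded by
$$\left( \int_0^1 |K(s_1,t) - K(s_2,t)|^2\, dt \right)^{1/2} \|f_n\| \le \epsilon M,$$
where $M$ is an upper bound on the norm of the $f_n$. The equicontinuity of the $\mathbf{K}f_n$, and thus the compactness of $\mathbf{K}$, follows.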
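Proposition 2. If $K$ is symmetric, then $\mathbf{K}$ is self-adjoint.
Proof: We directly compute that
$$\langle \mathbf{K}u, v \rangle = \int_0^1 \int_0^1 K(s,t)\, u(t)\, v(s)\, dt\, ds, \qquad \langle u, \mathbf{K}v \rangle = \int_0^1 \int_0^1 K(t,s)\, u(t)\, v(s)\, ds\, dt,$$
which will be equal if $K(s,t) = K(t,s)$, so $\mathbf{K}$ is self-adjoint in this case.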
Thus, when $K$ is continuous and symmetric, the operator $\mathbf{K}$ is a compact self-adjoint operator, and so we can apply the spectral theorem to find a set of orthonormal eigenfunctions $e_j$ with real eigenvalues $\lambda_j$ that form a basis for $L^2[0,1]$. The following theorem connects $K$ to these eigenfunctions and eigenvalues in the case where $\mathbf{K}$ is positive.
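To make the spectral picture concrete, here is a rough numerical sketch. It assumes the operator acts on $L^2[0,1]$, approximates it by a midpoint-rule matrix, and uses $K(s,t) = \min(s,t)$ purely as an illustrative kernel (it is not taken from the argument above). The helper name `discretize_operator` is hypothetical.

```python
import numpy as np

def discretize_operator(kernel, n=500):
    """Midpoint-rule matrix approximating (Ku)(s) = int_0^1 K(s,t) u(t) dt."""
    h = 1.0 / n
    s = (np.arange(n) + 0.5) * h              # midpoints of n equal cells
    A = kernel(s[:, None], s[None, :]) * h    # A[i, j] = K(s_i, s_j) * h
    return s, h, A

# Illustrative kernel (an assumption): K(s,t) = min(s,t), continuous and symmetric.
s, h, A = discretize_operator(np.minimum)

# Self-adjointness: <Ku, v> should equal <u, Kv> for arbitrary grid functions,
# where the discrete inner product is <f, g> ~ h * sum(f_i * g_i).
rng = np.random.default_rng(0)
u, v = rng.standard_normal((2, s.size))
lhs = h * np.dot(A @ u, v)
rhs = h * np.dot(u, A @ v)

# Spectral decomposition of the symmetric matrix: real eigenvalues and
# orthogonal eigenvectors. eigh returns ascending order, so reverse it.
eigvals, eigvecs = np.linalg.eigh(A)
eigvals = eigvals[::-1]
eigvecs = eigvecs[:, ::-1] / np.sqrt(h)       # rescale so h * sum(e_j^2) = 1

print("<Ku,v> - <u,Kv> =", lhs - rhs)
print("largest eigenvalues:", eigvals[:3])
```

For this particular kernel the largest eigenvalue of the continuous operator is $4/\pi^2$, so the first printed eigenvalue gives a quick sanity check on the discretization.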
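Theorem 3 (Mercer). If $K$ is symmetric and continuous, and the associated operator $\mathbf{K}$ is positive, i.e. $\langle \mathbf{K}u, u \rangle \ge 0$ for all $u$, then we can write
$$K(s,t) = \sum_{j=1}^{\infty} \lambda_j\, e_j(s)\, e_j(t),$$
where the $e_j$ and $\lambda_j$ are the eigenfunctions and eigenvalues of $\mathbf{K}$.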
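As a worked instance of the expansion in Theorem 3, consider $K(s,t) = \min(s,t)$ on $[0,1]$, an example chosen here for illustration. Its eigenfunctions are $e_j(s) = \sqrt{2}\sin((j - \tfrac12)\pi s)$ with eigenvalues $\lambda_j = 1/((j - \tfrac12)^2\pi^2)$. The sketch below sums the first $N$ terms and compares against the kernel; `mercer_partial_sum` is a hypothetical helper name.

```python
import numpy as np

def mercer_partial_sum(s, t, N):
    """Sum of the first N Mercer terms lambda_j e_j(s) e_j(t) for K = min(s,t)."""
    j = np.arange(1, N + 1)
    omega = (j - 0.5) * np.pi
    lam = 1.0 / omega**2                               # eigenvalues
    e_s = np.sqrt(2.0) * np.sin(np.outer(s, omega))    # e_j(s), shape (len(s), N)
    e_t = np.sqrt(2.0) * np.sin(np.outer(t, omega))    # e_j(t), shape (len(t), N)
    return (e_s * lam) @ e_t.T

grid = np.linspace(0.0, 1.0, 101)
K_exact = np.minimum(grid[:, None], grid[None, :])

# Maximum error over the grid for increasing truncation levels.
errors = {
    N: np.max(np.abs(K_exact - mercer_partial_sum(grid, grid, N)))
    for N in (5, 50, 200)
}

# On the diagonal the partial sums approach K(s,s) from below.
diag_gap = np.diag(K_exact - mercer_partial_sum(grid, grid, 50))

for N, err in errors.items():
    print(f"N = {N:3d}: max |K - K_N| = {err:.5f}")
```

The error shrinks as $N$ grows, and the diagonal gap stays non-negative, which is the inequality the proof below relies on.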
To prove this, we’ll need a lemma.
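Lemma 4. If $K$ is continuous and $\mathbf{K}$ is positive, then $K$ is non-negative on the diagonal, i.e. $K(r,r) \ge 0$ for all $r$.
Proof: Suppose for the sake of contradiction that $K(r,r) < 0$ for some $r$. By continuity, choose $\delta > 0$ so small that $K(s,t)$ is negative whenever $|s - r| < \delta$ and $|t - r| < \delta$, and choose $u$ to be a non-negative bump function that is zero outside $(r - \delta, r + \delta)$. Then,
$$\langle \mathbf{K}u, u \rangle = \int_0^1 \int_0^1 K(s,t)\, u(s)\, u(t)\, ds\, dt$$
will be negative, contradicting the positivity of $\mathbf{K}$.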
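The contradiction in this proof can be seen numerically. The sketch below uses an illustrative symmetric kernel, $K(s,t) = \cos(\pi(s+t))$, which is not from the argument above and is chosen only because $K(\tfrac12,\tfrac12) = -1$. It takes an indicator function near $r = \tfrac12$ as a crude, non-smooth stand-in for the bump function, and approximates the double integral with a midpoint rule.

```python
import numpy as np

n = 1000
h = 1.0 / n
s = (np.arange(n) + 0.5) * h                     # midpoint grid on [0, 1]

# Illustrative symmetric kernel with a negative diagonal value at r = 1/2.
K = np.cos(np.pi * (s[:, None] + s[None, :]))

# Indicator of (0.45, 0.55): a rough substitute for a smooth bump around r.
u = (np.abs(s - 0.5) < 0.05).astype(float)

# Midpoint-rule approximation of the double integral of K(s,t) u(s) u(t).
quad_form = h * h * (u @ K @ u)

print("<Ku, u> ~", quad_form)
```

The quadratic form comes out negative, so the integral operator with this kernel cannot be positive, in line with the lemma.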
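Proof of Mercer’s Theorem: Define a sequence of functions
$$K_N(s,t) = \sum_{j=1}^{N} \lambda_j\, e_j(s)\, e_j(t),$$
and let $\mathbf{K}_N$ be the associated sequence of integral operators. We first show that the $K_N$ converge uniformly.
The operator $\mathbf{K} - \mathbf{K}_N$ has eigenfunctions $e_j$ and eigenvalues $\lambda_j$ for $j > N$ (and eigenvalue $0$ for $j \le N$), so it is a positive operator. But then $K - K_N$ must be non-negative on the diagonal by our lemma, and so
$$K(s,s) \ge \sum_{j=1}^{N} \lambda_j\, e_j(s)^2$$
for all $N$. Since every term of the sum on the right is non-negative and the partial sums are bounded above by $K(s,s)$, the series must converge for every $s$. This gives point-wise monotone convergence of the sequence of partial sums, which, by Dini's theorem, means they converge uniformly in $s$ provided the limit is continuous.
But now by Cauchy-Schwarz, for any $m \le n$,
$$\left| \sum_{j=m}^{n} \lambda_j\, e_j(s)\, e_j(t) \right|^2 \le \left( \sum_{j=m}^{n} \lambda_j\, e_j(s)^2 \right) \left( \sum_{j=m}^{n} \lambda_j\, e_j(t)^2 \right),$$
and since both of the series on the right converge uniformly, the series on the left does as well.
Let $K_\infty$ be the limit of the $K_N$, and $\mathbf{K}_\infty$ the associated integral operator. Then, by definition, $\mathbf{K}$ and $\mathbf{K}_\infty$ agree on all of the $e_j$ and their linear combinations, and since the $e_j$ span the space, this means $\mathbf{K}$ and $\mathbf{K}_\infty$ agree everywhere, and so they must be the same operator. But this means they have the same kernel, so $K = K_\infty$, completing the proof.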
2. The Kernel Trick
The basic idea of the kernel trick is this: let’s say you have some data points $x_1, \dots, x_n$. If you want to transform these data points by some map $\phi$, you have to apply $\phi$ to all of them, and if you want to run an SVM, you have to compute all dot products $\langle \phi(x_i), \phi(x_j) \rangle$. For large $n$, this can be a huge pain, so the dream is to choose $\phi$ so that you can compute these dot products without actually having to transform the data, i.e. find some $K$ such that
$$\langle \phi(x_i), \phi(x_j) \rangle = K(x_i, x_j).$$
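Mercer’s theorem allows us to do exactly this, but in the other direction: if we choose $K$ so that the associated integral operator $\mathbf{K}$ is positive definite, then we can write
$$K(x,y) = \sum_{j} \lambda_j\, e_j(x)\, e_j(y) = \langle \phi(x), \phi(y) \rangle, \qquad \phi(x) = \big( \sqrt{\lambda_j}\, e_j(x) \big)_{j}.$$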
So instead of trying to choose $\phi$ so that $K$ exists, we just choose an appropriate $K$, and don’t even need to think about $\phi$. In particular, we can work with the data in another feature space without running directly into computational barriers.
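A small sketch of the trick in action, under some assumptions of my own choosing: the data are random points in $\mathbb{R}^2$, the first kernel is the degree-2 polynomial kernel $K(x,y) = (x \cdot y)^2$, whose feature map $\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)$ can be written down explicitly, and the second is a Gaussian kernel, whose feature space is infinite-dimensional. None of these specific kernels appear in the text above; the function names are illustrative.

```python
import numpy as np

def phi(X):
    """Explicit feature map for K(x, y) = (x . y)^2 on R^2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def poly_kernel(X, Y):
    """Degree-2 polynomial kernel, computed without ever forming phi."""
    return (X @ Y.T) ** 2

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel exp(-|x - y|^2 / (2 sigma^2))."""
    sq_dist = (np.sum(X**2, axis=1)[:, None]
               + np.sum(Y**2, axis=1)[None, :]
               - 2.0 * X @ Y.T)
    return np.exp(-sq_dist / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))           # 50 random data points in R^2

gram_explicit = phi(X) @ phi(X).T          # dot products after transforming
gram_kernel = poly_kernel(X, X)            # same numbers, no transformation

gram_gauss = gaussian_kernel(X, X)
min_eig = np.linalg.eigvalsh(gram_gauss).min()

print("Gram matrices agree:", np.allclose(gram_explicit, gram_kernel))
print("smallest eigenvalue of Gaussian Gram matrix:", min_eig)
```

The two Gram matrices for the polynomial kernel match, so an SVM that only needs dot products never has to compute $\phi$. For the Gaussian kernel there is no finite $\phi$ to write down, but the Gram matrix is still positive semi-definite up to rounding, which is the finite-sample counterpart of the positivity condition.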