Mercer's Theorem importance (Kernels)


I understand that Mercer's Theorem extends the definition of kernels to infinite-dimensional input spaces.

In the machine-learning realm, our training set is always finite, and hence the input space is always finite. So why are we interested in infinite input spaces?

Where is the flaw in my reasoning?


Best Answer

The training set is characterised by two numbers:

  • $N$, the number of training points
  • $F$, the number of features

The phrase "infinite input space" refers to $F$, the number of features, being infinite. It's not saying that you need infinitely many training points!

In a typical use case, each feature vector $\mathbf x_i$ in your raw training set is a vector in $\mathbb R^{F_{\rm raw}}$, where the number of raw features $F_{\rm raw}$ is a finite number. The idea is to define a mapping $$\Phi : \mathbb R^{F_{\rm raw}} \to \mathbb R^{F_{\rm derived}}$$ that converts these raw feature vectors $\mathbf x_i$ into derived feature vectors $\Phi (\mathbf x_i)$. The number of derived features $F_{\rm derived}$ can be much larger than the number of raw features $F_{\rm raw}$.
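As a minimal sketch of such a mapping (my own illustrative example, not the only choice): a degree-2 polynomial feature map takes $F_{\rm raw} = 2$ raw features to $F_{\rm derived} = 3$ derived features.

```python
import numpy as np

def phi(x):
    """Degree-2 polynomial feature map: R^2 -> R^3.

    Maps raw features (x1, x2) to derived features
    (x1^2, x2^2, sqrt(2)*x1*x2).
    """
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

x = np.array([2.0, 3.0])
print(phi(x))  # derived feature vector in R^3
```

The $\sqrt 2$ coefficient is a common convention that makes the dot product of two such derived vectors equal the polynomial kernel $(\mathbf x \cdot \mathbf x')^2$.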

We then train a linear regressor (say) using the derived feature vectors as inputs, giving predictions $\hat y$ that are linear in the derived feature vectors $\Phi(\mathbf x)$ but are non-linear in the raw feature vectors $\mathbf x$:

$$ \hat y = \mathbf w . \Phi (\mathbf x) $$ Thus, by using this feature mapping trick, we are able to model effects that are non-linear in the raw inputs even though we are essentially using a linear regression technique.
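A quick numerical sketch of this trick (with an assumed feature map $\Phi(x) = (1, x, x^2)$, which is not the only choice): ordinary least squares, which is linear in the derived features, recovers a target that is quadratic in the raw input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Raw 1-D inputs and a target that is non-linear in x.
x = rng.uniform(-2, 2, size=50)
y = 1.0 + 2.0 * x - 3.0 * x**2

# Derived features: Phi(x) = (1, x, x^2).
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares is linear in the derived features...
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# ...yet recovers the non-linear relationship in the raw input.
print(np.round(w, 6))  # [ 1.  2. -3.]
```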

As mentioned earlier, $F_{\rm derived}$, the number of derived features, can be much larger than $F_{\rm raw}$. In fact, there is no reason why we can't take $F_{\rm derived}$ to be countably infinite. (We require Hilbert space techniques to interpret things like the dot product $\mathbf w . \Phi(\mathbf x)$ that contain an infinite sum, but that's okay.) So this is where the infinity comes in.

But then, what is the role of the kernel? Suppose one assumes the following relationship between inputs and outputs: $$ y = \mathbf w . \Phi(\mathbf x) $$ where the weights $\mathbf w$ have a prior distribution that is unit Gaussian: $$ \mathbf w \sim \mathcal N(\mathbf 0, \mathbf I).$$

Then the covariance between the outputs $y(\mathbf x)$ and $y(\mathbf x')$ at two points $\mathbf x$ and $\mathbf x'$ is $$ {\rm Cov}[y(\mathbf x), y(\mathbf x')] = \Phi(\mathbf x) . \Phi(\mathbf x') := \kappa (\mathbf x, \mathbf x').$$
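This identity is easy to sanity-check numerically. The sketch below (using a hypothetical scalar feature map $\Phi(x) = (1, x, x^2)$ of my own choosing) draws many weight vectors from the unit Gaussian prior and compares the empirical covariance of the outputs with $\Phi(\mathbf x) . \Phi(\mathbf x')$.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # Hypothetical feature map for scalar x: R -> R^3.
    return np.array([1.0, x, x**2])

x, xp = 0.5, -1.0
F = 3

# Draw many weight vectors from the unit Gaussian prior w ~ N(0, I).
W = rng.standard_normal((200_000, F))
y_x  = W @ phi(x)    # samples of y(x)  = w . Phi(x)
y_xp = W @ phi(xp)   # samples of y(x') = w . Phi(x')

# Both outputs have zero mean, so the covariance is just E[y(x) y(x')].
empirical = np.mean(y_x * y_xp)
kernel = phi(x) @ phi(xp)   # Phi(x) . Phi(x') = kappa(x, x')
print(kernel, empirical)    # the two agree up to Monte Carlo noise
```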

Notice that the kernel function $\kappa (\mathbf x, \mathbf x')$ tells us something important about our choice of feature mapping $\Phi$: it tells us how outputs at two different points $\mathbf x$, $\mathbf x'$ are correlated under that choice of feature mapping.

In my explanation, I started with the mapping $\Phi$ and derived the kernel function $\kappa(\mathbf x, \mathbf x')$ that characterised the correlation structure. But in the real world, it sometimes makes more sense to go the other way: one postulates a kernel function $\kappa(\mathbf x, \mathbf x')$ that captures one's prior intuitive beliefs about the correlation structure, and then reverse engineers a feature mapping $\Phi$ that gives rise to this $\kappa(\mathbf x, \mathbf x')$. (Technically, it's not necessary to write down the explicit expression for the $\Phi$ that gives rise to this $\kappa(\mathbf x, \mathbf x')$: one is content to know that some such $\Phi$ exists; there are clever calculation techniques that work directly with $\kappa(\mathbf x, \mathbf x')$ instead of working with $\Phi$.)

Mercer's theorem is what enables us to do this reverse engineering. It tells us that, for any symmetric positive-definite choice of kernel $\kappa(\mathbf x, \mathbf x')$, it's possible to find a feature mapping $\Phi : \mathbb R^{F_{\rm raw}} \to \mathbb R^{F_{\rm derived}}$ such that $\kappa(\mathbf x, \mathbf x') = \Phi(\mathbf x). \Phi(\mathbf x')$. But there is a catch: $F_{\rm derived}$ may need to be countably infinite! And that's why we need to consider infinite derived feature spaces.
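The classic instance of this (a standard textbook example, not specific to this answer) is the Gaussian/RBF kernel $\kappa(x, x') = e^{-(x-x')^2/2}$ on scalar inputs: expanding $e^{xx'}$ as a power series gives the countably infinite feature map $\Phi_n(x) = e^{-x^2/2}\, x^n / \sqrt{n!}$, $n = 0, 1, 2, \dots$. Truncating that infinite dot product converges quickly to the kernel value:

```python
import numpy as np
from math import exp, factorial, sqrt

def rbf(x, xp):
    """Gaussian (RBF) kernel with unit bandwidth."""
    return exp(-(x - xp) ** 2 / 2)

def phi_truncated(x, n_terms):
    """First n_terms coordinates of the infinite feature map
    phi_n(x) = exp(-x^2/2) * x^n / sqrt(n!), n = 0, 1, 2, ...
    whose full (infinite) dot product reproduces the RBF kernel."""
    return np.array([exp(-x**2 / 2) * x**n / sqrt(factorial(n))
                     for n in range(n_terms)])

x, xp = 0.7, -0.3
for n in (1, 3, 10):
    approx = phi_truncated(x, n) @ phi_truncated(xp, n)
    print(n, approx)            # converges towards the kernel value
print("kernel:", rbf(x, xp))
```

No finite $F_{\rm derived}$ reproduces this kernel exactly, which is precisely why the derived feature space must be allowed to be infinite-dimensional.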

In my explanation of the feature mapping trick, I used the example of Bayesian linear regression. This is only one application out of many. The feature mapping trick can also be applied to other machine learning techniques, such as support vector machines and PCA. In every example, the feature mapping trick turns a linear technique into a technique capable of learning effects non-linear in the original raw inputs. The kernel $\kappa(\mathbf x, \mathbf x') := \Phi(\mathbf x).\Phi(\mathbf x')$ is a mathematical gadget that computes dot products in the higher-dimensional derived feature space in terms of the raw feature vectors $\mathbf x$ and $\mathbf x'$. Finally, Mercer's theorem tells us that, given any positive-definite kernel $\kappa(\mathbf x, \mathbf x')$, it is possible in principle to reverse engineer a feature mapping function $\Phi$ that gives rise to this kernel, provided we allow the number of derived features to be infinite.
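To make the "gadget" concrete, here is a sketch (with an illustrative degree-2 polynomial kernel of my own choosing) showing the kernel computing the derived-space dot product directly from the raw vectors, without ever forming $\Phi$:

```python
import numpy as np

def kappa(x, xp):
    """Polynomial kernel of degree 2: evaluated purely from raw vectors."""
    return (x @ xp + 1.0) ** 2

def phi(x):
    """Explicit feature map with kappa(x, x') = phi(x) . phi(x')."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

x  = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])
print(kappa(x, xp))        # 4.0 -- computed in raw space (R^2)
print(phi(x) @ phi(xp))    # 4.0 -- same value, via the derived space (R^6)
```

For this kernel the derived space is finite (dimension 6), so $\Phi$ can be written down explicitly; for kernels like the RBF, Mercer's theorem guarantees a $\Phi$ exists but it lives in an infinite-dimensional space.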