Why do we need/use kernels?


I am currently reading the book "Pattern Recognition and Machine Learning" by Bishop. However, I have some trouble fully grasping the chapter on "Kernel Methods" and hope someone with better knowledge can clear up my confusion.

In the simplest case, we have a linear regression model: $$y(x) = w^T x + b$$

The weight vector $w$ can then be easily retrieved via the pseudo-inverse, i.e. solving for $w$ we get:

$w = (X^T X)^{-1} X^T Y$
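For reference, here is a minimal sketch of this normal-equations solution in NumPy (the data is synthetic and purely illustrative):

```python
import numpy as np

# Illustrative toy data: N samples as rows, D features as columns
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))           # design matrix
w_true = np.array([1.0, -2.0, 0.5])    # "ground truth" weights (made up)
Y = X @ w_true                         # noiseless targets, so recovery is exact

# Normal-equations / pseudo-inverse solution: w = (X^T X)^{-1} X^T Y
w = np.linalg.solve(X.T @ X, X.T @ Y)
```

With noiseless targets, `w` recovers `w_true` exactly (up to floating-point error).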

This model has obvious problems with non-linear data, so we introduce a non-linear feature-space mapping $\phi(x)$, however $\phi$ is chosen. We would then have to compute: $w = (\phi(X)^T \phi(X))^{-1} \phi(X)^T Y$ $ \ \ \ \ $ (1)

Then by definition the kernel function is $k(x,x') = \phi(x)^T \phi(x')$.

The general idea is: the input vector $x$ enters only in the form of scalar products. That scalar product can then be replaced with some other choice of kernel.
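To make the "scalar products only" point concrete, here is a small example of my own (not from Bishop's text): for 2-D inputs, the quadratic kernel $k(x,z) = (x^T z)^2$ equals a scalar product in an explicit 3-D feature space, but the kernel computes it without ever forming that space.

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map for 2-D input:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, z):
    # The same quantity computed implicitly: k(x, z) = (x^T z)^2
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
# phi(x) @ phi(z) and k(x, z) agree, yet k never builds the feature vectors
```

For higher-degree polynomials or the Gaussian kernel, the explicit feature space grows huge or infinite while the kernel evaluation stays cheap.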

Why would we want that?

Then:

Working directly with kernels avoids the explicit introduction of $\phi(x)$, allowing the implicit use of a higher-dimensional feature space.

To me, this just sounds like we replaced $\phi$ by a kernel function, and I don't see what we gain by doing this. After some more simplifications we arrive at: $y(x) = k(x)^T (K + \lambda I_N)^{-1} Y$, where $k(x)$ is a vector with entries $k_n(x) = k(x_n,x)$, $K$ is the Gram matrix with $K_{nm} = \phi(x_n)^T \phi(x_m)$, and $\lambda \ge 0$ $ \ \ \ \ $ (2)

Now we have gotten rid of $w$ and $\phi$.
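For what it's worth, here is how I would implement (2) directly, as a sketch (the RBF kernel, the data, and all names are my own illustrative choices): only the Gram matrix $K$ and the vector $k(x)$ ever appear, and $\phi$ is never computed, even though for the RBF kernel it would be infinite-dimensional.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Gaussian (RBF) kernel between two sample matrices (rows are samples)
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))   # toy 1-D inputs
Y = np.sin(X[:, 0])                    # toy non-linear targets

lam = 1e-3
K = rbf(X, X)                          # Gram matrix, K_nm = k(x_n, x_m)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)   # (K + lam*I)^{-1} Y

def predict(x_new):
    # Equation (2): y(x) = k(x)^T (K + lam*I)^{-1} Y, with k_n(x) = k(x_n, x)
    return rbf(x_new, X) @ alpha
```

No weight vector $w$ and no feature map $\phi$ appear anywhere, yet the model fits a non-linear function.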

However, comparing $(1)$ and $(2)$, I am not seeing the usefulness. In $(1)$, $w$ and $\phi(X)$ clearly depend on one another, so for any $\phi$ we choose we would be able to construct a meaningful $w$. In $(2)$, however, we got rid of all of that and are basically left with just a kernel function and a Gram matrix that hopefully suit the problem, with no further adaptations. I find this not very intuitive. So why do we need / use kernels? How do they help tackle a problem?