Why use the kernel trick in a support vector machine as opposed to just transforming the data and then using a linear classifier? Certainly, we'll approximately double the amount of memory required to hold the data, original plus transformed, but beyond that it seems like the amount of computation remains about the same. What are the other considerations?
The typical kernels used for kernel methods induce an infinite-dimensional reproducing kernel Hilbert space! So it's not just double the memory: storing the transformed data explicitly would require infinite memory.
The kernel trick says that given your data $x_i \in \mathbb{R}^n$, $i \in \{1, \ldots, m\}$, and a kernel $k : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ with the right property (see below), there is a non-linear transformation $\phi : \mathbb{R}^n \to \mathbb{R}^{m}$ such that $k(x_i,x_j) = \langle \phi(x_i),\phi(x_j)\rangle$.
Let $K_{ij} = k(x_i,x_j)$ be the $m \times m$ kernel (Gram) matrix. The required property of $k(\cdot,\cdot)$ is that $K$ is positive semi-definite, so we can diagonalize it as $K = P D P^T$ with $D \geq 0$. Letting $Y = P D^{1/2}$, we have $Y Y^T = P D^{1/2} (P D^{1/2})^T = P D P^T = K$, i.e. $\phi(x_i) = Y_{i\cdot}$ (the $i$th row of $Y$).
That is to say, you can do what you propose when $m$ is small. Usually, though, $n$ is small but $m$ is very large, so it is not practical to actually compute $P$, $D$, and $\phi$ (instead one computes only the few largest eigenvalues and eigenvectors of $K$, as in kernel PCA and spectral clustering).
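The construction above can be checked numerically for a small dataset. The sketch below (using an RBF kernel purely as an illustration; the dataset and parameter values are made up) builds the Gram matrix, factors it as $K = Y Y^T$, and verifies that the rows of $Y$ act as explicit features:

```python
import numpy as np

# Hypothetical small dataset: m = 5 points in R^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

# An RBF kernel as an example of k(., .); gamma is an arbitrary choice.
def k(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

m = X.shape[0]
K = np.array([[k(X[i], X[j]) for j in range(m)] for i in range(m)])

# Diagonalize K = P D P^T (eigh applies since K is symmetric PSD).
D, P = np.linalg.eigh(K)
D = np.clip(D, 0.0, None)        # clip tiny negative eigenvalues from round-off
Y = P @ np.diag(np.sqrt(D))      # rows of Y are the explicit features phi(x_i)

# Check: <phi(x_i), phi(x_j)> reproduces the kernel matrix.
assert np.allclose(Y @ Y.T, K)
```

Note that this $\phi$ is only defined on the training points; that is another reason the factorization is of limited use as a substitute for the kernel trick.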
You can use infinite-dimensional spaces with the kernel trick.
You might want to read my SVM summary and especially What is an example of a SVM kernel, where one implicitly uses an infinity-dimensional space?
Given all the existing answers, I would like to add more to this story. In fact, in modern large-scale systems people do use data transformation directly instead of the classical kernel trick, whose exact solvers take $O(N^3)$ time and are far too slow.
However, as typical kernels (e.g., RBF) are infinite-dimensional, one cannot compute $x\mapsto \phi(x)$ exactly, since the image of the mapping is infinite-dimensional. Instead, one uses approximations $x\mapsto\tilde \phi(x)$ of moderate finite dimension such that $\langle\tilde\phi(x),\tilde\phi(x')\rangle\approx \langle\phi(x),\phi(x')\rangle_{\mathcal H} = k(x,x')$. This approach is called random Fourier features, or more colorfully, random kitchen sinks. You can google both of them to get a good idea of what they are doing.
Kernel methods allow you to separate your data in a higher-dimensional space without having to actually transform the data, which often requires less computation than transforming the data first. The idea is that you only care about how the vectors compare in the other space, i.e. you only care about their inner products, not what the actual representations of those vectors are in that space. A good example of why this is better is the radial basis function kernel: it corresponds to a projection into an infinite-dimensional space, which you certainly cannot apply to your data directly.
So your reasoning is correct that you could just transform the data and get the same result, but the assumption that the computations are basically the same is not.
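For a kernel with a finite-dimensional feature map the two routes can be compared directly. The sketch below (a degree-2 polynomial kernel on $\mathbb{R}^2$, chosen because its explicit map is small enough to write out; the points are made up) shows that the kernel trick gives the same inner product without ever forming the transformed vectors:

```python
import numpy as np

# Degree-2 polynomial kernel k(x, y) = (x . y)^2 on R^2.
# Its explicit feature map is phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) in R^3.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

explicit = phi(x) @ phi(y)   # transform first, then take the inner product
trick = (x @ y) ** 2         # kernel trick: one dot product, then square

assert np.isclose(explicit, trick)  # both equal 121.0
```

For degree $p$ in dimension $n$ the explicit map has $\binom{n+p-1}{p}$ coordinates, while the kernel evaluation stays at $O(n)$; for the RBF kernel the explicit route is not available at all.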