Kernel trick in the primal support vector machine problem


I have recently found the following statement of the support vector machine problem in a paper while doing research for a seminar. Let $P$ be the matrix whose rows are the positive samples, $N$ the matrix whose rows are the negative samples, and let $w, \gamma$ define the classifier: $$\begin{aligned} \min_{w,\gamma,y,z}\quad &(1-\lambda)(1^Ty+1^Tz)+\frac{\lambda}{2}\lVert w\rVert_2^2\\ \text{s.t.}\quad &-Pw+1\gamma+1\le y,\\ &Nw-1\gamma+1\le z,\\ &y,z\ge 0. \end{aligned}$$ The way I understand the kernel trick is that if we want to map our data into a higher-dimensional space using some transformation $\varphi$ and then solve the SVM problem in that space, we only need a function $k(x_1, x_2)=\varphi(x_1)\cdot \varphi(x_2)$, called a kernel function, and never need to apply $\varphi$ directly.
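For concreteness, here is a small numerical check of that identity (the degree-2 polynomial kernel and the toy 2-D inputs are purely illustrative, not from the paper):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for a 2-D input."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x1, x2):
    """Degree-2 polynomial kernel; equals phi(x1) . phi(x2)."""
    return np.dot(x1, x2) ** 2

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, -1.0])

print(np.dot(phi(x1), phi(x2)))  # 1.0, via the explicit feature map
print(k(x1, x2))                 # 1.0, same value without ever forming phi
```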

Now, usually the kernel trick is introduced as a specific advantage of the dual problem, not the primal. However, in the above statement of the primal SVM problem only dot products of vectors appear: the dot product of $w$ with itself in $\lVert w\rVert_2^2$, and the dot products of the samples with $w$ in the constraints.

EDIT: I understand now that one key advantage is being able to precompute the dot products of the training data before optimizing. But isn't it still possible (and useful) to apply a nonlinear transformation into a higher-dimensional space in the primal?
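For example (my own sketch with made-up data, not from the paper), those pairwise dot products can be collected once into a Gram matrix before any optimization is run:

```python
import numpy as np

# Hypothetical training data: 5 samples with 3 features each.
X = np.random.default_rng(0).normal(size=(5, 3))

# All pairwise dot products of the training samples, computed once up front.
# This is the Gram (kernel) matrix of the linear kernel k(x1, x2) = x1 . x2.
K = X @ X.T          # shape (5, 5), K[i, j] = x_i . x_j

print(K.shape)
```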

Is there a reason why we can't replace the dot products in the problem with the kernel function, thereby applying the kernel trick in the primal?

BEST ANSWER

The kernel trick has two advantages:

  1. It allows you to summarize the relevant aspects of the data, in the sense that you do not need the original data anymore as long as you have the kernel (Gram) matrix.

  2. You can apply a nonlinear transformation to the data without explicitly stating what the transformation is, e.g., by using the Gaussian kernel (see the sketch below).
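As an illustration of the second point (a minimal sketch; the bandwidth and the sample points are arbitrary, not taken from the question), the Gaussian kernel matrix can be computed directly even though the corresponding feature map is infinite-dimensional and is never written down:

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Gram matrix of the Gaussian (RBF) kernel k(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2)).

    The corresponding feature map phi is infinite-dimensional, so it is never
    formed explicitly; only the pairwise kernel values are computed.
    """
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma**2))

# Hypothetical data: 4 samples with 2 features each.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
K = gaussian_kernel(X)
print(K)  # 4x4 kernel matrix; K[i, j] = k(x_i, x_j)
```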

You cannot do either in the primal problem: the dot products there are between $w$ and the samples (and of $w$ with itself), not between pairs of samples, and $w$ lives in the feature space. To kernelize them you would first have to express $w$ as a combination of the mapped samples, which is exactly what leads to the dual.