Understanding why a nonlinear mapping that transforms non-linearly separable samples into linearly separable samples always exists


I'm studying section $5.11$ on support vector machines from Duda and Hart's Pattern Classification. The authors write:

With an appropriate nonlinear mapping $\phi$ to a sufficiently high dimension, data from two categories can always be separated by a hyperplane (Problem 27)

Suppose we have $n$ points of dimension $d$, with $n_1, n_2$ points in classes $1, 2$ respectively. There are many posts that show visualizations of kernels transforming non-linearly separable representations into separable ones (e.g. https://stackoverflow.com/questions/3997454/help-me-understand-linear-separability-in-a-binary-svm), but very little information on how these kernel functions are chosen for examples that are not hand-crafted, and not much in the way of a proof of linear separability in the transformed space.

I'm looking for a hint as to why such a kernel function always exists, assuming no constraints on the dimension of the transformed space.

3 Answers

BEST ANSWER

In the exercise it is not required that the non-linear map $\phi$ is determined by a kernel function. If $x_1,\ldots ,x_{n_1}\in\mathbb{R}^d$ are the samples of class $1$ and $y_1,\ldots ,y_{n_2}$ are those of class $2$, one could map the samples $x_i$ to $(x_i,0)\in\mathbb{R}^{d+1}$ and $y_j$ to $(y_j,1)$, and extend the resulting map somehow to the whole of $\mathbb{R}^d$ if one so wishes. The hyperplane $x_{d+1}=\frac{1}{2}$ separates the two classes.
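This construction is easy to check numerically. Below is a minimal sketch with made-up random data (the sample sizes and dimension are arbitrary assumptions, not from the exercise):

```python
import numpy as np

# Sketch of the answer's construction: append an extra coordinate that is
# 0 for class-1 samples and 1 for class-2 samples. The data are random
# illustrative points, not from the book.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))   # class-1 samples in R^2
y = rng.normal(size=(4, 2))   # class-2 samples in R^2

phi_x = np.hstack([x, np.zeros((len(x), 1))])  # (x_i, 0) in R^3
phi_y = np.hstack([y, np.ones((len(y), 1))])   # (y_j, 1) in R^3

# The hyperplane x_{d+1} = 1/2 separates the two classes by construction:
assert np.all(phi_x[:, -1] < 0.5) and np.all(phi_y[:, -1] > 0.5)
```

Of course, a map defined from the class labels is useless for prediction; it only shows that separability in a higher dimension can always be arranged.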

ANSWER

Claudio Moneo is right: using the Gaussian kernel yields linear separability. One considers the map

$ \phi :X\rightarrow V,\; x\mapsto K(\cdot,x) $

where $K$ is the Gaussian kernel of bandwidth $1$, say, $X=\{x_1,\ldots ,x_{n_1},y_1,\ldots ,y_{n_2}\}$, and $V$ is the vector space generated by the functions $K(\cdot ,x)$. It is known that these functions form a basis of $V$. So we can define a linear functional $f:V\rightarrow\mathbb{R}$ by $f(\phi(x_i)):=-1$ for all $i$ and $f(\phi(y_j)):=1$ for all $j$. The hyperplane $f^{-1}(0)$ then separates the classes.
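Concretely, writing $f = \sum_j \alpha_j K(\cdot, z_j)$ over the samples $z_j$, the coefficients $\alpha$ are found by solving a linear system with the Gram matrix. A minimal sketch with arbitrary illustrative data (the points and labels are assumptions for the demo):

```python
import numpy as np

# Represent the functional f on span{K(., z_j)} by coefficients alpha
# solving K alpha = t, where t_i = -1 for class 1 and +1 for class 2.
rng = np.random.default_rng(1)
Z = rng.normal(size=(6, 2))                 # all samples, both classes
t = np.array([-1., -1., -1., 1., 1., 1.])   # targets per the answer

def gaussian_gram(Z, gamma=1.0):
    """Gram matrix K_ij = exp(-gamma ||z_i - z_j||^2)."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = gaussian_gram(Z)
alpha = np.linalg.solve(K, t)   # solvable: K is nonsingular for distinct points

# f(phi(z_i)) = sum_j alpha_j K(z_j, z_i) hits the targets exactly,
# so f^{-1}(0) separates the two classes on the sample.
assert np.allclose(K @ alpha, t)
```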

ANSWER

I am going to give an answer based on the Gaussian kernel $$K(x,y) = \exp(- \gamma \|x-y \|^2).$$ First note that any PSD kernel is associated with a Hilbert space $H$ that it implicitly maps points into through its feature map $\phi$. This feature map satisfies $$K(x,y) = \langle \phi(x), \phi(y) \rangle$$ for all $x,y \in \mathbb{R}^d$. These Hilbert spaces are actually reproducing kernel Hilbert spaces (RKHS), on which there is a vast literature. You could take a look at Wikipedia if you want.

The Gaussian kernel satisfies the nice property of being universal, which means that for any two compact, disjoint sets $A,B \subset \mathbb{R}^d$ there exists some $w \in H$ such that $$\operatorname{sgn}(\langle w , \phi(x) \rangle) = 1_A(x) - 1_B(x) \quad \text{for all } x \in A \cup B.$$ Note that this implies that we have linearly separated $A$ and $B$ in the space $H$, which is what you wanted (just choose $A,B$ as the subsets of your dataset corresponding to whatever labels your points have).

For the Gaussian kernel, one can show that $H$ is of infinite dimension. A feature map is given by $$\phi(x) = \exp(- \gamma \|x- \cdot \|^2).$$ Note that we map a point to a function (a Gaussian centered at that very point, to be specific)!
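To make "a point becomes a function" tangible, here is a tiny sketch representing $\phi(x)$ as a Python callable (taking $\gamma = 1$ as an assumption):

```python
import numpy as np

# phi(x) is the function z -> exp(-gamma ||x - z||^2), i.e. a Gaussian
# bump centered at x. We return it as a closure.
def phi(x, gamma=1.0):
    x = np.asarray(x, dtype=float)
    return lambda z: np.exp(-gamma * np.sum((x - np.asarray(z, dtype=float)) ** 2))

f = phi([0.0, 0.0])
# Evaluating phi(x) at z reproduces the kernel value K(x, z):
assert np.isclose(f([0.0, 0.0]), 1.0)        # K(x, x) = 1 at the center
assert np.isclose(f([1.0, 0.0]), np.exp(-1.0))  # K(x, z) = exp(-||x - z||^2)
```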

So why is the Gaussian kernel universal? There are more elegant proofs of this (using e.g. the Stone-Weierstrass theorem), but if you trust me that for $n$ distinct points the vectors $v_i = \phi(x_i)$ are linearly independent in the feature space $H$ (which makes sense because no Gaussian can be written as a linear combination of Gaussians centered at other points), then there is a quick way to see this:
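The claimed linear independence is equivalent to the Gaussian Gram matrix being nonsingular (in fact strictly positive definite) for distinct points, which is easy to check numerically. A quick sketch with arbitrary random points:

```python
import numpy as np

# For n distinct points, K_ij = exp(-gamma ||x_i - x_j||^2) should have
# full rank n, i.e. the feature vectors phi(x_i) are linearly independent.
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))   # 8 illustrative points in R^3

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-1.0 * d2)         # gamma = 1, an assumption

assert np.linalg.matrix_rank(K) == len(X)
assert np.all(np.linalg.eigvalsh(K) > 0)  # strictly positive definite
```

(This is a numerical spot check, not a proof; the actual proof of strict positive definiteness of the Gaussian kernel is a classical result.)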

For $n$ linearly independent vectors $v_1,\dots,v_n$ we may simply define the map $$f(v_i) = 1_A(x_i) - 1_B(x_i)$$ and extend it to a linear functional on the linear span of $v_1,\dots,v_n$ in the feature space (and then to all of $H$, e.g. by setting it to zero on the orthogonal complement of the span). By the Riesz representation theorem, there must hence exist $w \in H$ such that $$f(v_i) = \langle w, \phi(x_i) \rangle.$$