I am trying to understand the kernel trick in Machine Learning and I'm struggling a bit with (a simplified version of) Mercer's Theorem. I understand that one can write the Gram matrix $\Phi\Phi^T=:\mathcal{K}$ as a matrix whose entries are given by a kernel function $k$, which is cheaper to evaluate than computing the dot products of feature-mapped observations, since $k:\mathbb{R}^D\times\mathbb{R}^D\rightarrow\mathbb{R}$ while $\langle\cdot,\cdot\rangle:\mathbb{R}^K\times\mathbb{R}^K\rightarrow\mathbb{R}$, with $D\ll K$:
$$\mathcal{K}=\begin{pmatrix}\langle\Phi({\bf{x}}_{1}),\Phi({\bf{x}}_{1})\rangle & \langle\Phi({\bf{x}}_{1}),\Phi({\bf{x}}_{2})\rangle & \cdots &\langle\Phi({\bf{x}}_{1}),\Phi({\bf{x}}_{N})\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle\Phi({\bf{x}}_{N}),\Phi({\bf{x}}_{1})\rangle & \langle\Phi({\bf{x}}_{N}),\Phi({\bf{x}}_{2})\rangle & \cdots &\langle\Phi({\bf{x}}_{N}),\Phi({\bf{x}}_{N})\rangle\end{pmatrix}$$$$=\begin{pmatrix}k(\bf{x}_1,\bf{x}_1) & k(\bf{x}_1,\bf{x}_2) & \cdots &k(\bf{x}_1,\bf{x}_N) \\ \vdots & \vdots & \ddots & \vdots \\ k(\bf{x}_N,\bf{x}_1) & k(\bf{x}_N,\bf{x}_2) & \cdots &k(\bf{x}_N,\bf{x}_N)\end{pmatrix}$$
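As a concrete sketch of this equivalence (using NumPy; the degree-2 polynomial feature map below is just an illustrative choice, not part of the question), one can build $\mathcal{K}$ both ways and check that the entries agree:

```python
import numpy as np

# Degree-2 polynomial feature map for D = 2:
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), chosen so that
# <phi(x), phi(y)> = (x . y)^2 = k(x, y).
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, y):
    return np.dot(x, y) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))  # N = 5 observations in R^2

# Kernel matrix built two ways: via the feature map, and via k directly.
K_phi = np.array([[np.dot(phi(x), phi(y)) for y in X] for x in X])
K_fun = np.array([[k(x, y) for y in X] for x in X])

assert np.allclose(K_phi, K_fun)  # identical up to floating point
```

The point of the kernel trick is that `K_fun` never materializes the $K$-dimensional features: each entry costs one dot product in $\mathbb{R}^D$.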
I am given the following formulation of Mercer's Theorem:
A symmetric function $k(\bf{u},\bf{v})$ can be expressed as a scalar product $$k(\bf{u},\bf{v})=\langle\Phi(\bf{u}),\Phi({\bf{v}})\rangle$$
for some $\Phi$ if and only if the kernel matrix $\mathcal{K}\in\mathbb{R}^{N\times N}$ containing entries $(\mathcal{K})_{i,j}=k(\bf{x}_i,\bf{x}_j)$ is positive semi-definite for any collection $\{\bf{x}_n\}_{n=1}^{N}$.
Now, my question is the following: Is the following statement a valid conclusion from Mercer's Theorem?
- This means that, given any feature map $\Phi$, we can write the entries of our kernel matrix $\mathcal{K}=\Phi\Phi^T$ as a symmetric, positive semi-definite function $k:\mathbb{R}^D\times\mathbb{R}^D\rightarrow\mathbb{R}$ of the base observations $\{\bf{x}_n\}_{n=1}^{N}$.
I'm struggling with the directions of implication in this theorem. Can we always construct a kernel function $k$ from any feature map $\Phi$? For simple feature maps like the polynomial one I can derive $k$ easily. But what if the feature map is more complicated? Or, more precisely: since any Gram matrix is positive semi-definite, is there always a kernel function $k$ one could use instead of the dot product of feature-mapped observations? Or are there feature maps $\Phi$ for which no kernel function $k$ exists?
We can directly show your proposed function $$ k(x,y) = \langle \phi(x),\phi(y)\rangle $$ is positive semi-definite. By linearity of the inner product, note that \begin{align} \sum_{i=1}^n\sum_{j=1}^n c_i c_j k(x_i,x_j) &= \sum_{i=1}^n\sum_{j=1}^n c_i c_j \langle \phi(x_i),\phi(x_j)\rangle \\ &= \left\langle \sum_{i=1}^n c_i \phi(x_i),\sum_{j=1}^n c_j\phi(x_j)\right\rangle \\ &= \left\|\sum_{i=1}^n c_i \phi(x_i)\right\|^2 \geq 0, \end{align} so the kernel matrix is positive semi-definite for any collection of points, which is exactly the condition in Mercer's Theorem. (In general we only get *semi*-definiteness: equality can hold with nonzero $c_i$ whenever the vectors $\phi(x_i)$ are linearly dependent.) Note that this argument works in any inner product space, so the answer to your question is yes: every feature map induces a valid kernel. Conversely, one can begin with a symmetric, positive semi-definite function $k$ and construct a feature map as $\phi(x) = k(x,\cdot)$, the reproducing kernel Hilbert space construction.
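To see this numerically (a sketch using NumPy; the nonlinear feature map below is an arbitrary hypothetical choice), we can build the Gram matrix from any $\phi$ and check that its eigenvalues are nonnegative. With more observations than feature dimensions the matrix is rank-deficient, illustrating why we only get semi-definiteness:

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    # An arbitrary nonlinear feature map R^3 -> R^4 (hypothetical example).
    return np.array([x[0], x[1] * x[2], np.sin(x[0]), np.exp(-x[2]**2)])

X = rng.standard_normal((6, 3))        # N = 6 observations in R^3
Phi = np.array([phi(x) for x in X])    # 6 x 4 feature matrix
K = Phi @ Phi.T                        # Gram matrix, K_ij = <phi(x_i), phi(x_j)>

eigvals = np.linalg.eigvalsh(K)
assert np.all(eigvals >= -1e-10)  # PSD up to floating-point tolerance
```

Since `Phi` has rank at most 4, at least two eigenvalues of the $6\times 6$ matrix `K` are (numerically) zero: the matrix is positive semi-definite but not positive definite.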