Relationship between kernels and metrics


Fix a domain $X$: Let $d : X \times X \rightarrow \mathbb{R}$ be a metric on $X$, with the properties

  • $d(x,y) \ge 0$
  • $d(x,y) = 0 \iff x = y$ for all $x,y$
  • $d(x,y) = d(y,x)$ for all $x,y$
  • $d$ satisfies the triangle inequality

In machine learning, a kernel is a function $K : X \times X \rightarrow \mathbb{R}$ with the properties

  • $K(x,y) = 1 \iff x = y$
  • $K(x,y) = K(y,x)$
  • for all $n$ and all sequences $x_1, \ldots, x_n$, the Gram matrix $\mathbf{K}$ with $\mathbf{K}_{ij} = K(x_i, x_j)$ is positive semidefinite.

A space that admits a kernel is handy in ML, and often if all you have is a distance function, you can compute a kernel-like object in the form

$$ \kappa_d(x, y) = \exp(- \gamma d^2(x,y)) $$
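As a minimal sketch of this construction (using the Euclidean metric and a $\gamma$ value purely as illustrative assumptions), one can build $\kappa_d$ from a metric and check numerically that the resulting Gram matrix is positive semidefinite:

```python
import numpy as np

def kappa_d(x, y, gamma=1.0):
    # kappa_d(x, y) = exp(-gamma * d(x, y)^2), with d the Euclidean metric
    d = np.linalg.norm(x - y)
    return np.exp(-gamma * d ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))       # 20 sample points in R^3

# Gram matrix K_ij = kappa_d(x_i, x_j)
K = np.array([[kappa_d(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)
print(np.all(eigvals > -1e-10))        # PSD up to round-off: True
```

For the squared Euclidean metric this is just the Gaussian (RBF) kernel, which is known to be positive definite; for an arbitrary metric $d$, positive semidefiniteness of $\kappa_d$ is not guaranteed.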

Let's assume I have a kernel available that is much more complex than this kernel $\kappa_d(x, y)$, but for which I do not have an "underlying" metric. For example, I have something like the Levenshtein distance, but where values close to 1 mean high similarity and values close to 0 mean low similarity.

Let's further assume I have a series of objects $x_i$ and $y_i$ for which I can compute the kernel values. Now assume I am able to extract a metric from all the kernel values using an algorithm, e.g. "Force Atlas 2", in a high-dimensional space.

What conditions do I need to place on the kernel to make this possible, i.e. to obtain a metric satisfying the triangle inequality? What conditions do I need to place on the kernel if I want to extract an underlying metric space between the objects I have available?


The question is somewhat related (but the other way around) to Transforming a distance function to a kernel


There are 2 best solutions below


First, let's narrow our attention to Riemannian metrics: "What conditions should your kernel obey to induce a well-defined Riemannian metric?" In this case, your metric will obey the triangle inequality, and it is perhaps the type of metric you're looking for if you're doing machine learning.

First, your kernel should be twice differentiable. Second, you need to show that the metric induced locally by the kernel is quadratic in the displacement (Amari & Wu 1999). That is, let $\Phi(x)$ represent a feature vector, i.e. a (non-linear) projection of your inputs $x\in X$ into a much larger space; then the line element between two nearby points in this space is given in terms of the kernel as,

\begin{align} ds^2 &= \|\Phi(x+dx) - \Phi(x)\|^2\\ &= \kappa_d(x+dx, x+dx) - 2 \kappa_d(x, x+dx) + \kappa_d(x,x)\\ &= g_{uv}\, dx^u dx^v \end{align}

Expanding the second line to second order in $dx$ and equating it to the last line gives,

\begin{align} g_{uv} &= \frac{1}{2} \partial_u \partial_v \kappa_d(x,x) - \left[\partial_u' \partial_v' \kappa_d(x,x')\right]_{x'=x} \end{align}

where primed derivatives act on the second argument $x'$, and $\partial_u \partial_v \kappa_d(x,x)$ denotes differentiation of the diagonal $x \mapsto \kappa_d(x,x)$.

Now, to show that this metric is Riemannian, you should require $ds^2 \propto \|dx\|^2$. An example of a kernel that does not induce a Riemannian metric is the arc-cosine $n=0$ kernel (Cho & Saul 2011).
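As a quick sanity check (my own addition, not part of the answer), consider the Gaussian kernel $\kappa_d(x,x') = \exp(-\gamma \|x - x'\|^2)$ from the question. Its diagonal $\kappa_d(x,x) = 1$ is constant, so the first term of the metric formula vanishes, and

\begin{align}
\left[\partial_u' \partial_v' \kappa_d(x,x')\right]_{x'=x} &= -2\gamma\, \delta_{uv}, \\
g_{uv} &= 0 - (-2\gamma\, \delta_{uv}) = 2\gamma\, \delta_{uv}, \\
ds^2 &= 2\gamma\, \|dx\|^2,
\end{align}

which is proportional to $\|dx\|^2$, so the Gaussian kernel does induce a (flat) Riemannian metric.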

Now, if you open the question up to what constraints you must place on the kernel to induce any metric obeying the triangle inequality, then you need to consider the distances between all points in your manifold and show that, for any two points $x,y \in X$, the distance between $x$ and $y$ equals the infimum of the path lengths over all possible paths in your manifold between $x$ and $y$. For your example of the Levenshtein distance, which acts on strings, you must show that for all word pairs, given the "delete", "substitute", and "insert" operations, the Levenshtein distance is the length of the shortest path between them (which is true).
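A small numerical illustration of this (my own sketch, not part of the answer): compute the Levenshtein distance by dynamic programming over the three edit operations, and spot-check the triangle inequality on a handful of word triples.

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                  # delete ca
                curr[j - 1] + 1,              # insert cb
                prev[j - 1] + (ca != cb),     # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]

words = ["kitten", "sitting", "kitchen", "mitten", ""]
for x in words:
    for y in words:
        for z in words:
            assert levenshtein(x, z) <= levenshtein(x, y) + levenshtein(y, z)
print("triangle inequality holds on the sample")
```

Of course, a finite spot-check is not a proof; the general argument is the shortest-path one sketched above.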


Given a kernel function $k$ normalized so that $k(x,x) = 1$ (as in your definition), this is a metric: $$d(x,y) = \sqrt{1-k(x,y)}$$

It isn't the only one, but it has the merit that it is the natural metric of the Reproducing Kernel Hilbert Space induced by $k$ - or to be exact, $1/ \sqrt{2}$ times that metric.

In that vector space, $k$ is the dot product of the feature maps $\Phi(x)$, and: $$\|\Phi(x)-\Phi(y)\|^2 = k(x,x) - 2k(x,y) + k(y,y)$$
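A sketch of this construction (using the Gaussian/RBF kernel as an example of a normalized kernel with $k(x,x)=1$; the $\gamma$ value is an arbitrary choice): compute $d(x,y) = \sqrt{1-k(x,y)}$ and verify the triangle inequality on random points.

```python
import numpy as np

def k(x, y, gamma=0.5):
    # Gaussian kernel: normalized, so k(x, x) = 1
    return np.exp(-gamma * np.sum((x - y) ** 2))

def d(x, y):
    # 1/sqrt(2) times the RKHS distance, per the answer
    return np.sqrt(1.0 - k(x, y))

rng = np.random.default_rng(1)
pts = rng.standard_normal((15, 2))

# Check d(a, c) <= d(a, b) + d(b, c) for all sampled triples
ok = all(
    d(a, c) <= d(a, b) + d(b, c) + 1e-12
    for a in pts for b in pts for c in pts
)
print(ok)  # True
```

The triangle inequality holds here because $d$ is, up to the constant factor, the norm distance $\|\Phi(x)-\Phi(y)\|$ in the RKHS.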