What is the rigorous justification for using inner products as a function of similarity between two vectors?


In machine learning, it is common to define similarity measures, especially using the so-called kernel function. Kernel functions are defined through inner products of feature vectors:

$$K(x, x') = \langle \phi(x) , \phi(x') \rangle$$

However, I have never been really convinced of what is the justification for interpreting such functions as a similarity measure. What properties do inner products have that are unique or special to them, that they are good candidate functions for defining similarity measures?
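As a quick numerical illustration of the kernel definition above (a sketch in NumPy; the degree-2 polynomial kernel and its explicit feature map are a standard textbook example, not something given in the question), one can check that evaluating $K$ directly agrees with taking the inner product of the mapped vectors:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on R^2:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, xp):
    # K(x, x') = (x . x')^2, computed without ever forming phi
    return np.dot(x, xp) ** 2

x  = np.array([1.0, 2.0])
xp = np.array([3.0, 0.5])

lhs = np.dot(phi(x), phi(xp))   # <phi(x), phi(x')>
rhs = poly_kernel(x, xp)        # (x . x')^2 = 4^2 = 16
print(lhs, rhs)                 # both 16.0
```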

For example, another connection I have noticed is with the concept of orthogonality in linear algebra. In linear algebra, two vectors $p$ and $q$ are considered orthogonal if:

$$ p^T q = 0 $$

Intuitively, they are pointing in directions that are perpendicular to each other, i.e. they have no components in common. One could think of these as independent vectors and hence maximally dissimilar. One could also say they are uncorrelated. This is consistent with the notion of similarity, i.e. that two vectors that are not similar should have a metric reflecting that. However, it's not completely obvious to me why the dot product actually has this property. It's good that it is consistent with this intuition, but to me it's a little mysterious. Is there a profound reason that inner products behave like this? Are there no other candidate functions that have this property? Why are we sticking with inner products and not considering other functions?
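To make the intuition concrete (a small NumPy sketch, not part of the original question): the normalized dot product, i.e. cosine similarity, reads off exactly the parallel/orthogonal/opposite distinctions described above:

```python
import numpy as np

def cosine(u, v):
    # Normalized dot product: +1 same direction, 0 orthogonal, -1 opposite
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
print(cosine(u, np.array([2.0, 0.0])))   # 1.0  (parallel)
print(cosine(u, np.array([0.0, 3.0])))   # 0.0  (orthogonal)
print(cosine(u, np.array([-1.0, 0.0])))  # -1.0 (opposite)
```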

Furthermore, it seems to me that a lot of quantities like variance and covariance (and correlation) also crucially depend on inner products and dot products to justify their interpretations. Why is it so?
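The covariance/correlation connection mentioned above can be verified directly (a NumPy sketch using made-up sample data): sample covariance is an inner product of mean-centered vectors, and the Pearson correlation is exactly the cosine of the angle between them:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 3.1, 4.4, 6.2])

xc, yc = x - x.mean(), y - y.mean()   # center the samples

# Sample covariance as an inner product of centered vectors
cov = np.dot(xc, yc) / (len(x) - 1)

# Pearson correlation = cosine of the angle between the centered vectors
corr = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(np.isclose(cov,  np.cov(x, y)[0, 1]))       # True
print(np.isclose(corr, np.corrcoef(x, y)[0, 1]))  # True
```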

There are 2 best solutions below


Actually, perpendicularity is defined in terms of the inner product. You can define different inner products that give a very different notion of perpendicularity; in fact, another inner product may turn some "almost parallel" vectors into orthogonal ones. On the other hand, this means: whatever otherwise "externally justified" idea of independence/orthogonality we have, we can always express the similarity in terms of an inner product, as long as the similarity depends linearly on both input vectors. Also, one should normalize $\phi$ in such a way that $\phi(x)$ always has constant length under the given inner product. (For example, if one mapped an image simply to the one-dimensional number representing its average brightness, then a totally black image would not be similar to anything, not even itself.)
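The normalization point can be sketched numerically (a toy sketch in NumPy; `phi_brightness` and `normalize` are hypothetical helper names, not from the answer): without normalization, a black image has zero self-similarity, while unit-length features make every nonzero object maximally similar to itself:

```python
import numpy as np

def phi_brightness(img):
    # Hypothetical 1-D feature map: average brightness of an image
    return np.array([img.mean()])

def normalize(f):
    # Rescale features to unit length so each object is maximally
    # similar to itself; leaves a zero vector unchanged
    n = np.linalg.norm(f)
    return f / n if n > 0 else f

black = np.zeros((4, 4))
gray  = np.full((4, 4), 0.5)

# Unnormalized: black is not similar to anything, not even itself
print(np.dot(phi_brightness(black), phi_brightness(black)))  # 0.0

# Normalized: a nonzero image has self-similarity 1
g = normalize(phi_brightness(gray))
print(np.dot(g, g))  # 1.0
```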

We could of course switch to non-linear similarity measures. But then either it is possible to linearize them, or at least to treat everything as linear for "sufficiently small" inputs; otherwise things become much less tractable.


You are putting the cart before the horse. Two objects are not considered to be dissimilar because they are orthogonal under some inner product; rather, the inner product is chosen so that two dissimilar objects would have small inner product.

Going back to the vector space example: inner products can be taken relative to any symmetric positive definite quadratic form. In other words, given any symmetric positive definite matrix $A$, you can define an associated inner product $$ \langle v,w\rangle_A = v^T A w $$ The standard inner product is taken with $A$ being the identity matrix. But you can define an inner product with, say, the matrix $$ A = \begin{pmatrix} 1 & 0 \\ 0 & 100000 \end{pmatrix} $$ Letting $v = (1,1)$ and $w = (1,-1)$, you see that they are orthogonal relative to the standard inner product, but not relative to the $A$-inner product.
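The answer's concrete example can be checked numerically (a NumPy sketch of the computation described above, with a hypothetical helper name `inner_A`):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 100000.0]])   # symmetric positive definite

def inner_A(v, w, A):
    # <v, w>_A = v^T A w
    return v @ A @ w

v = np.array([1.0, 1.0])
w = np.array([1.0, -1.0])

print(np.dot(v, w))        # 0.0: orthogonal under the standard inner product
print(inner_A(v, w, A))    # -99999.0: far from orthogonal under the A-inner product
```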

In applications of linear algebra, which inner product to use for data in a vector space depends on the nature of the problem, and is one of the things that must be decided in mathematical modelling. As a function of similarity, it is up to the modeler to choose a "matrix" that encodes the inner product that correctly reflects the nature of the problem.


As to why inner products are used: this actually has to do with the fact that in the sort of problems where inner products are used to measure similarity, there is a natural "scaling structure" to the data. That is to say, data represented by some vector $v$ is considered to be similar to the data represented by the vector $\lambda v$ for any nonzero scalar $\lambda$.
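This scaling structure is easy to see numerically (a NumPy sketch, not from the answer): the raw inner product of $v$ with $\lambda v$ grows with the scale, but the normalized similarity is constant at 1 for every positive $\lambda$:

```python
import numpy as np

def cos_sim(u, v):
    # Scale-invariant similarity: dot product of unit-normalized vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

v = np.array([3.0, 4.0])
for lam in (0.1, 1.0, 50.0):
    raw = np.dot(v, lam * v)   # grows with the scale lam
    sim = cos_sim(v, lam * v)  # stays at 1.0 for every lam > 0
    print(lam, raw, sim)
```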

In situations where you don't have this natural scaling, actually, frequently inner products are not the correct choice of similarity measure. (One ends up often endowing the space of objects with the structure of a Riemannian manifold, instead of a simple linear space, and considering the Riemannian distance function as a measure of the similarity.)