In machine learning, it is common to define similarity measures, especially via so-called kernel functions. A kernel function is defined through an inner product of feature vectors:
$$K(x, x') = \langle \phi(x) , \phi(x') \rangle$$
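To make the definition concrete, here is a minimal NumPy sketch using the standard quadratic kernel $K(x, x') = (x^T x')^2$ on $\mathbb{R}^2$, whose explicit feature map is known in closed form: the kernel value computed directly matches the inner product of the mapped features.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel K(x, x') = (x . x')^2
    # on R^2: phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2).
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
xp = np.array([3.0, 1.0])

# The kernel computed directly in input space...
k_direct = np.dot(x, xp) ** 2
# ...equals the inner product of the explicitly mapped features.
k_feature = np.dot(phi(x), phi(xp))

print(k_direct, k_feature)  # both equal 25.0
```

This is exactly the "kernel trick" in reverse: the inner product in feature space is computed without ever constructing $\phi(x)$.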
However, I have never been fully convinced by the justification for interpreting such functions as similarity measures. What properties of inner products are unique or special to them, making them good candidates for defining similarity measures?
For example, another connection I have noticed is with the concept of orthogonality in linear algebra. Two vectors are considered orthogonal if:
$$ p^T q = 0 $$
Intuitively, they point in perpendicular directions, i.e. they have no components in common. One could think of them as independent vectors and hence maximally dissimilar; one could also say they are uncorrelated. This is consistent with the notion of similarity, i.e. that two vectors that are not similar should have a metric reflecting that. However, it is not completely obvious to me why the dot product actually has this property. It is good that it is consistent with this intuition, but to me it remains a little mysterious. Is there a profound reason that inner products behave like this? Are there no other candidate functions with this property? Why do we stick with inner products and not consider other functions?
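The geometric content of this intuition is the identity $p^T q = \|p\| \|q\| \cos\theta$: the normalized dot product is exactly the cosine of the angle between the vectors, so orthogonal vectors score $0$ and parallel ones score $\pm 1$. A small sketch:

```python
import numpy as np

# Perpendicular vectors: zero dot product, no shared component.
p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
print(np.dot(p, q))  # 0.0

# More generally, p . q = |p| |q| cos(theta), so the normalized
# dot product recovers the angle between the two vectors.
a = np.array([1.0, 1.0])
b = np.array([1.0, 0.0])
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))  # ~ 45 degrees
```

So "no components in common" is literal: the dot product sums the componentwise overlaps, and for perpendicular vectors every overlap cancels.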
Furthermore, it seems to me that many quantities such as variance, covariance, and correlation also crucially depend on inner products and dot products to justify their interpretations. Why is that?
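That dependence can be made explicit: the Pearson correlation of two samples is precisely the cosine of the angle between the mean-centered data vectors, i.e. a normalized inner product. A sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)  # y partially depends on x

# Pearson correlation = cosine of the angle between the
# mean-centered sample vectors (a normalized dot product).
xc, yc = x - x.mean(), y - y.mean()
cos_angle = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(np.isclose(cos_angle, np.corrcoef(x, y)[0, 1]))  # True
```

Likewise, variance is the squared norm of the centered vector (divided by $n$), so "uncorrelated" literally means "orthogonal after centering".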
Actually, perpendicularity is defined in terms of the inner product. You can define different inner products that give very different notions of perpendicularity; in fact, another inner product may turn some "almost parallel" vectors into orthogonal ones. But on the other hand, this means: whatever otherwise "externally justified" idea of independence/orthogonality we have, we can always express the similarity in terms of an inner product, as long as the similarity depends linearly on both input vectors. Also, one should normalize $\phi$ in such a way that $\phi(x)$ always has constant length under the given inner product. (For example, if one mapped an image simply to the one-dimensional number representing its average brightness, then a totally black image would not be similar to anything, not even to itself.)
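The normalization step can be sketched as follows (the function name `normalized_kernel` is illustrative, not from any library): dividing by $\sqrt{K(x,x)\,K(x',x')}$ yields a cosine-like similarity where every non-zero point has similarity $1$ with itself.

```python
import numpy as np

def normalized_kernel(k, x, xp):
    # Cosine-normalized kernel: K(x, x') / sqrt(K(x, x) * K(x', x')).
    # Equivalent to rescaling phi(x) to unit length in feature space.
    # (A zero vector -- the all-black image above -- cannot be
    # normalized, since K(x, x) = 0 there.)
    return k(x, xp) / np.sqrt(k(x, x) * k(xp, xp))

# Plain dot-product kernel as the example.
dot = lambda a, b: float(np.dot(a, b))

x = np.array([3.0, 4.0])
xp = np.array([6.0, 8.0])  # same direction, twice the length

print(normalized_kernel(dot, x, x))   # 1.0: maximally similar to itself
print(normalized_kernel(dot, x, xp))  # 1.0: parallel vectors coincide after normalization
```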
We could of course switch to non-linear functions. But then either it is possible to linearize them, or at least to treat everything as linear for "sufficiently small" inputs, or things become much less tractable.