Proving something is a metric (information UniFrac in this case).


This question is general and specific at the same time. A function $\delta: X \times X \rightarrow \mathbb{R}$ defined on a set $X$ is a metric if:

  1. $\delta(a, b) \ge 0 $
  2. $\delta(a, b) = 0 \Leftrightarrow a = b$
  3. $\delta(a, b) = \delta(b, a)$
  4. $\delta(a, b) \le \delta(a, c) + \delta(b, c)$

for any $a, b, c \in X$.

Now, for a bit of context consider the weighted UniFrac metric. Given a binary tree with weighted edges and two probability distributions over the edges $A$ and $B$:

$$ U_{w}(A, B) = \frac{ \sum_{i=1}^{n} b_{i} \left| {}^{A}p_{i} - {}^{B}p_{i} \right| }{ \sum_{i=1}^{n} b_{i} }, $$

where $b_{i}$ is the weight (length) of the $i$-th edge, and ${}^{A}p_{i}$ and ${}^{B}p_{i}$ are the probabilities associated with that edge in $A$ and $B$, respectively. This function is widely used in computational ecology as a notion of phylogenetic distance between two samples. It has been shown to be equivalent to the Kantorovich-Rubinstein metric (McClelland et al., 2016), also known as the earth mover's distance.
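For concreteness, the formula above can be sketched in a few lines of Python (the function name `weighted_unifrac` and the toy edge data are my own, not from the papers; probabilities are given as a flat list over the edges):

```python
def weighted_unifrac(b, pA, pB):
    """U_w(A, B) = sum_i b_i * |pA_i - pB_i| / sum_i b_i."""
    num = sum(bi * abs(a, ) if False else bi * abs(a - c) for bi, a, c in zip(b, pA, pB))
    return num / sum(b)

# Two hypothetical samples on a tree with three edges of lengths 1, 2, 1:
print(weighted_unifrac([1.0, 2.0, 1.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))  # 0.25
```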

Recently, a modification has been proposed, dubbed the information UniFrac (Wong et al., 2016):

$$ U_{inf}(A, B) = \frac{ \sum_{i=1}^{n} b_{i} \left| {}^{A}p_{i}\log_{2}{{}^{A}p_{i}} - {}^{B}p_{i}\log_{2}{{}^{B}p_{i}} \right| }{ \sum_{i=1}^{n} b_{i} }, $$

which replaces each raw probability $p$ in the numerator with the term $p\log_{2}{p}$. The authors do not provide a proof that this function is a metric, which is important for many applications (metric MDS, to name one). The function clearly satisfies the first three metric properties, but I'm not sure about the triangle inequality. So far I haven't found a single counter-example violating it, but I suppose there should be a more rigorous approach to this.

P.S.

Some estimated edge probabilities in $A$ and $B$ can be equal to zero. Although the authors do not explicitly discuss how they handle zeros, one can argue that a zero estimate really stands for an infinitesimally small number. In that case, since $\lim_{p \rightarrow 0}{p \cdot \log{p}} = 0$, I replace the corresponding term with zero.
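With that zero-handling convention, the information UniFrac can be sketched as follows (a minimal re-implementation for experimentation; `xlogx` and `information_unifrac` are my own names, not from Wong et al.):

```python
from math import log2

def xlogx(p):
    # Convention from the post: treat 0 * log2(0) as its limit value, 0.
    return p * log2(p) if p > 0 else 0.0

def information_unifrac(b, pA, pB):
    """U_inf(A, B): same shape as U_w, but with p replaced by p * log2(p)."""
    num = sum(bi * abs(xlogx(a) - xlogx(c)) for bi, a, c in zip(b, pA, pB))
    return num / sum(b)

print(information_unifrac([1.0, 1.0], [0.5, 0.5], [0.25, 0.75]))
```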

Update

Thanks to the answer by Daniel Fischer, I now see that instead of focusing on the triangle inequality, I should have paid more attention to the second property. Indeed, if we consider $f(x) = x \cdot \log{x}$ with the convention $f(0) = 0$, then $f$ has two roots: $0$ and $1$. Therefore, if you take a tree with two edges $(b_{1}, b_{2})$ and distributions $(A, B)$ such that ${}^{A}p_{1} = 1$, ${}^{A}p_{2} = 0$ and ${}^{B}p_{1} = 0$, ${}^{B}p_{2} = 1$, then $U_{inf}(A, B) = 0$ while $A \ne B$, rendering $U_{inf}$ a pseudometric.
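The counterexample can be checked directly (self-contained sketch; variable names are mine):

```python
from math import log2

def xlogx(p):
    return p * log2(p) if p > 0 else 0.0  # 0 * log2(0) := 0 by convention

b  = [1.0, 1.0]   # edge lengths b_1, b_2
pA = [1.0, 0.0]   # ^A p: all mass on edge 1
pB = [0.0, 1.0]   # ^B p: all mass on edge 2

# x * log2(x) vanishes at both x = 0 and x = 1, so every term is zero:
u_inf = sum(bi * abs(xlogx(a) - xlogx(c)) for bi, a, c in zip(b, pA, pB)) / sum(b)
print(u_inf)  # 0.0, although A != B: U_inf is only a pseudometric
```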

Best answer

These are instances of a more general construction. A function $\delta \colon X \times X \to \mathbb{R}$ is a pseudometric (in analysis also often called semimetric, but topologists use the name semimetric for something else) if it satisfies conditions 1, 3, and 4. So the difference between a pseudometric and a metric is just that in a pseudometric distinct points may have distance $0$.

Then, if we have any function $f \colon X \to \mathbb{R}$, we obtain a pseudometric $\delta_f$ by setting

$$\delta_f(x,y) = \lvert f(x) - f(y)\rvert\,.$$

It's not important that the codomain of the function is $\mathbb{R}$, if we have $g \colon X \to Y$ with $(Y,d)$ a pseudometric space, then

$$\delta_g(x,y) = d(g(x), g(y))$$

is a pseudometric on $X$. Condition 1 is satisfied since the range of a pseudometric consists of nonnegative real numbers, so $d(g(x), g(y)) \geqslant 0$. The symmetry (condition 3) transfers from $d$ to $\delta_g$, and the triangle inequality also follows from the triangle inequality for $d$:

$$\delta_g(x,z) = d(g(x), g(z)) \leqslant d(g(x), g(y)) + d(g(y), g(z)) = \delta_g(x,y) + \delta_g(y,z)\,.$$
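This pullback argument can be spot-checked numerically, e.g. with $g(x) = x \log_2 x$ (an arbitrary illustrative choice) and $d(a, b) = \lvert a - b \rvert$:

```python
import itertools
from math import log2

# delta_g(x, y) = |g(x) - g(y)| inherits the triangle inequality from | . |,
# regardless of which g we pick.
g = lambda x: x * log2(x) if x > 0 else 0.0
delta_g = lambda x, y: abs(g(x) - g(y))

pts = [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]
ok = all(delta_g(x, z) <= delta_g(x, y) + delta_g(y, z) + 1e-12
         for x, y, z in itertools.product(pts, repeat=3))
print(ok)  # True
```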

It is straightforward to verify that if $\delta$ is a pseudometric and $c \geqslant 0$ then $c\cdot \delta$ is a pseudometric too, and that the sum of pseudometrics (if we have a sum of infinitely many pseudometrics, we need to impose convergence constraints) is again a pseudometric, using the nonnegativity, symmetry, and triangle inequality for each term.

Your examples are

$$U_{\ast}(A,B) = \sum_{i = 1}^n c_i \delta_{f_i}(A,B)$$

where

$$c_i = \frac{b_i}{\sum_{k = 1}^n b_k}$$

and $f_i(C) = {}^Cp_i$ in the first case, and $f_i(C) = {}^Cp_i \cdot \log {}^Cp_i$ (with the convention $0 \cdot \log 0 = 0$) in the second. Thus the triangle inequality holds for all such constructions.
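A quick numerical sanity check of this decomposition, on toy data of my own (with `xlogx` implementing the $0 \cdot \log 0 = 0$ convention): $U_{inf}$ computed directly equals the weighted sum of the per-edge pseudometrics $c_i \, \lvert f_i(A) - f_i(B) \rvert$.

```python
from math import log2

def xlogx(p):
    return p * log2(p) if p > 0 else 0.0

b  = [1.0, 2.0, 1.0]       # hypothetical edge lengths
pA = [0.2, 0.3, 0.5]
pB = [0.5, 0.5, 0.0]

# Direct evaluation of U_inf:
direct = sum(bi * abs(xlogx(a) - xlogx(q)) for bi, a, q in zip(b, pA, pB)) / sum(b)

# Same quantity as a convex combination of per-edge pseudometrics:
c = [bi / sum(b) for bi in b]
decomposed = sum(ci * abs(xlogx(a) - xlogx(q)) for ci, a, q in zip(c, pA, pB))

print(abs(direct - decomposed) < 1e-12)  # True
```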

In this construction, we obtain a metric if and only if for all $x \neq y$ there is an $i$ such that $d(f_i(x), f_i(y)) > 0$. Since we're using the metric $d(a,b) = \lvert a - b\rvert$ on the codomain, that is the case if and only if for all $x \neq y$ there is an $i$ with $f_i(x) \neq f_i(y)$.