The Meaning of Zero Probability in Fisher-Rao Metric


I am reading this article, and in its first sections the geometry of classical parameter estimation is discussed. The Fisher–Rao metric and the associated statistical distance are introduced, and an expression for this metric is given as:

$ds_{FR}^2=\sum_{j}\frac{(dp^j)^2}{p^j}$

An explanation is given in the next paragraph:

Note that the statistical distance in this equation diverges when one of the probabilities $p_j$ tends toward zero. This gives us a clue how to interpret the distance between two distributions: when the probability of one of the measurement outcomes is strictly zero, then obtaining that measurement outcome will allow us to infer with certainty that the system is governed by the other probability distribution.

A figure is given for further clarification:

[Figure: a probability simplex with two distributions $P_A$ and $P_B$]

with the following caption:

The distance between probability distributions $P_A$ and $P_B$ diverges when one of them $(B)$ lies on the hull of the simplex.

I have difficulty understanding what the authors are trying to convey. When one of the probabilities $p_j$ is exactly zero, the expression for the Fisher-Rao metric is undefined (division by zero). If we instead only let it approach zero, then clearly the expression tends to infinity. But I fail to see the connection between the "certainty" mentioned in the text and this diverging distance, nor do I see any infinite distance on the probability simplex in the provided figure. The explanation says that if an outcome has probability zero under one distribution, then observing that outcome lets us conclude with certainty that the system is governed by the other distribution. Fair enough, but I don't understand how this relates to the undefined or diverging distance, or to the provided image and its caption.

The Fisher-Rao metric is a distance measure on the space of probability distributions. As explained in the same section of the article, this metric is closely related to the Euclidean metric, for which we have an intuitive understanding. For any pair of distinct probability distributions containing NO zero entries, we can find a positive real number quantifying their distance and draw it on the probability simplex. When one of two distinct distributions contains a zero, intuitively they should still have a positive, finite distance, and the picture respects this intuition: $P_A$ and $P_B$ appear a finite distance apart in the figure. However, the equation suggests a diverging distance in this case, and I fail to reconcile this with the intuitive picture.

What kind of distance does the Fisher-Rao metric give? Can we extend its definition to two distinct probability distributions containing a zero probability (not approaching zero, but exactly zero)? How do we interpret the outcome of this metric when probabilities approach zero, and what exactly are the authors trying to communicate?
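To make my confusion concrete, here is a small numerical sketch of my own (not from the article). I integrate the line element $ds$ along the straight path $p(t) = (t, 1-t)$ from the interior point $(1/2, 1/2)$ toward the boundary point $(0, 1)$. The integrand blows up like $1/\sqrt{t}$ near the boundary, yet the total length appears to converge to a finite number, which only adds to my confusion about the claimed divergence:

```python
import numpy as np

# Fisher-Rao speed along the path p(t) = (t, 1 - t) on the 2-outcome simplex:
#   ds^2 = dp1^2/p1 + dp2^2/p2,  with dp1 = dt and dp2 = -dt,
#   so  ds = sqrt(1/t + 1/(1 - t)) |dt|.
def speed(t):
    return np.sqrt(1.0 / t + 1.0 / (1.0 - t))

# Midpoint rule from t = 1/2 down to t = 0; the midpoints avoid
# evaluating at the singular endpoint t = 0.
n = 2_000_000
t = (np.arange(n) + 0.5) * (0.5 / n)
length = np.sum(speed(t)) * (0.5 / n)

print(length)  # finite, close to pi/2 -- despite the singular integrand
```

So the path length to the boundary seems finite even though the metric itself diverges there.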

There are 2 answers below.

Answer 1:

Using barycentric coordinates for the simplex, a point on the hull has at least one coordinate equal to zero. The diagram shows $P_B$ on the hull, which means that its $p_2$ component is zero, so it can be described entirely in terms of $p_1$ and $p_3$. This zero is why the division by zero in the metric is tied to the hull. In some sense the hull of the simplex plays the role of infinity, much as the point at infinity is constructed in stereographic projection, so points on the hull are infinitely far from the interior.
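As a small concrete sketch (my own illustration, not part of the original answer), hull membership in barycentric coordinates is just a check for a zero component:

```python
import numpy as np

# Barycentric coordinates on the 2-simplex: a distribution (p1, p2, p3)
# with p1 + p2 + p3 = 1. A point lies on the hull (boundary) exactly
# when at least one coordinate is zero.
def on_hull(p, tol=1e-12):
    p = np.asarray(p, dtype=float)
    assert abs(p.sum() - 1.0) < 1e-9, "not a point of the simplex"
    return bool(np.any(p < tol))

print(on_hull([1/3, 1/3, 1/3]))  # False: interior point, like P_A
print(on_hull([1/2, 0.0, 1/2]))  # True: p_2 = 0, on the hull, like P_B
```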

By "diverges" the author means "does not converge to a real number", which in this case happens because the expression approaches infinity. Divergence does not always mean approaching infinity: the sequence $1,-2,3,-4,\ldots$ diverges but does not approach anything. In this context, however, it means precisely "diverges to infinity".

Answer 2:

Not sure if my answer addresses what you are interested in understanding, but here's a go anyway.

In general, a singular metric can have two different causes:

  1. the underlying geometry can be singular, eg a point with infinite curvature like a cusp or the tip of a cone;

  2. the underlying geometry is smooth, but the map/parametrisation is singular.

Einstein struggled with this in the context of the black-hole metric: the singularity at the "centre" of the black hole corresponds to genuinely singular geometry, where the curvature goes to infinity, whereas at the event horizon (the "surface" of the black hole) only the parametrisation is singular. This is somewhat analogous to the way the poles of the globe are stretched out into lines on a longitude-latitude map, except that here the geometry is compressed rather than stretched.

Looking only at the metric, it is not always obvious whether the infinity is due to the actual geometry or is just a figment of the parametrisation. However, in this particular case, there is a simple reparametrisation that answers this, which is actually presented in the article.

Let $q_j = 2\sqrt{p_j}$. Then $dq_j = dp_j/\sqrt{p_j}$, which turns the metric into $$ ds^2 = \sum_{j=1}^n \frac{dp_j^2}{p_j} = \sum_{j=1}^n dq_j^2, $$ the Euclidean metric. The surface $\sum_j p_j = 1$ with $p_j\ge0$ then corresponds to $\sum_j q_j^2 = 4$ with $q_j\ge 0$, which is simply the part of the radius-2 sphere having non-negative coordinates.

So, if you were to parametrise the statistical model using $q_j$, the distance between models would simply be the Euclidean distance on the sphere. However, under the parametrisation $p_j = q_j^2/4$, the mapping is increasingly compressed the closer you get to the edge, which is what makes the metric expressed in terms of $p_j$ singular at the edge.
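One consequence of this sphere picture (my own addition, not spelled out above) is a closed form for the geodesic Fisher-Rao distance: the great-circle arc length on the radius-2 sphere, $d(p,q) = 2\arccos\big(\sum_j \sqrt{p_j q_j}\big)$. It stays finite even when one of the distributions has a zero entry:

```python
import numpy as np

def fisher_rao_distance(p, q):
    """Geodesic Fisher-Rao distance via the sphere embedding q_j = 2*sqrt(p_j):
    the great-circle arc length 2*arccos(sum_j sqrt(p_j * q_j))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # Bhattacharyya coefficient; clip guards against rounding just above 1.
    bc = np.clip(np.sum(np.sqrt(p * q)), 0.0, 1.0)
    return 2.0 * np.arccos(bc)

p_a = [1/3, 1/3, 1/3]   # interior point
p_b = [1/2, 1/2, 0.0]   # on the hull: p_3 = 0
print(fisher_rao_distance(p_a, p_b))  # finite, despite the singular metric

# The largest possible distance is 2*arccos(0) = pi, attained by
# distributions with disjoint support, e.g. (1, 0) vs (0, 1).
print(fisher_rao_distance([1.0, 0.0], [0.0, 1.0]))
```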

NB: The article uses $x$ instead of $q$, but I'll avoid that since $x$ is often used for sampled values from the model rather than model parameters. Also, the article uses the notation $p^j$ instead of $p_j$ for coordinates of the geometry, which is common in differential geometry, but may be confusing when not familiar with it, so I'll stick with subscripts $p_j$ and $q_j$.

To give an analogy, suppose you have a map showing the northern hemisphere ($z\ge0$) of the unit sphere $x^2+y^2+z^2=1$ projected onto the plane $(x,y)$. Close to the pole, $(x,y,z)=(0,0,1)$, distances on the map correspond well to distances on the sphere. However, close to the equator, $z=0$, a point a small distance $\epsilon>0$ from the equator on the sphere will end up a distance $\sim\epsilon^2$ from the equator on the $(x,y)$ map.
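That $\sim\epsilon^2$ compression is easy to check numerically (a sketch under my reading of the analogy): a point at arc distance $\epsilon$ above the equator projects to a point a distance $1-\cos\epsilon\approx\epsilon^2/2$ inside the equator's image, the unit circle:

```python
import numpy as np

# A point on the unit sphere at arc distance eps above the equator is
# (cos eps, 0, sin eps). Under the projection (x, y, z) -> (x, y) it lands
# at distance 1 - cos(eps) inside the unit circle (the equator's image),
# which shrinks like eps^2 / 2: the map compresses geometry near the edge.
for eps in [0.1, 0.01, 0.001]:
    x, y, z = np.cos(eps), 0.0, np.sin(eps)
    map_dist = 1.0 - np.hypot(x, y)          # distance to the unit circle on the map
    print(eps, map_dist, map_dist / eps**2)  # ratio tends to 1/2
```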

The two quotes from the article that you are trying to understand do not strike me as very clear explanations.

The first simply states that if you fix one parameter to zero, say $p_k=0$ for some $k$, then the distance depends on the remaining parameters. Perhaps it is just a somewhat clumsy way of saying that even at $p_k=0$, the metric restricted to the remaining parameters is still a valid metric: i.e. not infinite (except when the remaining parameters approach zero as well).

The second quote I have some difficulty making sense of. Despite the metric in terms of $p_j$ being singular at the edge, the (metric) distance to the edge is finite, so why they claim that it "diverges" beats me.

I haven't read the full article, so it may well be that the issue is simply the difficulty of explaining differential geometry (the math of curved surfaces/manifolds) within the scope of a short article, to an audience that is not expected to have prior knowledge of that field.