Information Geometry and Divergences


I've been reading Amari's Information Geometry book, and on page $10$ he defines what a divergence is. It goes as follows:

Let us consider two points $P$ and $Q$ in a manifold $M$, of which coordinates are $\xi_P$ and $\xi_Q$. A divergence $D[P:Q]$ is a function of $\xi_P$ and $\xi_Q$ which satisfies certain criteria. We may write it as $$D[P:Q] = D[\xi_P:\xi_Q].$$ We assume that it is a differentiable function of $\xi_P$ and $\xi_Q$.

Definition 1.1 $D[P:Q]$ is called a divergence when it satisfies the following criteria:

  1. $D[P:Q] \geq 0$.
  2. $D[P:Q] = 0$, when and only when $P = Q$.
  3. When $P$ and $Q$ are sufficiently close, by denoting their coordinates by $\xi_P$ and $\xi_Q = \xi_P + d\xi$, the Taylor expansion of $D$ is written as $$D[\xi_P : \xi_P + d\xi] = \frac12 \sum g_{ij}(\xi_P)d\xi_id\xi_j + O(|d\xi|^3),$$ and matrix $G = (g_{ij})$ is positive-definite, depending on $\xi_P$.

I have several questions. The first two items are fine, but I don't know:

  • What it means that $P$ and $Q$ are sufficiently close;
  • How to sum a point in $\mathbb{R}^n$ and the differential of a function ($\xi_P + d\xi$);
  • What are the terms $g_{ij}$ and how they depend on $\xi$;
  • How is a number $D[\xi_P : \xi_P + d\xi]$ a sum of differential forms $d\xi_i d\xi_j$;

I don't mind doing things with less formality, but I don't know much of the lingo, so I usually get lost in advanced texts and in things written without the full formal ideas. I'm hoping to fix that flaw during my studies.

There are 1 best solutions below


One thing you should notice is that this definition is given in local coordinates: $\xi_P$ and $\xi_Q$ are the coordinates of $P$ and $Q$, not $P$ and $Q$ themselves. I think the author wants to keep the definition easy and friendly, but he violates a basic rule for defining an object on a manifold: once something is defined in a local coordinate chart, one has to verify that the definition does not depend on the choice of chart. This makes mathematicians feel uncomfortable. (I will give what I think is a proper statement of this condition in the last part of this answer.)

Once you know this, the answers to your four questions are clear:

  • $P$, $Q$ sufficiently close means that both lie in a single coordinate chart and that their coordinate expressions $\xi_P, \xi_Q$ are close together in $\mathbb{R}^n$.
  • here $\mathrm{d}\xi$ is not a differential; it is just a vector in $\mathbb{R}^n$, like $\xi_P$ and $\xi_Q$.
  • $\xi\mapsto (g_{ij}(\xi))$ is a matrix-valued function on the coordinate region (don't even think of it as a Riemannian metric yet). The definition requires that this matrix be positive-definite for each $\xi$.
  • the entire expression $$D\left[\xi_P: \xi_P+d \xi\right]=\frac{1}{2} \sum g_{i j}\left(\xi_P\right) d \xi_i d \xi_j+O\left(|d \xi|^3\right)$$ contains no differentials of functions at all: $\mathrm{d}\xi$ is a vector in $\mathbb{R}^n$, and $\mathrm{d}\xi_i$ is its $i$-th component.
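To make the last bullet concrete, here is a minimal numerical sketch. It assumes a specific divergence that the quoted definition does not single out: the Kullback-Leibler divergence between Bernoulli distributions, with the success probability as the single coordinate $\xi$. The function names are hypothetical. For this choice $g(\xi) = 1/(\xi(1-\xi))$, and $d\xi$ is nothing more than a small real number.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D[p:q] between Bernoulli(p) and Bernoulli(q) (illustrative choice)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def quadratic_approx(p, d_xi):
    """Leading Taylor term (1/2) g(p) d_xi^2, with g(p) = 1 / (p (1 - p))."""
    g = 1.0 / (p * (1.0 - p))
    return 0.5 * g * d_xi ** 2

p = 0.3
for d_xi in (1e-1, 1e-2, 1e-3):
    exact = kl_bernoulli(p, p + d_xi)      # an ordinary number
    approx = quadratic_approx(p, d_xi)     # quadratic form in the number d_xi
    # The gap should shrink roughly like |d_xi|^3.
    print(d_xi, exact, approx, abs(exact - approx))
```

Each time $d\xi$ shrinks by a factor of 10, the gap between the two columns should shrink by roughly a factor of 1000, which is what the $O(|d\xi|^3)$ remainder expresses.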

Actually, to make this definition coordinate-free and rigorous, this third condition can be written as:

For every $P \in M$, $P$ is a non-degenerate local minimum point of the function $D(P,\cdot): M \to \mathbb{R}$.

(Note that here I consider $D$ as a smooth function $M\times M \to \mathbb{R}$, so $D(P,\cdot): M \to \mathbb{R}$ is a smooth, real-valued function.)

Here, writing $f = D(P,\cdot)$, a local minimum point means, of course, that there exists a neighborhood $U$ of $P$ such that $f(P)$ is the minimum of $f(U)=\{f(Q): Q\in U\}$. It is not hard to prove that a local minimum point is a critical point, meaning that $v(f)=0$ for every $v\in T_PM$, or, put more simply, $df|_P=0$.

And "non-degenerate" has the following meaning: for a smooth function $f: M\to \mathbb{R}$ with a critical point $p$, we say that $p$ is non-degenerate if, for some coordinate chart $(U; x^i)$ around $p$, the Hessian $\left(\frac{\partial^2 f}{\partial x^i \partial x^j}\Big|_p\right)$ is non-singular. It is easy to check that this definition is independent of the choice of coordinates. Moreover, if $f$ has a local minimum at $p$, then the Hessian is positive semi-definite, so a non-singular Hessian there is positive-definite (again, this is independent of the choice of coordinates); in fancier language, the Morse index of $f$ at $p$ is zero.
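As a sketch of that coordinate check (here $y^k$ is a second, hypothetical coordinate system around $p$, introduced only for illustration), the chain rule gives $$\frac{\partial^2 f}{\partial x^i \partial x^j} = \sum_{k,l}\frac{\partial y^k}{\partial x^i}\frac{\partial y^l}{\partial x^j}\frac{\partial^2 f}{\partial y^k \partial y^l} + \sum_k \frac{\partial f}{\partial y^k}\frac{\partial^2 y^k}{\partial x^i \partial x^j}.$$ At a critical point every $\frac{\partial f}{\partial y^k}$ vanishes, so the last sum drops out and the two Hessians are related by $H_x = J^{\mathsf T} H_y J$, where $J = \left(\frac{\partial y^k}{\partial x^i}\right)$ is invertible. Hence one Hessian is non-singular (or positive-definite) exactly when the other is.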


And another comment: later on the author defines the Riemannian metric on this manifold directly by $$ d s^2=2 D[\boldsymbol{\xi}: \boldsymbol{\xi}+d \boldsymbol{\xi}]=\sum g_{i j} d \xi_i d \xi_j $$ To define this metric rigorously, we need to proceed as follows:

First, note that for a smooth manifold $M$ and a twice-differentiable function $f:M\to \mathbb{R}$, if $p$ is a critical point of $f$, then for any vector $v \in T_pM$ the second derivative "$\frac{\partial^2 f}{\partial v^2}(p)$" does not depend on coordinates: we can define it as $$ \frac{\partial^2 f}{\partial v^2}(p) = \lim_{t\to 0}\frac{2\left(f(\gamma(t))-f(p)\right)}{t^2},\quad \gamma(0)=p,\ \gamma'(0) = v, $$ where $\gamma$ is a smooth curve in $M$. Working in an arbitrary coordinate chart, we can see that the choice of $\gamma$ does not matter for this definition as long as it satisfies $\gamma(0)=p, \gamma'(0) = v$ (the term involving $\gamma''(0)$ is multiplied by $df|_p = 0$). So $v\mapsto \frac{\partial^2 f}{\partial v^2}(p)$ is a well-defined map from $T_pM$ to $\mathbb{R}$, and $(v,w)\mapsto (\frac{\partial^2 f}{\partial (v+w)^2}(p)-\frac{\partial^2 f}{\partial (v-w)^2}(p))/4$ is a symmetric bilinear function on $T_pM$ (in a local coordinate chart, the matrix of this bilinear function is just the Hessian matrix at $p$). Moreover, if $p$ is a non-degenerate local minimum, then this bilinear form is an inner product.
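A small numerical sketch of this curve-independence, in a single chart on $\mathbb{R}^2$. The function `f`, the matrix `H`, and the two curves are hypothetical choices made only for illustration. The limit is taken with the normalization $2\,(f(\gamma(t)) - f(p))/t^2$, which is the one that reproduces the Hessian quadratic form $v^{\mathsf T} H v$.

```python
import numpy as np

# Hypothetical function with a critical point at the origin and Hessian H there.
H = np.array([[2.0, 1.0], [1.0, 6.0]])

def f(x):
    # The cubic term does not change the Hessian at the origin.
    return 0.5 * x @ H @ x + x[0] ** 3

def second_derivative(f, p, curve, t=1e-5):
    """Approximate 2 (f(gamma(t)) - f(p)) / t^2 for a curve with gamma(0) = p."""
    return 2.0 * (f(curve(t)) - f(p)) / t ** 2

def bilinear(f, p, v, w, t=1e-5):
    """Polarization: (Q(v + w) - Q(v - w)) / 4, using straight-line curves."""
    q = lambda u: second_derivative(f, p, lambda s: p + s * u, t)
    return 0.25 * (q(v + w) - q(v - w))

p = np.zeros(2)
v = np.array([1.0, -2.0])
w = np.array([3.0, 1.0])
a = np.array([5.0, 3.0])  # arbitrary "acceleration" to bend the second curve

straight = lambda t: p + t * v
bent = lambda t: p + t * v + t ** 2 * a   # same gamma(0) and gamma'(0)

q_straight = second_derivative(f, p, straight)
q_bent = second_derivative(f, p, bent)
print(q_straight, q_bent, v @ H @ v)          # all three should nearly agree
print(bilinear(f, p, v, w), v @ H @ w)        # polarization recovers v^T H w
```

If `p` were moved to a point where the gradient of `f` is non-zero, the two curves would no longer give the same value, which is the obstruction described in the next paragraph.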

(Also notice that this definition of the "second-order derivative" cannot be extended to non-critical points of the function: if $\mathrm{d} f|_p \neq 0$, then in order to use the definition above we would have to subtract from $f$ some function $g$ with $\mathrm{d}g|_p = \mathrm{d}f|_p$, and the problem is that there is no natural way to select this $g$ uniquely without relying on other structures on the manifold, such as an affine structure.)

As I said before, $D$ is a smooth function $M\times M\to \mathbb{R}$, and for each $P\in M$ the function $D(P,\cdot):M\to \mathbb{R}$ has a non-degenerate local minimum at $P$ (we will denote $D(P,\cdot)$ by $D_P$ for convenience). Therefore, $$v\mapsto \frac{\partial^2 D_P}{\partial v^2}(P)$$ is a well-defined real-valued function on $T_PM$, and $$ g_P(v,w) = \frac{1}{4}\left(\frac{\partial^2 D_P}{\partial {(v+w)}^2}(P)-\frac{\partial^2 D_P}{\partial {(v-w)}^2}(P)\right) $$ is a well-defined inner product on $T_PM$.
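As an illustration of this construction, the sketch below computes $g_P$ numerically for one assumed divergence: the Kullback-Leibler divergence on distributions over three outcomes, in the chart $\xi = (p_1, p_2)$. This choice and all function names are hypothetical, not taken from the quoted text. The directional second derivative is taken as $2\,(D_P(\gamma(t)) - D_P(P))/t^2$ along straight lines in the chart, so that the result matches the $g_{ij}$ in the author's expansion. For this divergence the expected answer is the Fisher information matrix $g_{ij} = \delta_{ij}/p_i + 1/p_3$.

```python
import numpy as np

def full(xi):
    """Chart coordinates xi = (p1, p2) -> probability vector (p1, p2, p3)."""
    return np.array([xi[0], xi[1], 1.0 - xi[0] - xi[1]])

def kl(xi_p, xi_q):
    """KL divergence D[P:Q] in chart coordinates (illustrative choice of D)."""
    p, q = full(xi_p), full(xi_q)
    return float(np.sum(p * np.log(p / q)))

def second_derivative(xi, v, t=1e-4):
    """Approximate 2 (D_P(gamma(t)) - D_P(P)) / t^2 along gamma(t) = xi + t v."""
    return 2.0 * (kl(xi, xi + t * v) - kl(xi, xi)) / t ** 2

def metric(xi, v, w, t=1e-4):
    """g_P(v, w) by polarization of the directional second derivative."""
    return 0.25 * (second_derivative(xi, v + w, t) - second_derivative(xi, v - w, t))

xi = np.array([0.2, 0.5])
e = np.eye(2)
G = np.array([[metric(xi, e[i], e[j]) for j in range(2)] for i in range(2)])

# Closed form for comparison: delta_ij / p_i + 1 / p_3.
fisher = np.diag(1.0 / xi) + 1.0 / (1.0 - xi.sum())
print(G)
print(fisher)
```

The two printed matrices should agree up to the finite-difference error, and both are positive-definite, as the definition requires.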

Finally, letting $P$ range over all points of $M$, we obtain an inner product $g_P$ on every $T_PM$, and this gives a Riemannian metric, which is what the author is trying to define. It is worth noting that we use a different function $D_P$ for each point $P$: if we used the same function $F$ for all $P$, then it would be impossible for every point $P$ to be a non-degenerate critical point of $F$ (non-degenerate critical points are isolated), and thus we could not obtain this Riemannian metric.

These are the rigorous definitions of a divergence function and of the metric induced by it. I hope they are clear enough for mathematicians. However, when it comes to actual calculations, this definition might not be as useful as the author's simple definition in a local coordinate chart.