Uniqueness condition for the em algorithm in information geometry


In his book, Information Geometry and its Applications, p. 28, Amari makes a statement without proof (quoted in the form of an image below) that seems to be incorrect. I would like to know whether the claim is wrong, or if I have understood it incorrectly. If I've misunderstood it, I'd like to know the correct statement, and I'd like to see a proof.

Amari seems to be saying that if we take an $e$-flat manifold $E$ of probability distributions (such as an exponential family) and an $m$-flat manifold $M$ (such as a mixture family) and iterate the following procedure, it will always converge to a unique solution for $Q_t$:

  1. Let $P_0$ be an arbitrary distribution in $M$.
  2. Let $Q_{t} = \mathrm{argmin}_{Q\in E}\, D(P_t\|Q)$.
  3. Let $P_{t+1} = \mathrm{argmin}_{P\in M}\, D(P\|Q_t)$.

However, the following seems to be a counterexample. Consider the manifold of all joint distributions for two binary random variables, $A$ and $B$. That is, all $p_{00}, p_{01}, p_{10}, p_{11}$ such that $p_{00} + p_{01} + p_{10} + p_{11} = 1$.

Then let $E$ be the set of all distributions such that the two variables are independent, i.e. $p(a,b) = \frac{1}{Z}e^{\theta_a a + \theta_b b}$, with $Z$ chosen to normalise the distribution. This is an exponential family with parameters $\theta_a$, $\theta_b$. For an arbitrary $P$, $\mathrm{argmin}_{Q\in E}\, D(P\|Q)$ is given by the product of marginals $p(a)p(b)$.

Then let $M$ be the set of all distributions such that $p(A=B)=1$, i.e. $p(0,0) = m$; $p(1,1) = 1-m$; $p(0,1) = p(1,0) = 0$. This is a mixture family with a single parameter $m$. For an arbitrary $Q$, $\mathrm{argmin}_{P\in M}\, D(P\|Q)$ is given by a distribution of this form with $m = q(0,0)/(q(0,0)+q(1,1))$.
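Both closed-form projections stated above can be written out directly. A minimal sketch (the helper names `project_to_E`, `project_to_M`, and `kl` are mine, not from the book):

```python
import numpy as np

def project_to_E(P):
    """argmin_{Q in E} D(P||Q) over the independence family:
    the minimizer is the product of the marginals of P."""
    pa = P.sum(axis=1)   # marginal of A
    pb = P.sum(axis=0)   # marginal of B
    return np.outer(pa, pb)

def project_to_M(Q):
    """argmin_{P in M} D(P||Q) over the diagonal family
    p(0,0)=m, p(1,1)=1-m: the minimizer has m = q00/(q00+q11)."""
    m = Q[0, 0] / (Q[0, 0] + Q[1, 1])
    return np.array([[m, 0.0], [0.0, 1.0 - m]])

def kl(P, Q):
    """KL divergence D(P||Q), with the convention 0*log(0) = 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

Both functions implement exactly the closed-form minimizers quoted above; a brute-force comparison against other members of each family confirms them.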

If we start with some arbitrary $m_0$ we can see that $q_t(A=0) = q_t(B=0) = m_t$, and hence $m_{t+1} = m_t^2/(m_t^2 + (1-m_t)^2)$. However, this has a repelling fixed point at $m=1/2$. If $m_0<1/2$ then $m_t\to 0$, and if $m_0>1/2$ then $m_t\to 1$.
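Iterating the resulting one-dimensional map makes the repelling fixed point visible (a quick sketch of my own, not from the book):

```python
def em_step(m):
    """One full round of the alternating projections for this example:
    project the diagonal distribution onto E, then back onto M,
    which composes to m -> m^2 / (m^2 + (1-m)^2)."""
    return m**2 / (m**2 + (1.0 - m)**2)

def iterate(m0, n=50):
    """Run n rounds starting from m0 and return the final m."""
    m = m0
    for _ in range(n):
        m = em_step(m)
    return m
```

Starting below $1/2$ the iteration collapses to $0$, starting above it collapses to $1$, while $m=1/2$ itself is a fixed point.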

Geometrically, the situation looks like this: in my plot, the red surface is the independence manifold $E$ and the green dashed line is $M$ as defined above. The two manifolds approach each other at two points, so neither manifold has a unique 'closest' point on the other.

I wondered whether it is the divergence $D_\text{KL}(P_t\|Q_t)$ that converges to a unique value, rather than $P_t$ or $Q_t$. But that seems unlikely, because I imagine one could perturb this exponential family slightly so that it approaches $M$ at one point but stays a finite distance away at the other. (I have not checked this, though, and if I'm wrong it would be interesting.)

To repeat my question: did I interpret Amari's statement correctly? If the statement is incorrect, is there something simple that can be changed to make it true, and where can I find a proof?


Here is the relevant passage in the book, from the top of p. 28. The statement I'm referring to is in the last two sentences.

[image of the passage not reproduced here]

BEST ANSWER

Given a statistical manifold $(S,g,\nabla^e,\nabla^m)$, submanifolds $E,M$, and an initial point $p_0\in E$, we consider the sequence $((p_n,q_n))_{n=1}^{\infty}\subset E\times M$ defined by \begin{align} q_n&=\arg\min_{q\in M} D(p_{n-1}\|q), \\ p_n&=\arg\min_{p\in E} D(p\|q_n), \end{align}

where $D$ is the divergence induced by $\nabla^e$.

In order for this sequence to be well-defined, the minimization problems in each step need to have a unique global minimizer (i.e. no ties). This is achieved if $E,M$ are flat and convex with respect to $\nabla^e$ and $\nabla^m$ respectively. The uniqueness comment in the literature, although admittedly unclear, probably refers to this notion of uniqueness. Absent the conditions above, the sequence could potentially branch at each step. If ties are broken randomly, the random sequence would have a random limit point.

Your question is whether the sequence above has a limit $(p_{\infty},q_{\infty})$ that is independent of the starting point $p_0$. As your example demonstrates, this is false in general. In fact, a plethora of examples can be constructed where there are multiple intersections of $E$ and $M$. For another example, consider the 2-dimensional unit simplex $S_2$ and take $E$ to be the exponential family connecting two corners and passing through the center, and $M$ any mixture family connecting two sides and intersecting $E$ in two points. Then the sequence above will converge to one of the intersections depending on the starting point.
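This simplex example can also be simulated with brute-force projections. A rough sketch under my own concrete choices (not the answerer's exact construction): take the exponential family $(e^\theta, 1, e^{-\theta})/Z$ through two corners and the center of $S_2$, and the mixture segment $\{p : p_1 = 0.2\}$, which crosses the curve at $\theta = \pm\operatorname{arccosh}(2) \approx \pm 1.317$.

```python
import numpy as np

THETAS = np.linspace(-5.0, 5.0, 2001)        # grid over E
LAMS = np.linspace(1e-3, 1.0 - 1e-3, 2001)   # grid over M

def exp_point(theta):
    """Exponential family through two corners and the center of S_2."""
    w = np.array([np.exp(theta), 1.0, np.exp(-theta)])
    return w / w.sum()

def mix_point(lam):
    """Mixture family: the segment from (0.8, 0.2, 0) to (0, 0.2, 0.8)."""
    a = np.array([0.8, 0.2, 0.0])
    b = np.array([0.0, 0.2, 0.8])
    return (1.0 - lam) * a + lam * b

def kl(p, q):
    """KL divergence D(p||q), with the convention 0*log(0) = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def alternate(theta0, n=30):
    """Alternate grid-search projections between E and M, as in the
    sequence above, and return the final theta."""
    theta = theta0
    for _ in range(n):
        p = exp_point(theta)
        lam = LAMS[np.argmin([kl(p, mix_point(l)) for l in LAMS])]
        q = mix_point(lam)
        theta = THETAS[np.argmin([kl(exp_point(t), q) for t in THETAS])]
    return theta
```

Starting from $\theta_0 = 1$ and $\theta_0 = -1$ yields two different limits, one near each intersection, which is exactly the dependence on the starting point described above.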

In relation to the EM algorithm, minimizing $D(E\|M)=\min_{p\in E,q\in M}D(p\|q)$ is equivalent to maximizing the marginal likelihood of the observed data over all the models in $E$, when $M$ is taken to be the specific mixture family induced by the observed data (the data manifold, as Amari calls it). If the limit of the EM sequence were the same regardless of the initialization, this would imply that the marginal likelihood has only one global maximum, which is clearly not always the case.