I'm reading "Information Geometry of the EM and em algorithms for Neural Networks" by Shun-Ichi Amari (link) and trying to make sense of it.
To make it concrete, I'm trying to understand the common example (wikipedia) where we have some real data $X$, and we'd like to fit a mixture of $K$ Gaussians to it, where each Gaussian has its own parameters $(\mu_k, \Sigma_k)$ that we also want to estimate, in addition to the "mixing parameters" $\tau_k$ that determine the probability of a sample being generated by each Gaussian. We assume that for each observed sample $X_i$, there's a latent/unobserved $Z_i$ associated with it that determines which Gaussian $X_i$ was sampled from:

$X_i \mid (Z_i = k) \sim \mathcal N(\mu_k, \Sigma_k)$

$P(Z_i = k) = \tau_k, \quad \sum_{k=1}^K \tau_k = 1$
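For reference, here is the classical algorithm I'm trying to map onto the geometric picture — a minimal sketch of EM for a 1-D mixture of $K$ Gaussians (my own code and naming, not from the paper):

```python
import numpy as np

# A minimal sketch (my own code, not from the paper) of classical EM for a
# 1-D mixture of K Gaussians -- the algorithm the paper reinterprets as
# alternating projections.
def em_gmm_1d(x, K, n_iter=100):
    x = np.asarray(x, dtype=float)
    T = len(x)
    mu = np.quantile(x, np.linspace(0.1, 0.9, K))  # spread initial means over the data
    var = np.full(K, np.var(x))                    # initial variances
    tau = np.full(K, 1.0 / K)                      # initial mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities resp[t, k] = P(Z_t = k | x_t, current params)
        logp = (np.log(tau)
                - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (x[:, None] - mu) ** 2 / var)
        logp -= logp.max(axis=1, keepdims=True)    # for numerical stability
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate tau, mu, var from the responsibilities
        nk = resp.sum(axis=0)
        tau = nk / T
        mu = resp.T @ x / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return tau, mu, var
```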
In the paper, he presents the typical EM algorithm as the "em algorithm" (note the lowercase), where the E and M steps are actually projections from one submanifold of probability distributions to another. Unfortunately, even though he works through an example with the mixture model, I'm having trouble understanding explicitly what's going on.
My understanding is that (in this problem) the main manifold $S$ is a manifold of exponential-family distributions, each uniquely specified by a natural coordinate $\theta$, so any point in $S$ is a PDF $p(r; \theta)$, where $r$ is the random variable. For any $\theta$, we can then calculate the corresponding $\eta$, which is the expectation of $r$ under $p(r; \theta)$, so $\eta$ forms another (dual) coordinate system on $S$.
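To check my understanding of the two coordinate systems, here's a toy computation (my own, for a single 1-D Gaussian rather than the mixture): with sufficient statistics $r = (x, x^2)$, the natural parameters are $\theta = (\mu/\sigma^2, -1/(2\sigma^2))$ and the dual coordinates are $\eta = (\mathbb E[x], \mathbb E[x^2])$:

```python
import numpy as np

# Toy check of the theta <-> eta duality for a single 1-D Gaussian written in
# exponential-family form p(x; theta) ~ exp(theta_1 * x + theta_2 * x^2).
def theta_from_moments(mu, var):
    """Natural parameters theta = (mu/var, -1/(2*var))."""
    return np.array([mu / var, -1.0 / (2.0 * var)])

def eta_from_theta(theta):
    """Dual coordinates eta = (E[x], E[x^2]) of the same distribution."""
    t1, t2 = theta
    var = -1.0 / (2.0 * t2)
    mu = t1 * var
    return np.array([mu, var + mu ** 2])
```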
We want to fit a mixture of $K$ Gaussians, which is a subset of the exponential family, so that forms the "model submanifold" $M$.
I'm hazy on the data submanifold $D$. Since it's a submanifold of $S$, the points of $D$ must also correspond to exponential-family distributions. In the paper he writes that, since the data is divided into a visible part $s_v$ ($X$ above) and a hidden part $s_h$ ($Z$ above), with $r = r(s_v, s_h)$, $D$ is the set:
$D = \{ \hat \eta \mid \hat \eta = r(s_v, s_h) \}$, where $s_v$ is the observed value and $s_h$ can take arbitrary values.
My question regards the (typical) case in which $r$ is composed of many samples, $r_t, t=1, ..., T$. He says (top of page 23/30):
> When the hidden variables $z_t$ are not observed, we can't summarize $r_t, t=1, ..., T$ into a single $\bar r$. We need to treat the product space $S_T^\ast = S_1 \times ... \times S_T$. Partial observation then defines a data submanifold $D_t$, and hence $D_T^\ast = D_1 \times ... \times D_T$.
and
> When $x_t$ is observed but $z_t$ is not, $s_v = x_t$ and $s_h = \{\delta_i(z_t) \}$, and the observed data submanifold $D_t$ is given by $D_t = \{ \pmb {\hat \eta} \mid \hat \eta_0 = x_t, \hat \eta_{1i} = \alpha_i, \hat \eta_{2i} = x_t \alpha_i \}$ where $\alpha_i$ are the free parameters corresponding to the unobserved $\delta_i (z_t)$. ... The $D_t$ is a linear submanifold in $\eta$, but it depends on $x_t$. Hence, $D_t$ is different for each $t$, so that we can't summarize them into a single submanifold $D$ but we need to treat the product $D_T^\ast$. The model manifold is simply given by $\pmb \theta_t = \pmb \theta$. Hence, $M_T^\ast$ is a submanifold of $S_T^\ast$.
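To convince myself that $D_t$ really is linear in the free parameters, here's a tiny sketch (my own construction, scalar $x_t$) that assembles a point of $D_t$ following the quoted coordinates $\hat \eta_0 = x_t$, $\hat \eta_{1i} = \alpha_i$, $\hat \eta_{2i} = x_t \alpha_i$:

```python
import numpy as np

def eta_point(x_t, alpha):
    """A point of the data submanifold D_t in eta-coordinates:
    hat-eta_0 = x_t, hat-eta_{1i} = alpha_i, hat-eta_{2i} = x_t * alpha_i,
    where alpha are the free parameters for the unobserved delta_i(z_t)."""
    alpha = np.asarray(alpha, dtype=float)
    return np.concatenate(([x_t], alpha, x_t * alpha))  # length 2K + 1
```

For fixed $x_t$ the map $\alpha \mapsto \pmb{\hat \eta}$ is affine, so convex combinations of the $\alpha$'s map to convex combinations of the $\eta$-points — which is what I take "linear submanifold in $\eta$" to mean — while the whole submanifold shifts with $x_t$.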
Here's what I'm confused about: this suggests that a point on our product manifold has $(2 K + 1) T$ parameters, because there are $2 K + 1$ components of $\pmb{\hat \eta}$ for each $D_t$, and there are $T$ of the $D_t$'s. But I know that our answer should eventually have just $\sim 3 K$ parameters, because each of the $K$ Gaussians has $2$ parameters and there are $K - 1$ free $\tau_k$ (they sum to $1$). How is that possible?
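Writing my count out explicitly (my own arithmetic, for scalar $x_t$, so each Gaussian contributes a mean and a variance):

$$\underbrace{(2K+1)\,T}_{\text{coordinates of a point of } D_T^\ast} \quad \text{vs.} \quad \underbrace{2K}_{\mu_k,\ \sigma_k^2} + \underbrace{K-1}_{\tau_k,\ \sum_k \tau_k = 1} = 3K - 1 \sim 3K.$$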
Also, am I correct about the explicit definitions of $S$, $M$, and $D$?