Averaged log-likelihood with a latent variable for mixture models


In class we've defined the following:

$$Q(\theta; \theta^t) = \sum_z P(Z=z\mid X=x; \theta^t) \log P(X=x, Z=z;\theta)$$

It's part of the EM algorithm. Here, $\theta^t$ are the assumed parameters at time $t$.

I just wonder, why wasn't it defined as conditional probability, like this: $$\log P(X=x \mid Z=z;\theta)$$

I'm looking for the intuition here, other than "it's the definition".

Thanks

I'm not sure there is a truly intuitive story behind it, but I think that if you understand the derivation of the EM algorithm, you will see why the terms must be as they are.

First, in the standard maximum likelihood problem, we have a set of data, say $X = (x_1, \dots, x_N)$ of size $N$, and some density function $p(x|\theta)$ with parameter vector $\theta$. We assume the data have been drawn from this distribution. Assuming the data vectors are independent, we can write down their joint distribution:

$$ p(X | \theta) = \prod_{i=1}^N p(x_i | \theta) = L(\theta) $$

which is also known as the likelihood function when it is taken as a function of $\theta$. The MLE estimate is then the $\theta$ that maximises the likelihood function (makes the data we observed most probable):

$$ \theta_{\text{MLE}} = \arg \max_{\theta}L(\theta) $$

or as is often done, we maximise the log likelihood:

$$ \theta_{\text{MLE}} = \arg \max_{\theta} \log L(\theta) $$
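As a concrete numerical sketch (my own toy example, not from the question), here is the log-likelihood maximisation for a single Gaussian, where the argmax is available in closed form as the sample mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data drawn from a Gaussian with mu = 2.0, sigma = 1.5.
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

def log_likelihood(theta, x):
    """Gaussian log L(theta) = sum_i log p(x_i | mu, sigma)."""
    mu, sigma = theta
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

# For a Gaussian the MLE is available in closed form:
mu_mle = data.mean()
sigma_mle = data.std()
```

Any perturbation of `mu_mle` or `sigma_mle` lowers `log_likelihood`, which is what "makes the observed data most probable" means operationally.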

Now, let's consider the case where we need the EM algorithm. In these problems we have a hidden variable (Z) that makes things a bit tricky. One way to think of this is to think about our data containing missing values. We again have our data, $X$, which we now call the 'incomplete data'. We then assume that there exists some complete data, $Y = (X, Z)$, and we further specify a joint distribution function:

$$ p(y|\theta) = p(x,z| \theta) = p(z|x, \theta)p(x|\theta) $$

The objective for this problem is still the same: we want to maximise the (incomplete) log-likelihood, $\log p(X|\theta)$. Note that by the Law of Total Probability (and, without loss of generality, assuming $Z$ is discrete):

$$ p(X|\theta) = \sum_{z} p(X,Z|\theta) = \sum_{z} p(X|Z, \theta) p(Z|\theta). $$
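This marginalisation is easy to see numerically. Below is a minimal sketch for a hypothetical two-component Gaussian mixture (all parameter values are made up for illustration):

```python
import numpy as np

# Hypothetical two-component Gaussian mixture: Z takes values {0, 1}.
weights = np.array([0.3, 0.7])      # p(Z = z | theta)
means = np.array([-1.0, 2.0])       # per-component means
sigmas = np.array([0.5, 1.0])       # per-component std deviations

def gauss_pdf(x, mu, sigma):
    """Density p(x | z, theta) of each Gaussian component."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def incomplete_likelihood(x):
    # Law of total probability: p(x|theta) = sum_z p(x|z,theta) p(z|theta)
    return np.sum(weights * gauss_pdf(x, means, sigmas))
```

The sum over $z$ inside the logarithm is exactly what makes $\log p(X|\theta)$ awkward to maximise directly.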

Now, maximising this marginal likelihood is difficult (the sum inside the logarithm means it is generally not a nice convex/concave optimisation problem); however, it turns out that working with $p(X,Z|\theta)$ is much easier.

Now, we introduce a distribution, $q(Z)$ over the latent variables. For any such distribution $q$, the following decomposition must hold:

$$ \log p(X|\theta) = \mathcal{L}(q, \theta) + \text{KL}(q||p) $$

where $$ \mathcal{L}(q, \theta) = \sum_{z} q(Z) \log \frac{p(X,Z|\theta)}{q(Z)}, \qquad \text{KL}(q || p) = - \sum_{z} q(Z) \log \frac{p(Z|X, \theta)}{q(Z)} $$

Now, you should know that $\text{KL}(q || p) \ge 0$ with equality only when $p = q$, and so it must be the case that:

$$ \mathcal{L}(q, \theta) \le \log p(X|\theta) $$

and $\mathcal{L}(q, \theta) = \log p(X|\theta)$ if and only if $\text{KL}(q \| p) = 0$, i.e. $q(Z) = p(Z|X, \theta)$.

This is why we refer to $\mathcal{L}$ as a lower bound. The idea is then somewhat simple: we want to maximise something that is hard to maximise directly, so we go through a two-step iterative process.
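The decomposition $\log p(X|\theta) = \mathcal{L}(q,\theta) + \text{KL}(q\|p)$ can be verified numerically on a toy mixture with a single data point (all numbers below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical two-component mixture with unit variances.
weights = np.array([0.4, 0.6])
means = np.array([0.0, 3.0])
x = 1.2                                        # a single observed data point

def gauss_pdf(x, mu):
    return np.exp(-(x - mu)**2 / 2) / np.sqrt(2 * np.pi)

joint = weights * gauss_pdf(x, means)          # p(x, z | theta) for z = 0, 1
log_px = np.log(joint.sum())                   # log p(x | theta)
posterior = joint / joint.sum()                # p(z | x, theta)

q = np.array([0.9, 0.1])                       # an arbitrary distribution over z
lower_bound = np.sum(q * np.log(joint / q))    # L(q, theta)
kl = -np.sum(q * np.log(posterior / q))        # KL(q || p)
```

For any choice of `q`, `lower_bound + kl` equals `log_px` exactly, and `kl` is non-negative, so `lower_bound` can never exceed `log_px`.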

  1. E-Step: take the current parameter vector $\theta^t$ and keep it fixed while maximising the lower bound $\mathcal{L}(q, \theta^t)$ with respect to $q$. Since $\mathcal{L}(q, \theta^t) = \log p(X|\theta^t) - \text{KL}(q\|p)$ and the first term does not involve $q$, maximising $\mathcal{L}$ is equivalent to minimising the KL term, which is achieved by setting $q(Z) = p(Z|X, \theta^t)$.

  2. M-Step: hold $q$ fixed and maximise $\mathcal{L}$ with respect to $\theta$, yielding $\theta^{t+1}$. Unless the lower bound is already at its maximum, this must cause an increase in the lower bound, which by definition must cause $\log p(X|\theta^{t+1})$ to increase as well. Note that here $q$ is based on the old parameters $\theta^t$, so in general $q(Z) \neq p(Z|X,\theta^{t+1})$; the KL term is therefore no longer zero, and the increase in $\log p(X|\theta)$ is at least as large as the increase in the lower bound.
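The two steps above can be sketched end-to-end for a two-component Gaussian mixture with unit variances held fixed (a simplifying assumption of mine, to keep the sketch short). The guaranteed monotone increase of $\log p(X|\theta)$ is checked at the end:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data from a hypothetical two-component mixture (300 + 700 points).
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

# Initial guesses for the mixture weights and means.
w, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])

def log_lik(w, mu):
    """Incomplete-data log-likelihood log p(X | theta)."""
    joint = w * np.exp(-(data[:, None] - mu)**2 / 2) / np.sqrt(2 * np.pi)
    return np.log(joint.sum(axis=1)).sum()

history = [log_lik(w, mu)]
for _ in range(50):
    # E-step: set q(Z) to the posterior p(Z | X, theta^t) ("responsibilities").
    joint = w * np.exp(-(data[:, None] - mu)**2 / 2) / np.sqrt(2 * np.pi)
    resp = joint / joint.sum(axis=1, keepdims=True)
    # M-step: maximise Q, the expected complete-data log-likelihood,
    # which for this model has closed-form updates.
    w = resp.mean(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)
    history.append(log_lik(w, mu))
```

Every entry of `history` is at least as large as the previous one, which is the monotonicity argument of the M-step made concrete.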

To see the connection between this formulation of the EM algorithm and the auxiliary (Q) function you've got, note that at the end of the E step, we have that $q(Z) = p(Z|X, \theta^t)$, and so:

\begin{align*} \mathcal{L}(q, \theta^{t+1}) &= \sum_{z} p(Z|X, \theta^t) \log \frac{p(X,Z| \theta^{t+1})}{p(Z|X, \theta^t)} \\ &= \sum_{z} p(Z|X, \theta^t) \log p(X,Z| \theta^{t+1}) - \sum_{z} p(Z|X, \theta^t) \log p(Z|X, \theta^t)\\ &= \sum_{z} p(Z|X, \theta^t) \log p(X,Z| \theta^{t+1}) + \text{const}\\ &= Q(\theta^{t+1}, \theta^t) + \text{const} \end{align*}

where the last step holds because the second term is a constant with respect to $\theta^{t+1}$ and so plays no role in the maximisation carried out in the M step. So in the M step, when we maximise the lower bound holding $q$ fixed at the posterior $p(Z|X, \theta^{t})$, we are in fact just maximising the expectation of the 'complete-data' log-likelihood, i.e. $Q(\cdot, \cdot)$.

So, in short, $Q$ must involve the joint $\log p(X, Z|\theta)$ rather than the conditional $\log p(X|Z, \theta)$ because that is the form that falls out of the decomposition of the thing we are actually trying to maximise: $\log p(X| \theta)$.

Reference: Bishop, *Pattern Recognition and Machine Learning*, Chapter 9.