In class we've defined the following:
$$Q(\theta; \theta^t) = \sum_z P(Z=z\mid X=x; \theta^t) \log P(X=x, Z=z;\theta)$$
It's part of the EM algorithm. Here, $\theta^t$ are the assumed parameters at time $t$.
I just wonder, why wasn't it defined as conditional probability, like this: $$\log P(X=x \mid Z=z;\theta)$$
I'm looking for the intuition here, other than "it's the definition".
Thanks
I'm not sure there is a truly intuitive story behind it, but if you understand the derivation of the EM algorithm you will see why the terms have to be as they are:
First, in the standard maximum likelihood problem we have a set of data, $X = (x_1, \dots, x_N)$, say of size $N$, and some density function $p(x|\theta)$ with parameter vector $\theta$. We assume that the data have been drawn from this distribution. Assuming the data vectors are independent, we can write down their joint distribution:
$$ p(X | \theta) = \prod_{i=1}^N p(x_i | \theta) = L(\theta) $$
which is also known as the likelihood function when it is taken as a function of $\theta$. The MLE estimate is then the $\theta$ that maximises the likelihood function (makes the data we observed most probable):
$$ \theta_{\text{MLE}} = \arg \max_{\theta}L(\theta) $$
or as is often done, we maximise the log likelihood:
$$ \theta_{\text{MLE}} = \arg \max_{\theta} \log L(\theta) $$
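As a quick numerical sanity check (a toy example of my own, not from any lecture): for i.i.d. Gaussian data with known variance, the $\theta$ that maximises the log-likelihood is the sample mean, which even a crude grid search recovers:

```python
import math

def gaussian_log_likelihood(data, mu, sigma=1.0):
    """Log-likelihood of i.i.d. Gaussian data with mean mu and std sigma."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in data)

data = [1.2, 0.8, 1.5, 0.9, 1.1]        # made-up observations
# Grid search over candidate means; the maximiser should be the sample mean.
candidates = [i / 100 for i in range(0, 301)]
mle = max(candidates, key=lambda mu: gaussian_log_likelihood(data, mu))
sample_mean = sum(data) / len(data)
```

(For a Gaussian with known variance the MLE of the mean has this closed form; the grid search is just to illustrate the "arg max" reading of the definition.)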
Now, let's consider the case where we need the EM algorithm. In these problems we have a hidden variable $Z$ that makes things a bit tricky. One way to think of this is to think of our data as containing missing values. We again have our data, $X$, which we now call the 'incomplete data'. We then assume that there exists some complete data, $Y = (X, Z)$, and we further specify a joint distribution function:
$$ p(y|\theta) = p(x,z| \theta) = p(z|x, \theta)p(x|\theta) $$
The objective for this problem is still the same: we want to maximise the (incomplete-data) log-likelihood, $\log p(X|\theta)$. Note that by the Law of Total Probability (and, without loss of generality, assuming $Z$ is discrete):
$$ p(X|\theta) = \sum_{z} p(X,Z|\theta) = \sum_{z} p(X|Z, \theta) p(Z|\theta). $$
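To make the marginalisation concrete, here is a sketch with a toy two-component Gaussian mixture (all numbers made up for illustration), where the incomplete-data likelihood is exactly the sum over $z$ of $p(x|z,\theta)\,p(z|\theta)$:

```python
import math

# Toy 2-component Gaussian mixture with unit variances; all values assumed.
pi = [0.3, 0.7]           # p(Z = z | theta), the mixing weights
mu = [0.0, 4.0]           # component means

def normal_pdf(x, m):
    """Density of N(m, 1) at x."""
    return math.exp(-0.5 * (x - m)**2) / math.sqrt(2 * math.pi)

def incomplete_likelihood(x):
    # p(x | theta) = sum_z p(x | z, theta) p(z | theta)
    return sum(pi[z] * normal_pdf(x, mu[z]) for z in range(2))
```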
Now, maximising this quantity is difficult (the sum inside the logarithm means the problem is no longer nicely behaved); however, it turns out that optimising $p(X,Z|\theta)$ is much easier.
Now, we introduce a distribution, $q(Z)$ over the latent variables. For any such distribution $q$, the following decomposition must hold:
$$ \log p(X|\theta) = \mathcal{L}(q, \theta) + \text{KL}(q||p) $$
where $$ \mathcal{L}(q, \theta) = \sum_{z} q(Z) \log \frac{p(X,Z|\theta)}{q(Z)}, \qquad \text{KL}(q || p) = - \sum_{z} q(Z) \log \frac{p(Z|X, \theta)}{q(Z)} $$
Now, you should know that $\text{KL}(q || p) \ge 0$ with equality only when $p = q$, and so it must be the case that:
$$ \mathcal{L}(q, \theta) \le \log p(X|\theta) $$
and $\mathcal{L}(q, \theta) = \log p(X|\theta)$ if and only if $\text{KL}(q || p) = 0$, i.e. $q(Z) = p(Z|X, \theta)$.
This is why we refer to $\mathcal{L}$ as a lower bound. The idea is then fairly simple: the thing we want to maximise is hard to maximise directly, so we go through a two-step iterative process.
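The decomposition is also easy to verify numerically. A minimal sketch with an assumed toy joint over a binary latent $Z$ (the numbers are arbitrary): for any $q$, $\mathcal{L}(q,\theta) + \text{KL}(q||p)$ recovers $\log p(X|\theta)$, and the bound is tight exactly when $q$ equals the posterior:

```python
import math

# Assumed toy joint p(X=x, Z=z | theta) for a single fixed observation x,
# with z in {0, 1}.
p_joint = [0.1, 0.3]
p_x = sum(p_joint)                         # p(X=x | theta), by marginalisation
p_post = [pj / p_x for pj in p_joint]      # p(Z | X=x, theta), the posterior

def lower_bound(q):
    # L(q, theta) = sum_z q(z) log( p(x, z | theta) / q(z) )
    return sum(q[z] * math.log(p_joint[z] / q[z]) for z in range(2))

def kl(q):
    # KL(q || p) = -sum_z q(z) log( p(z | x, theta) / q(z) )
    return -sum(q[z] * math.log(p_post[z] / q[z]) for z in range(2))

q = [0.5, 0.5]   # an arbitrary distribution over Z
# Decomposition: log p(X=x | theta) = L(q, theta) + KL(q || p)
```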
E-Step: Take the current parameter vector $\theta^t$ and keep it fixed while maximising the lower bound $\mathcal{L}(q, \theta^t)$ with respect to $q$. Since $\mathcal{L}(q, \theta^t) = \log p(X|\theta^t) - \text{KL}(q||p)$ and the first term doesn't involve $q$, maximising $\mathcal{L}$ is the same as minimising the KL term, which is achieved by setting $q = p$.
M-Step: Hold $q$ fixed and maximise $\mathcal{L}$ with respect to $\theta$. Unless the lower bound is already at its maximum, this must cause an increase in the lower bound, which by the decomposition must cause $\log p(X|\theta^{t+1})$ to increase as well. Note that here $q$ is based on the old parameters $\theta^t$, so in general $q(Z) \neq p(Z|X,\theta^{t+1})$; the KL term is therefore no longer zero, and so the increase in $\log p(X|\theta)$ is at least as large as the increase in the lower bound.
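The two steps can be sketched end to end on a toy two-component Gaussian mixture with known unit variances (my own made-up data; for this model the M step has a closed form). The point of the demo is the monotonicity argument above: the incomplete-data log-likelihood should never decrease across iterations.

```python
import math

def normal_pdf(x, m):
    """Density of N(m, 1) at x."""
    return math.exp(-0.5 * (x - m)**2) / math.sqrt(2 * math.pi)

def log_likelihood(data, pi, mu):
    """Incomplete-data log-likelihood log p(X | theta)."""
    return sum(math.log(pi * normal_pdf(x, mu[0]) + (1 - pi) * normal_pdf(x, mu[1]))
               for x in data)

def em_step(data, pi, mu):
    # E step: responsibilities r_i = p(Z=0 | x_i, theta^t), the posterior q.
    r = [pi * normal_pdf(x, mu[0]) /
         (pi * normal_pdf(x, mu[0]) + (1 - pi) * normal_pdf(x, mu[1]))
         for x in data]
    # M step: maximise the expected complete-data log-likelihood;
    # for this model the maximiser is a weighted average.
    n0 = sum(r)
    new_pi = n0 / len(data)
    new_mu = [sum(ri * x for ri, x in zip(r, data)) / n0,
              sum((1 - ri) * x for ri, x in zip(r, data)) / (len(data) - n0)]
    return new_pi, new_mu

data = [-1.2, -0.8, -1.0, 3.9, 4.1, 4.0]   # two well-separated clusters
pi, mu = 0.5, [0.0, 1.0]                   # deliberately poor initialisation
lls = []
for _ in range(30):
    lls.append(log_likelihood(data, pi, mu))
    pi, mu = em_step(data, pi, mu)
lls.append(log_likelihood(data, pi, mu))
```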
To see the connection between this formulation of the EM algorithm and the auxiliary (Q) function you've got, note that at the end of the E step, we have that $q(Z) = p(Z|X, \theta^t)$, and so:
\begin{align*} \mathcal{L}(q, \theta^{t+1}) &= \sum_{z} p(Z|X, \theta^t) \log \frac{p(X,Z| \theta^{t+1})}{p(Z|X, \theta^t)} \\ &= \sum_{z} p(Z|X, \theta^t) \log p(X,Z| \theta^{t+1}) - \sum_{z} p(Z|X, \theta^t) \log p(Z|X, \theta^t)\\ &= \sum_{z} p(Z|X, \theta^t) \log p(X,Z| \theta^{t+1}) + \text{const}(\theta^{t+1})\\ &= Q(\theta^{t+1}, \theta^t) + \text{const}(\theta^{t+1}) \end{align*}
where the last step holds because the second term is constant with respect to $\theta^{t+1}$ and so plays no role in the maximisation carried out in the M step. So in the M step, when we maximise the lower bound holding $q$ fixed at the posterior $p(Z|X, \theta^{t})$, we are in fact just maximising the expected 'complete-data' log-likelihood, i.e. $Q(\cdot, \cdot)$.
So, in short, we need $Q$ to be of this form since this is the form of the decomposition of the thing we are trying to maximise: $\log p(X| \theta)$.
Reference: Bishop, *Pattern Recognition and Machine Learning* (PRML), Chapter 9.