Machine Learning/Statistics - Factor Analysis Proof, Stuck


I'm currently following these notes from Stanford's CS229 machine learning class on Factor Analysis. I followed every point except for the following:

http://cs229.stanford.edu/notes/cs229-notes9.pdf, pages 7-8.

It states that $\mu = E_{z|x}[z]$ after the maximum-likelihood derivation.

For my attempt at the proof, consider equation (6) on page 7:

The only terms that depend on $\mu$ are: $\sum_n\sum_{z_n}p(z_n|x_n)\left[-\frac{1}{2}(x_n-\mu-\Lambda z_n)^T\Psi^{-1}(x_n-\mu-\Lambda z_n)\right]$,

or, expanding and keeping only the $\mu$-dependent terms: $\sum_n\sum_{z_n}p(z_n|x_n)\left[-\frac{1}{2}\left(-2x_n^T\Psi^{-1}\mu+\mu^T\Psi^{-1}\mu + 2z_n^T\Lambda^T\Psi^{-1}\mu\right)\right]$

Taking the gradient with respect to $\mu$, we can then express this as an expectation:

$\sum_nE_{z|x}[-(\Psi^{-1}\mu + \Psi^{-1}(\Lambda z_n + x_n))] = 0$

Then we get:

$\sum_n\mu = \sum_n\left(x_n + E[\Lambda z_n]\right)$

Thus we get:

$\mu = \frac{1}{N}\sum_n\left(x_n + E[\Lambda z_n]\right)$.

The answer in the notes says that it is actually:

$\mu = \frac{1}{N}\sum_n x_n$. So I am a little bit off, and I can't see exactly where the expected value can be dropped.

Is there some argument showing that $E(\Lambda z_n) = 0$?
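
For what it's worth, here is a quick numerical sketch of my confusion (NumPy; the model dimensions and parameter values are made up purely for illustration). It shows that the marginal mean $E[z_n]$ is indeed zero under the prior, but the posterior mean $E[z_n|x_n]$ appearing in the EM objective is generally not:

```python
import numpy as np

# Sketch of the factor analysis generative model from the notes:
#   z ~ N(0, I_k),   x = mu + Lambda z + eps,   eps ~ N(0, Psi).
# Dimensions and parameter values are arbitrary, chosen for illustration only.
rng = np.random.default_rng(0)
d, k, N = 4, 2, 10_000

mu = rng.normal(size=d)
Lam = rng.normal(size=(d, k))
Psi = np.diag(rng.uniform(0.5, 1.5, size=d))  # diagonal noise covariance

z = rng.normal(size=(N, k))
eps = rng.multivariate_normal(np.zeros(d), Psi, size=N)
x = mu + z @ Lam.T + eps

# Posterior mean from the notes: E[z|x] = Lambda^T (Lambda Lambda^T + Psi)^{-1} (x - mu)
S = Lam @ Lam.T + Psi
Ez_given_x = (x - mu) @ np.linalg.inv(S) @ Lam  # row n is E[z_n | x_n]^T

print("marginal mean of z (close to 0):", z.mean(axis=0))
print("posterior mean E[z|x] for one sample (not 0):", Ez_given_x[0])
```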

Best Answer

First, your notation $\sum_{z_n}$ suggests that $z_n$ has discrete support; it is better written either with the usual expectation symbol or as an integral.

Also, I think a minus sign has gone missing somewhere in your expression. Taking the derivative through the expectation, differentiating the quadratic form, and setting the result equal to zero, I get \begin{align} 0 &= \sum_i \mathbb{E} \left( \mathbf{x}^{(i)} - \mu - \Lambda \mathbf{z}^{(i)} \right)^T \Psi^{-1} \\ &= \sum_i \left( \mathbf{x}^{(i)} - \mu - \Lambda \mu_{z^{(i)}|x^{(i)}}\right)^T\Psi^{-1}. \end{align}

Using the posterior mean from the notes, $\mu_{z^{(i)}|x^{(i)}} = \Lambda^T\left(\Lambda\Lambda^T + \Psi\right)^{-1}\left(\mathbf{x}^{(i)} - \mu\right)$, you can write this system as \begin{align} \sum_i (\mathbf{x}^{(i)} - \mu) &= \sum_i \Lambda \mu_{z^{(i)}|x^{(i)}} \\ &= \sum_i \Lambda \Lambda^{T}\left( \Lambda \Lambda^T + \Psi \right)^{-1} (\mathbf{x}^{(i)} - \mu), \end{align} or, writing $\mathbf{y} = \sum_i (\mathbf{x}^{(i)} - \mu)$, $$ \mathbf{y} = \Lambda \Lambda^{T}\left(\Lambda \Lambda^T + \Psi \right)^{-1} \mathbf{y}. $$

Now let $C = \Lambda \Lambda^T$. Then $$ \left(C (C + \Psi)^{-1} - I \right) \mathbf{y} = \mathbf{0} $$ has only the trivial solution $\mathbf{y} = \mathbf{0}$ if $\mbox{det}(C (C+\Psi)^{-1} - I) \neq 0$. But $$ \begin{align} C(C+\Psi)^{-1} - I &= C(C+\Psi)^{-1} - (C+\Psi)(C+\Psi)^{-1} \\ &= \left( C - C - \Psi \right) (C + \Psi)^{-1} \\ &= -\Psi (C+\Psi)^{-1}, \end{align} $$ and presumably at some earlier point we assumed that the noise covariance matrix is nonsingular, so that $\mbox{det}(\Psi) \neq 0$. Therefore the only solution is the trivial one, and we can conclude that $$ \sum_i \mathbf{x}^{(i)} - N \mu = \mathbf{0}, \quad \text{i.e.} \quad \mu = \frac{1}{N}\sum_i \mathbf{x}^{(i)}. $$
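
To double-check the algebra, here is a minimal numerical sketch (NumPy; random $\Lambda$ and a diagonal $\Psi$ chosen only for illustration) verifying the matrix identity above and confirming that the stationarity condition holds only at the sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, N = 5, 2, 1000

Lam = rng.normal(size=(d, k))
Psi = np.diag(rng.uniform(0.5, 2.0, size=d))  # nonsingular noise covariance
C = Lam @ Lam.T
A = C @ np.linalg.inv(C + Psi)

# Identity: C (C + Psi)^{-1} - I = -Psi (C + Psi)^{-1}
print(np.allclose(A - np.eye(d), -Psi @ np.linalg.inv(C + Psi)))  # True

# Its determinant is nonzero, so (A - I) y = 0 forces y = 0
print(abs(np.linalg.det(A - np.eye(d))) > 1e-12)  # True

# The fixed-point equation y = A y, with y = sum_i (x_i - mu), therefore
# holds only when mu is the sample mean:
x = rng.normal(size=(N, d))  # any data set will do for this check
mu_hat = x.mean(axis=0)
y_at_mean = (x - mu_hat).sum(axis=0)
print(np.allclose(y_at_mean, A @ y_at_mean))  # True (both sides are ~0)

mu_off = mu_hat + 0.1  # any other candidate for mu
y_off = (x - mu_off).sum(axis=0)
print(np.allclose(y_off, A @ y_off))  # False
```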