Plain-English interpretation of a sentence needed to understand the EM algorithm?

I am trying to read an article about the EM algorithm on the web. However, as soon as I started, I ran into trouble interpreting the phrase "... in the presence of missing or hidden data" in the paragraph I was reading (attached as an image in the original post).

What does the author mean by "missing or hidden data"? Some articles say that missing or hidden data means incomplete data, but that confuses me even more.

There are 2 best solutions below

Here's an example of missing data. Suppose $X_1,\ldots,X_n \sim \text{ i.i.d. } N_d(0\in\mathbb R^d, V\in\mathbb R^{d\times d})$. This is a $d$-dimensional normal (or "Gaussian") distribution. The variance $V$ is a $d\times d$ non-negative-definite symmetric matrix:

$$ \operatorname{var}(X_1) = \operatorname{E}\Big( (X_1-0)(X_1-0)^T \Big) \in \mathbb R^{d\times d}. $$

(I subtract $0$ just as a reminder that one subtracts the expected value.)

It is desired to estimate $V$. It is known that the maximum-likelihood estimate when there is no missing data is the sample variance

$$ \frac 1 n \sum_{i=1}^n (X_i - 0)(X_i-0)^T \in \mathbb R^{d\times d}. $$

That is the value of $V$ that maximizes the likelihood function (written here assuming $V$ is invertible)

$$ V\mapsto L(V) = \prod_{i=1}^n \frac 1 {(2\pi)^{d/2}\sqrt{\det V}} \exp\left( -\frac 1 2 (x_i - 0)^T V^{-1} (x_i-0) \right). $$

Now suppose that in some of the vectors $X_1,\ldots,X_n$ some of the components are not reported. We have, for example,

$$ X_3 = \left[ \begin{array}{r} 5.2 \\ 4.3 \\ -2.1 \\ \text{?} \\ 8.6 \end{array} \right]. $$

The value of the fourth component is not known. That is missing data. We may know the value of the fourth component for every $i\ne3$, but not that one. The distribution of the other components is still determined by the same probability density function (integrate out the missing component to get the density of the observed ones). Maximizing the likelihood function in that context generally has no closed-form solution, so numerical methods such as the EM algorithm must be used.
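To make the last point concrete, here is a minimal sketch of how EM could be applied to this zero-mean setup. It is an illustration under stated assumptions, not a reference implementation: the function name `em_covariance`, the use of `NaN` to mark unreported components, the diagonal starting value, and the stopping rule are all choices made for this example. The E-step replaces each missing block by its conditional mean given the observed components and adds the conditional covariance; the M-step averages the resulting "completed" outer products.

```python
import numpy as np

def em_covariance(X, n_iter=100, tol=1e-8):
    """EM estimate of V for i.i.d. N_d(0, V) rows of X.

    Assumptions for this sketch: NaN marks a missing component, and
    every component is observed (with a nonzero value) in some row.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    missing = np.isnan(X)
    # Starting value: diagonal matrix of per-component mean squares.
    V = np.diag(np.nanmean(X**2, axis=0))
    for _ in range(n_iter):
        S = np.zeros((d, d))
        for i in range(n):
            m = missing[i]           # mask of missing components
            o = ~m                   # mask of observed components
            x = np.where(m, 0.0, X[i])
            C = np.zeros((d, d))     # conditional covariance, padded with zeros
            if m.any() and o.any():
                V_oo = V[np.ix_(o, o)]
                V_mo = V[np.ix_(m, o)]
                # B = V_mo V_oo^{-1}, computed via a linear solve.
                B = np.linalg.solve(V_oo, V_mo.T).T
                # E-step: conditional mean and covariance of the missing part.
                x[m] = B @ X[i, o]
                C[np.ix_(m, m)] = V[np.ix_(m, m)] - B @ V_mo.T
            elif m.any():
                # Nothing observed in this row: the conditional law is N(0, V).
                C = V.copy()
            S += np.outer(x, x) + C
        # M-step: average of the expected outer products.
        V_new = S / n
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V
```

With no missing entries the loop settles on the sample variance $\frac 1 n \sum_i X_i X_i^T$ after one update, which matches the closed-form estimate above. With entries such as the `?` in $X_3$ encoded as `np.nan`, the iteration generally needs several passes.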

Here's another example: The correlation between a man's height at the age of $25$ and his parents' heights at the age of $25$ is to be examined. Data are available for $100$ men. In every case we find the man's height and in most cases we find the heights of both parents. But in some cases the height of only one of the two parents is reported. That is missing data.

It just means that at least one feature of some sample has no recorded value.