Why is the 2-coin toss example used in expectation maximization not just Binomial?


I'm learning about expectation-maximization, and a classic example is the two coins example:

  • First, toss coin 0 with probability $\lambda$ of success.
  • If coin 0 was a success, then toss coin 1 with probability $p_1$ of success.
  • Otherwise toss coin 2 with probability $p_2$ of success.

Do this $n$ times to produce $x_1,...,x_n := X$.
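This generative process can be sketched as a short simulation (the function name and parameter values below are illustrative, not from the original post):

```python
import random

def simulate(n, lam, p1, p2, rng):
    """Draw n samples from the two-coin mixture.

    Each trial: toss coin 0 (success probability lam); on success
    toss coin 1 (probability p1), otherwise toss coin 2 (probability p2).
    Only the coin-1/coin-2 outcomes are recorded; the coin-0 outcomes
    are the 'missing data' in the EM formulation.
    """
    xs = []
    for _ in range(n):
        if rng.random() < lam:               # coin 0 succeeded -> use coin 1
            xs.append(1 if rng.random() < p1 else 0)
        else:                                # coin 0 failed -> use coin 2
            xs.append(1 if rng.random() < p2 else 0)
    return xs

rng = random.Random(0)
x = simulate(100_000, lam=0.3, p1=0.8, p2=0.1, rng=rng)
# Empirical success rate should be near 0.3*0.8 + 0.7*0.1 = 0.31.
print(sum(x) / len(x))
```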

The example then introduces 'missing data', namely which coin each sample was drawn from, and proceeds via the law of total probability.

My question is: is it possible to skip this and claim that the above experimental procedure is Binomial with $n$ trials and success probability $\lambda p_1+(1-\lambda)p_2$? If so, how could one justify this formally?

EDIT (thinking 'out loud' here): I know that two consecutive Bernoulli trials compose into a Bernoulli with success probability the product of their probabilities, so if we only had coins 0 and 1 we would have a Binomial with success probability $\lambda p_1$. I would have said the same for $(1-\lambda)p_2$ if it were $\lambda$ instead of $1-\lambda$, in which case $\lambda p_1+\lambda p_2$ looks like a sum of Binomials, which would be Binomial (if $p_2=p_1$). However, since we have $1-\lambda$, and in any case $p_2\neq p_1$, I'm starting to think that perhaps the answer is no, and that the only case in which that would be guaranteed is $\lambda p_1 = (1-\lambda)p_2$. I might be way off here with my intuition... still not formally refuted, though.

Thanks.

Best Answer

Let $X$ represent the number of successes observed from coins $1$ and $2$ out of a total of $n$ trials, each trial beginning with coin $0$ but with coin $0$'s outcome not counted in $X$. Let $Y_i$ represent the $i^{\rm th}$ toss of coin $0$, where $\Pr[Y_i = 1] = \lambda$ is its probability of success. Then if $X_i$ is the outcome of the $i^{\rm th}$ trial, we must have $$X_i \mid Y_i \sim \operatorname{Bernoulli}\bigl(p = Y_i p_1 + (1-Y_i) p_2\bigr); \\ \Pr[X_i = 1 \mid Y_i] = Y_i p_1 + (1-Y_i) p_2.$$ That is, the outcome of the $i^{\rm th}$ trial is conditionally Bernoulli, with success parameter $p_1$ if $Y_i = 1$ and $p_2$ if $Y_i = 0$.

Then by the law of total probability, the marginal/unconditional distribution of $X_i$ is

$$\begin{align} \Pr[X_i = 1] &= \Pr[X_i = 1 \mid Y_i = 0]\Pr[Y_i = 0] + \Pr[X_i = 1 \mid Y_i = 1]\Pr[Y_i = 1] \\ &= p_2 (1-\lambda) + p_1 \lambda. \end{align}$$

Note that the unconditional distribution of $X_i$ must also be Bernoulli, but its success parameter is $p_2 (1-\lambda) + p_1 \lambda$ as we found above. And because these are IID, the distribution of $X$ is simply binomial with $n$ trials and the same success parameter.
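This can be checked numerically: repeating the $n$-trial experiment many times, the success counts should have the mean $np$ and variance $np(1-p)$ of a $\operatorname{Binomial}(n, \lambda p_1 + (1-\lambda)p_2)$ distribution. A sketch, with parameter values chosen arbitrarily for illustration:

```python
import random
from statistics import mean, pvariance

def count_successes(n, lam, p1, p2, rng):
    # One run of the experiment: coin 0 is tossed afresh on EVERY trial,
    # then coin 1 or coin 2 is tossed depending on its outcome.
    c = 0
    for _ in range(n):
        p = p1 if rng.random() < lam else p2
        c += rng.random() < p
    return c

rng = random.Random(1)
n, lam, p1, p2 = 20, 0.3, 0.8, 0.1
p = lam * p1 + (1 - lam) * p2               # 0.31, the mixture success probability
counts = [count_successes(n, lam, p1, p2, rng) for _ in range(50_000)]
print(mean(counts), n * p)                  # sample mean vs. binomial mean 6.2
print(pvariance(counts), n * p * (1 - p))   # sample variance vs. 4.278
```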

The fact that $X$ is binomial necessarily follows from the fact that the $X_i$ are (unconditionally) independent and identically distributed. To illustrate, if the experiment had instead been to flip coin $0$ once, and use this single result to determine which of the two coins to flip for all of the $n$ trials that comprise $X$, then we would have $$X \mid Y_1 \sim \operatorname{Binomial}\bigl(n, p = Y_1 p_1 + (1-Y_1)p_2\bigr),$$ and this is very different from the previous situation, since now by the law of total probability, $$\begin{align} \Pr[X = x] &= \Pr[X = x \mid Y_1 = 0] \Pr[Y_1 = 0] + \Pr[X = x \mid Y_1 = 1]\Pr[Y_1 = 1] \\ &= \binom{n}{x} p_2^x (1-p_2)^{n-x} (1-\lambda) + \binom{n}{x} p_1^x (1-p_1)^{n-x} \lambda \\ &= \binom{n}{x} \left( (1-\lambda) p_2^x (1-p_2)^{n-x} + \lambda p_1^x (1-p_1)^{n-x} \right), \end{align}$$

and this distribution is not binomial in general. This occurs because in this version of the experiment, the unconditional $X_i$ are not independent; they are correlated through the outcome of coin $0$. To be clear, the $X_i \mid Y_1$ are IID, because once you flip coin $0$ and know whether you will be using coin $1$ or coin $2$, those tosses are IID. But you can see that the unconditional distribution does not work this way by imagining that a third party has flipped coin $0$, observed the result, and on that basis handed you either coin $1$ or coin $2$ accordingly. If the coins are not visually distinguishable but their (different) success parameters are known to you, then each flip is informative about which coin you were given, and hence about the likely outcomes of the remaining flips. In the original scenario, by contrast, each trial is fully independent (coin $0$ is tossed afresh each time), so the outcome of one trial does not affect the outcome of any other.
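The non-binomial behavior of this variant shows up clearly in simulation: the mean of $X$ still matches $n(\lambda p_1 + (1-\lambda)p_2)$, but by the law of total variance the shared coin-$0$ outcome adds a $\lambda(1-\lambda)n^2(p_1-p_2)^2$ term, so the variance far exceeds the binomial value $np(1-p)$. A sketch with illustrative parameters:

```python
import random
from statistics import mean, pvariance

def shared_coin0_run(n, lam, p1, p2, rng):
    # Variant: coin 0 is tossed ONCE; its outcome picks the coin used
    # for all n trials, so the trials are only conditionally IID.
    p = p1 if rng.random() < lam else p2
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(2)
n, lam, p1, p2 = 20, 0.3, 0.8, 0.1
p = lam * p1 + (1 - lam) * p2     # 0.31
counts = [shared_coin0_run(n, lam, p1, p2, rng) for _ in range(50_000)]
print(mean(counts))        # still close to n*p = 6.2
print(pvariance(counts))   # far above the binomial variance n*p*(1-p) = 4.278
```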