I am studying the proofs of general properties of the EM algorithm in Dempster's 1977 EM paper [1]. There I found that the proof of Lemma 1, which is attached below, refers to a result in Rao's statistical inference book [2].
But when I checked formula (1e5.6), I found that it is just Jensen's inequality. Formula (1e6.6), which is also attached below, carries the extra applicability condition $\int_S(f-g)\geq0$, so I would expect a corresponding relationship to be required between $\phi$ and $\phi'$. In contrast, Dempster's Lemma 1 states that $\phi$ and $\phi'$ can be any pair, which confuses me. Could you please tell me where my reasoning goes wrong, or give a complete proof of this lemma?
I also found a related question on this topic: Jensen's inequality in derivation of EM algorithm. However, it is not quite the same question.
[1] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). "Maximum likelihood from incomplete data via the EM algorithm." Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22.
[2] Rao, C. R. (1973). Linear Statistical Inference and Its Applications. New York: Wiley.


I am thinking about this too. I believe the inequality can be shown using the fact that the KL divergence is always nonnegative. I am using the definitions from the Wikipedia article on the EM algorithm here; $z$ is the hidden variable and $x$ is the observed variable.
$$ H(\phi | \phi^{(t)}) \geq H(\phi^{(t)} | \phi^{(t)}) $$ Inserting the definitions and multiplying by $-1$ (which flips the inequality), we get $$ \sum_z p(z|x,\phi^{(t)}) \log p(z|x,\phi) \leq \sum_z p(z|x,\phi^{(t)}) \log p(z|x,\phi^{(t)}) $$ Combining both sums on the left-hand side gives $$ \sum_z p(z|x,\phi^{(t)}) \log \frac {p(z|x,\phi)} {p(z|x,\phi^{(t)})} \leq 0 $$ This is, up to sign, exactly the KL divergence, which is always nonnegative: $$ \sum_z p(z|x,\phi^{(t)}) \log \frac {p(z|x,\phi^{(t)})} {p(z|x,\phi)} \geq 0 $$
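The key point is that this inequality holds for *any* pair of distributions over $z$, which is why Lemma 1 needs no side condition relating $\phi$ and $\phi'$. Here is a small numerical sanity check of that claim; the arrays `p_t` and `p` are just arbitrary discrete distributions standing in for $p(z|x,\phi^{(t)})$ and $p(z|x,\phi)$ (the names are illustrative, not from the paper):

```python
# Sanity check: for arbitrary discrete distributions p_t and p,
#   -sum_z p_t(z) log p(z)  >=  -sum_z p_t(z) log p_t(z)
# (Gibbs' inequality), equivalently KL(p_t || p) >= 0.
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(p, q):
    """H(p, q) = -sum_z p(z) log q(z)."""
    return -np.sum(p * np.log(q))

for _ in range(1000):
    # Two random distributions over 5 latent states z; no relationship
    # between them is imposed, mirroring "any pair phi, phi'".
    p_t = rng.dirichlet(np.ones(5))  # plays the role of p(z | x, phi_t)
    p   = rng.dirichlet(np.ones(5))  # plays the role of p(z | x, phi)

    # Cross-entropy is minimized when the second argument equals the first.
    assert cross_entropy(p_t, p) >= cross_entropy(p_t, p_t) - 1e-12

    # Equivalently, the KL divergence is nonnegative.
    kl = np.sum(p_t * np.log(p_t / p))
    assert kl >= -1e-12

print("all checks passed")
```

This also answers the worry about Rao's condition $\int_S(f-g)\geq0$: when $f$ and $g$ are both probability densities, that condition holds automatically, since both integrate to one.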
This is also what the Jensen's inequality argument you mentioned amounts to: applying Jensen's inequality to the concave function $\log$ is one standard way of proving that the KL divergence is nonnegative.
Please note that I am fairly new to this topic and not entirely sure about the above solution.