In the original derivation of the Expectation Maximization (EM) algorithm by Dempster, Laird, and Rubin (J. Royal Stat. Soc. B, 1977), the distribution of the incomplete data is given as (Equation 1.1)
$$g(\mathbf{y};\mathbf{\phi}) = \int_{\mathcal{X}(\mathbf{y})}f(\mathbf{x};\mathbf{\phi})\, d\mathbf{x},$$
where $g(\mathbf{y};\mathbf{\phi})$ is the distribution of the observed incomplete data vector, $\mathbf{y}$, with an unknown parameter vector, $\mathbf{\phi}$. The distribution of the unobserved complete data is denoted $f(\mathbf{x};\mathbf{\phi})$, where $\mathbf{x}$ is the unobservable complete data vector, and $\mathcal{X}(\mathbf{y})$ is the subset of the complete-data space determined by $\mathbf{y}$, i.e. the preimage $\{\mathbf{x} : H(\mathbf{x}) = \mathbf{y}\}$. The transformation from $\mathbf{x}$ to $\mathbf{y}$, $\mathbf{y} = H(\mathbf{x})$, is a many-to-one mapping, so it is not invertible.
I am having a hard time understanding this equation. First of all, it seems that $g(\mathbf{y};\mathbf{\phi})$ is a probability and not a density, since the equation takes the form of integrating the complete-data distribution, $f(\mathbf{x};\mathbf{\phi})$, over a subset of its domain, $\mathcal{X}(\mathbf{y})$.
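To make my confusion concrete, here is a small numerical check I tried (my own toy example with a bivariate standard normal, not from the paper). If the integral in (1.1) is read as a one-dimensional integral along the preimage line, it reproduces the density of $\mathbf{y}$ rather than a probability, which is exactly what I am unsure about:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Toy complete data: x = (x1, x2) with independent standard normal
# components, so f(x) = N(x1; 0, 1) * N(x2; 0, 1) (no free phi here).
def f(x1, x2):
    return norm.pdf(x1) * norm.pdf(x2)

# Many-to-one mapping y = H(x) = x1 + x2. The preimage X(y) is the
# line {(x1, x2) : x1 + x2 = y}; parameterizing it by x1 turns the
# "integral over X(y)" into a 1-D integral in x1.
y = 1.3
g_y, _ = quad(lambda x1: f(x1, y - x1), -np.inf, np.inf)

# Compare with the density of Y = X1 + X2 ~ N(0, 2):
print(g_y)                            # ~ 0.1849
print(norm.pdf(y, scale=np.sqrt(2)))  # ~ 0.1849, they agree
```

So in this toy case the integral over $\mathcal{X}(\mathbf{y})$ (with respect to the measure $dx_1$ on the line) gives the density of $\mathbf{y}$, not a probability, which makes the notation $d\mathbf{x}$ in (1.1) even more confusing to me.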
Also, in the subsequent key Equation 2.5, the conditional density expression
$$k(\mathbf{x}|\mathbf{y};\mathbf{\phi}) = \frac{f(\mathbf{x};\mathbf{\phi})}{g(\mathbf{y};\mathbf{\phi})}$$ is used to derive the likelihood of the unknown parameter vector, $\mathbf{\phi}$. Why does this equation not have the joint distribution, $h(\mathbf{x}, \mathbf{y};\mathbf{\phi})$, in its numerator? This implicitly assumes that the conditional distribution $q(\mathbf{y}|\mathbf{x};\mathbf{\phi})$ equals $1$, since $h(\mathbf{x}, \mathbf{y};\mathbf{\phi})=q(\mathbf{y}|\mathbf{x};\mathbf{\phi})f(\mathbf{x};\mathbf{\phi})$. I think that given the complete data, $\mathbf{x}$, the observation $\mathbf{y}$ is no longer random, so $q(\mathbf{y}|\mathbf{x};\mathbf{\phi})=\delta(\mathbf{y}-H(\mathbf{x}))\neq 1$, where $\delta(\cdot)$ is the Dirac delta function.
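To spell out my reasoning (my own notation, not from the paper): treating $\mathbf{y}$ as a deterministic function of $\mathbf{x}$, I would write the degenerate joint density and its marginal as

```latex
% Degenerate joint density when y = H(x) is deterministic given x:
h(\mathbf{x}, \mathbf{y};\mathbf{\phi})
  = q(\mathbf{y}|\mathbf{x};\mathbf{\phi})\, f(\mathbf{x};\mathbf{\phi})
  = \delta\!\big(\mathbf{y} - H(\mathbf{x})\big)\, f(\mathbf{x};\mathbf{\phi}),
% and marginalizing over x concentrates the integral on the
% preimage X(y) = {x : H(x) = y}, which looks like Equation 1.1:
g(\mathbf{y};\mathbf{\phi})
  = \int \delta\!\big(\mathbf{y} - H(\mathbf{x})\big)\, f(\mathbf{x};\mathbf{\phi})\, d\mathbf{x}.
```

Under this reading, Bayes' rule would give $k(\mathbf{x}|\mathbf{y};\mathbf{\phi}) = h(\mathbf{x}, \mathbf{y};\mathbf{\phi})/g(\mathbf{y};\mathbf{\phi})$, with the delta factor still present in the numerator, so I do not see how the paper arrives at $f(\mathbf{x};\mathbf{\phi})/g(\mathbf{y};\mathbf{\phi})$ alone.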
In other derivations of the EM algorithm, for example the 1998 notes by Thomas P. Minka, the algorithm is interpreted and derived as lower-bound maximization, and the joint distribution does appear in the numerator. However, that derivation addresses settings with a latent random variable, rather than a non-invertible transformation between the complete and incomplete data.
I think I am missing some basic or fundamental concepts, but I just cannot figure it out. Can you please help?