After reading a significant number of papers on probabilistic methods in machine learning, I still find some of the notation around MLE vague. So I decided to ask this question once and for all, hoping it will be useful for me and for other readers.
Let $X = \{x_{i}\}_{i=1}^{N}$ be the set of $N$ data points $x_{i}$, and let $Y = \{y_{i}\}_{i=1}^{N}$ be the corresponding set of labels $y_{i}$, such that $x_{i} \in \mathbb{R}^{V}$ and $y_{i} \in \{0, 1, ..., K\}$. We can define the likelihood function as follows:
$\mathcal{L}(\theta) = \prod_{i=1}^{N} p(y_{i}| x_{i}, \theta) \ \ \ \ $ Eqn.(1)
$ \ \ \ \ \ \ \ \ = P(Y|X, \theta)$
And the MLE is $\hat{\theta} = \arg\max_{\theta} \mathcal{L}(\theta)$ (which can be obtained with some optimization algorithm or, for specific models, from a closed-form solution).
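To make Eqn. (1) concrete, here is a minimal sketch of how I understand the likelihood and its maximization. It assumes a softmax (multinomial logistic) model for $p(y_{i} | x_{i}, \theta)$ and plain gradient ascent; the function names, the synthetic data, and the step size are illustrative choices of mine, not taken from any particular paper.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax; subtracting the row max avoids overflow in exp.
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def log_likelihood(theta, X, y):
    # log L(theta) = sum_i log p(y_i | x_i, theta), i.e. the log of Eqn. (1).
    probs = softmax(X @ theta)                      # shape (N, n_classes)
    return np.log(probs[np.arange(len(y)), y]).sum()

def fit_mle(X, y, n_classes, lr=0.1, n_steps=500):
    # Gradient ascent on the average log-likelihood (assumed optimizer).
    N, V = X.shape
    theta = np.zeros((V, n_classes))
    onehot = np.eye(n_classes)[y]                   # shape (N, n_classes)
    for _ in range(n_steps):
        probs = softmax(X @ theta)
        grad = X.T @ (onehot - probs) / N           # gradient of mean log-likelihood
        theta += lr * grad
    return theta

# Synthetic data (purely illustrative): N=200 points, V=3 features, 4 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = rng.normal(size=(3, 4))
y = np.array([rng.choice(4, p=p) for p in softmax(X @ true_theta)])

theta_hat = fit_mle(X, y, n_classes=4)
print(log_likelihood(np.zeros((3, 4)), X, y), log_likelihood(theta_hat, X, y))
```

Working with the sum of logs rather than the product in Eqn. (1) gives the same maximizer and avoids numerical underflow for large $N$.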
In Eqn. (1) we implicitly assumed that the data points are i.i.d. Moreover, if we intend to take uncertainty into account, we could assume the existence of a set of latent/hidden variables $Z=\{z_{j}\}_{j=1}^{N}$ in the data-generating process (where $N$ is the number of latent variables). Therefore, we should modify the likelihood as follows:
$\mathcal{L}(\theta) = \prod_{i=1}^{N} p(y_{i}| x_{i}, z_{i}, \theta) p(z_{i} | x_{i}, \theta) \ \ \ \ $ Eqn.(2)
where each data point should be marginalized over all the latent variables $z_{j}$, concretely,
$ p(y_{i} | x_{i}) = \sum_{j=1}^{M} p(y_{i}| x_{i}, z_{j}, \theta) p(z_{j} | x_{i}, \theta) $
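As a sanity check on what I mean by marginalizing, here is a small sketch assuming the latent variable is discrete with $M$ possible values, so that the sum above is finite. The two probability tables are made-up numbers for illustration only, and `marginal_likelihood` is a hypothetical helper name.

```python
import numpy as np

def marginal_likelihood(p_y_given_xz, p_z_given_x):
    # p_y_given_xz[i, j] = p(y_i | x_i, z_j, theta)
    # p_z_given_x[i, j]  = p(z_j | x_i, theta), each row sums to 1
    # Sum over the latent index j for every data point i.
    per_point = (p_y_given_xz * p_z_given_x).sum(axis=1)
    # Under the i.i.d. assumption the log-likelihood is the sum of the logs.
    return per_point, np.log(per_point).sum()

# Made-up tables for N=2 data points and M=2 latent values.
p_y_given_xz = np.array([[0.9, 0.2],
                         [0.5, 0.5]])
p_z_given_x = np.array([[0.5, 0.5],
                        [0.3, 0.7]])

per_point, log_lik = marginal_likelihood(p_y_given_xz, p_z_given_x)
print(per_point, log_lik)
```

Here each entry of `per_point` is one marginal $p(y_{i} | x_{i})$, and the product over $i$ of these marginals is the quantity I would then maximize over $\theta$.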
Now the questions are as follows:
- Is it correct to write $\theta$ on the RHS of these equations?
- Is Eqn. (3) below the same as Eqn. (2)? I.e., what does it mean if we do not write down $\theta$ on the RHS (as is sometimes done in papers, e.g. https://arxiv.org/pdf/1502.03044.pdf, Eq. (10))?
- Are my notation and assumptions for taking uncertainty into account, i.e. for converting Eqn. (1) into Eqn. (2), correct?
- Is it correct that we should marginalize each data point over all the latent variables?
$\mathcal{L}(\theta) = \prod_{i=1}^{N} p(y_{i}| x_{i}, z_{i}) p(z_{i} | x_{i}) \ \ \ \ $ Eqn.(3)
where $\theta$ denotes the model's parameter(s).