I'm quite new to probability theory, but I'm reading about deep learning and trying to understand some basic concepts.
The authors write down Bayes' rule in the following form:
$$ p(\theta|D) = \frac{p(\theta)p(D|\theta)}{p(D)} = \frac{p(\theta)p(D|\theta)}{\int_{\theta \in \Theta}p(D|\theta)p(\theta)d\theta}$$
where:
- $\theta$ - parameters of the model
- $D$ - input data
- $p(\theta)$ - prior probability
- $p(D|\theta)$ - likelihood
- $p(\theta|D)$ - posterior probability
- $p(D) = \int_{\theta \in \Theta}p(D|\theta)p(\theta)d\theta$ - evidence
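To make the pieces concrete, here is a minimal numerical sketch of Bayes' rule for a hypothetical coin-flip model (all the specifics - the Beta(2, 2) prior, the data of 7 heads in 10 flips - are illustrative choices, not from the book). The evidence $p(D)$ is approximated by a Riemann sum over a grid of $\theta$ values, and the posterior is the prior times the likelihood divided by that evidence:

```python
import math

# Hypothetical coin-flip model: theta = P(heads), prior Beta(2, 2),
# data D = 7 heads out of 10 flips.
N, K = 10, 7
grid = [i / 1000 for i in range(1, 1000)]   # grid over Theta = (0, 1)
d = grid[1] - grid[0]

def prior(t):
    # Beta(2, 2) density: 6 * t * (1 - t)
    return 6 * t * (1 - t)

def likelihood(t):
    # binomial likelihood p(D | theta)
    return math.comb(N, K) * t**K * (1 - t)**(N - K)

# evidence p(D) = integral over Theta of p(D|theta) p(theta) dtheta
evidence = sum(likelihood(t) * prior(t) for t in grid) * d

# posterior p(theta|D) = p(theta) p(D|theta) / p(D)
posterior = [prior(t) * likelihood(t) / evidence for t in grid]

# sanity check: the posterior density integrates to approximately 1
print(sum(p * d for p in posterior))
```

Dividing by the evidence is exactly what turns the unnormalised product $p(\theta)p(D|\theta)$ into a density that integrates to 1 over $\Theta$.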
Now, they introduce what they call the predictive probability, denoted $p(y|D)$ or $p(y|D,x)$, where $y$ is the correct answer for the next input $x$. Then they write: $$p(y|D) = \int_{\Theta}p(y|\theta)p(\theta|D)d\theta \propto \int_{\Theta}p(y|\theta)p(\theta)p(D|\theta)d\theta$$
At this point I'm lost. Could anybody explain this marginalization in detail?
By marginalising the joint distribution you have $$ \begin{align} p(y|D) &= \int_{\Theta} p(y, \theta | D)\operatorname{d}\theta \\ &= \int_{\Theta} p(y|\theta, D)p(\theta |D)\operatorname{d}\theta, \end{align} $$ and since you seem happy with the claim that $p(\theta|D) \propto p(\theta)p(D|\theta)$, it remains to claim that $$ p(y|\theta, D)=p(y|\theta). $$

This is an assumption, but one that is normally taken to hold in the context of parameterised statistical models, where the parameters $\theta$ completely specify the conditional distribution. In the particular problem you are considering, it amounts to the assumption that "the distribution of the random variable $Y$ at the input point $x$ is completely specified by the parameter $\theta$" - that is, once you know a point $x$ and a parameter $\theta$, you know all you need to know about the distribution of $Y$. In terms of density functions: for any random variable $Z$, $p(Y|x, \theta, Z) = p(Y|x, \theta)$.
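The marginalisation can also be checked numerically. Below is a minimal sketch using a hypothetical Beta-Bernoulli model (prior Beta(2, 2), data $D$ = 7 successes in 10 trials - illustrative choices, not from the question). Here $p(y=1|\theta) = \theta$, so the predictive probability is the integral of $\theta$ against the posterior, i.e. the posterior mean, which conjugacy lets us verify in closed form:

```python
import math

# Hypothetical Beta-Bernoulli model: theta = P(y = 1), prior Beta(a, b),
# data D = K successes in N trials.
a, b, N, K = 2, 2, 10, 7
grid = [i / 10000 for i in range(1, 10000)]
d = grid[1] - grid[0]

def prior(t):
    # Beta(a, b) density; B(a, b) = gamma(a) gamma(b) / gamma(a + b)
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return t**(a - 1) * (1 - t)**(b - 1) / B

def likelihood(t):
    # binomial likelihood p(D | theta)
    return math.comb(N, K) * t**K * (1 - t)**(N - K)

evidence = sum(likelihood(t) * prior(t) for t in grid) * d
posterior = [prior(t) * likelihood(t) / evidence for t in grid]

# predictive p(y = 1 | D) = integral of p(y = 1 | theta) p(theta | D) dtheta,
# using the assumption p(y | theta, D) = p(y | theta), with p(y = 1 | theta) = theta
pred = sum(t * p for t, p in zip(grid, posterior)) * d

# conjugacy gives the exact value (a + K) / (a + b + N) = 9/14
print(pred)
```

Note how the assumption $p(y|\theta, D) = p(y|\theta)$ enters: inside the integral, the probability of the next outcome depends on the data only through the posterior over $\theta$, never on $D$ directly.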