The derivation of the posterior predictive distribution proceeds as follows:
$\begin{split}p(\tilde y\mid y) = &\int p(\tilde y, \theta\mid y)~\mathsf d\theta \\ = &\int p(\tilde y\mid\theta, y)\,p(\theta\mid y)~\mathsf d\theta \\ = &\int p(\tilde y\mid\theta)\,p(\theta\mid y)~\mathsf d\theta\end{split}$
where
- $\tilde y$ : new data for prediction
- $y$ : observed data
- $\theta$ : unknown parameter.
$p(\tilde y\mid\theta, y)$ reduces to $p(\tilde y\mid\theta)$ because $\tilde y$ and $y$ are conditionally independent given $\theta$.
Can you explain why this is the case? The way I have convinced myself is that, since $\tilde y$ is conditioned on $\theta$, everything the observed data $y$ tells us is already captured through $\theta$; and since $\tilde y$ and $y$ are independent given $\theta$, we can drop $y$ from the conditioning.
Is there a more formal explanation for this? Also, why doesn't this expression reduce to $p(\tilde y\mid\theta)\cdot p(y)$?
The unknown parameter $\theta$ is chosen precisely so that the observed data and the future data are independent once its value is given, that is, so that $p(y,\tilde y\mid\theta)=p(y\mid \theta)\,p(\tilde y\mid\theta)$.
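A tiny numerical sketch may help with the second question (why the expression does not reduce to $p(\tilde y\mid\theta)\,p(y)$): given $\theta$ the trials factorize, but *marginally* $y$ and $\tilde y$ are dependent, because each observation carries information about $\theta$. The two-point prior and the numbers below are purely illustrative assumptions.

```python
# Toy model: Bernoulli trials with an assumed two-point prior on theta.
# Given theta, y and y_tilde are independent; marginally they are not.
prior = {0.2: 0.5, 0.8: 0.5}  # hypothetical prior: theta is 0.2 or 0.8, 50/50

def bern(x, t):
    """p(x | theta = t) for a single Bernoulli trial."""
    return t if x == 1 else 1 - t

# Marginal joint: p(y=1, y_tilde=1) = sum_theta p(y=1|theta) p(y_tilde=1|theta) p(theta)
p_joint = sum(w * bern(1, t) * bern(1, t) for t, w in prior.items())

# Marginal of a single trial: p(y=1) = sum_theta p(y=1|theta) p(theta)
p_single = sum(w * bern(1, t) for t, w in prior.items())

print(p_joint, p_single ** 2)  # 0.34 vs 0.25 -> marginally dependent
```

Seeing $y=1$ makes $\theta=0.8$ more plausible, which in turn raises the probability that $\tilde y=1$; that is why the $y$-dependence survives through the posterior $p(\theta\mid y)$ rather than factoring out as $p(y)$.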
So, adding a few steps:
$\begin{align}p(\tilde y\mid y) &= \int p(\tilde y, \theta\mid y)~\mathsf d\theta &&\textsf{Law of Total Probability}\\[1ex] &= \int p(\tilde y\mid\theta, y)\,p(\theta\mid y)~\mathsf d\theta &&\textsf{Definition of Conditional Probability} \\[1ex] &= \int\dfrac{p(y,\tilde y\mid\theta)\,p(\theta\mid y)}{p(y\mid\theta)}~\mathsf d\theta &&\textsf{Definition of Conditional Probability}\\[1ex] &= \int \dfrac{p(y\mid\theta)\,p(\tilde y\mid\theta)\,p(\theta\mid y)}{p(y\mid\theta)}~\mathsf d\theta &&\textsf{Via the Conditional Independence} \\[1ex] &= \int p(\tilde y\mid\theta)\,p(\theta\mid y)~\mathsf d\theta && \textsf{Canceling the common factor}\end{align}$
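The final integral can also be checked by simulation: draw $\theta$ from the posterior, then draw $\tilde y$ from $p(\tilde y\mid\theta)$ without ever looking at $y$ again. A minimal sketch, assuming a Beta-Bernoulli model (the prior and data below are illustrative, not from the question):

```python
import random

random.seed(0)

# Hypothetical setup: Beta(a, b) prior on theta, Bernoulli likelihood.
a, b = 2.0, 2.0
n, k = 10, 7  # observed data y: k successes in n trials

# By conjugacy, the posterior p(theta | y) is Beta(a + k, b + n - k).
post_a, post_b = a + k, b + n - k

# Analytic posterior predictive: P(y_tilde = 1 | y) = E[theta | y].
analytic = post_a / (post_a + post_b)

# Monte Carlo version of the integral: theta ~ p(theta | y), then
# y_tilde ~ p(y_tilde | theta). Note y enters only through the posterior,
# which is exactly the conditional independence used in the derivation.
draws = 200_000
hits = 0
for _ in range(draws):
    theta = random.betavariate(post_a, post_b)
    hits += 1 if random.random() < theta else 0
mc = hits / draws

print(analytic, mc)  # the two estimates agree up to Monte Carlo error
```

The fact that the sampler never touches $y$ after computing the posterior is the operational meaning of replacing $p(\tilde y\mid\theta, y)$ with $p(\tilde y\mid\theta)$.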