I am new to GP/non-parametric regression. I am reading Rasmussen's book on Gaussian processes. In Eq. 2.28, 2.29 (Page 19) and in the subsequent passage he writes the marginal likelihood as the integral of likelihood times prior.
$$ p(\textbf{y}|X) = \int p(\textbf{y}|\textbf{f},X)p(\textbf{f}|X)\,d\textbf{f} $$
I understand the prior is the GP prior (the model); the $p(\textbf{f}|X)$ part is clear to me: we assume it is a multivariate Gaussian $\mathcal{N}(0,\textbf{K})$. But immediately after this the book says that the likelihood is a factorized Gaussian of the form $\textbf{y}|\textbf{f} \sim \mathcal{N}(\textbf{f},\sigma_{n}^{2}\textbf{I})$. I don't see how we can make this jump directly. Where does this come from?
I know the regression model looks like $y = f(x) + \epsilon_{n}$. Our modelling assumption for $f(x)$ is the same GP prior as above, $\mathcal{N}(0,\textbf{K})$. The noise terms are i.i.d. Gaussian, $\mathcal{N}(0,\sigma_{n}^{2})$. So the distribution of the targets $\textbf{y}$ is $$\mathcal{N}(0,\textbf{K} + \sigma_{n}^{2}\textbf{I}).$$ The way I understand this: $\textbf{y}$ is the sum of two independent normal random vectors, so we can use the property that the sum of independent normals is again normal.
But I don't get the "$\textbf{y}|\textbf{f} \sim \mathcal{N}(\textbf{f},\sigma_{n}^{2}\textbf{I})$" part from this discussion. Could anyone please explain how we can assume this directly? I am sorry if this is a silly question, but I am really new to this.
Thanks in advance.
The link to the book: http://www.gaussianprocess.org/gpml/
The key is the assumption of additive, independent, identically distributed Gaussian noise $\epsilon_n$, i.e. the assumption that the observations are given by $\textbf{y} = \textbf{f} + \epsilon_{n}$, where $\epsilon_n \sim \mathcal{N}(0,\sigma_{n}^{2}I)$ is independent of $\textbf{f}$. It should be intuitively clear that if you know the noise-free value $\textbf{f}$, then you should expect the observation $\textbf{y}$ to be Gaussian, centered on $\textbf{f}$, with the covariance matrix of the noise.
We can show this more rigorously by deriving the conditional distribution $\textbf{y|f}$. First, note the distribution of $\textbf{f}$ and of $\epsilon_n$:
$$ \textbf{f} \sim \mathcal{N}(0, K) \\ \epsilon_n \sim \mathcal{N}(0, \sigma_n^2 I). $$
We need to know the joint distribution of $\textbf{f}$ and $\textbf{y}$, so we will calculate the covariance matrices of $\textbf{y}$ and of $\textbf{y}$ with $\textbf{f}$. Since $\textbf{y} = \textbf{f} + \epsilon_n$ we have
$$ \textbf{y} \sim \mathcal{N}(0, K + \sigma_n^2 I) $$
by the properties of sums of independent Gaussian random vectors. Also,
$$ \mathrm{Cov}(\textbf{y}, \textbf{f}) = \mathrm{Cov}(\textbf{f}, \textbf{f}) + \mathrm{Cov}(\epsilon_n, \textbf{f}) = K + 0 = K. $$
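These two covariance claims are easy to check numerically. Below is a rough Monte Carlo sketch; the kernel matrix $K$ (an RBF kernel on a few made-up inputs) and the noise variance are arbitrary choices for illustration, not values from the book:

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary positive-definite kernel matrix K (RBF kernel on 4 inputs)
x = np.linspace(0.0, 1.0, 4)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.4 ** 2)
sigma_n2 = 0.25  # noise variance sigma_n^2

# Draw many samples of f ~ N(0, K) and eps ~ N(0, sigma_n^2 I), set y = f + eps
n = 200_000
L = np.linalg.cholesky(K + 1e-12 * np.eye(4))  # tiny jitter for stability
f = rng.standard_normal((n, 4)) @ L.T
eps = np.sqrt(sigma_n2) * rng.standard_normal((n, 4))
y = f + eps

# Empirical covariances (means are zero, so no centering needed);
# these should be close to K + sigma_n^2 I and to K respectively.
cov_yy = (y.T @ y) / n
cov_yf = (y.T @ f) / n
print(np.max(np.abs(cov_yy - (K + sigma_n2 * np.eye(4)))))
print(np.max(np.abs(cov_yf - K)))
```

Both printed deviations should be on the order of the Monte Carlo error (roughly $1/\sqrt{n}$), confirming $\mathrm{Cov}(\textbf{y},\textbf{y}) = K + \sigma_n^2 I$ and $\mathrm{Cov}(\textbf{y},\textbf{f}) = K$.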
Collecting the results above, the joint distribution of $\textbf{y}$ and $\textbf{f}$ is
$$ \begin{bmatrix} \textbf{y} \\ \textbf{f} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & K \\ K & K \end{bmatrix} \right) $$
where in the bottom left corner we used the fact that $K = K^T$.
Finally, we find the conditional distribution $\textbf{y}|\textbf{f}$ using the Gaussian conditioning identity (A.6) on page 200, Appendix A.2, of Rasmussen:
$$ \textbf{y} | \textbf{f} \sim \mathcal{N}(0 + K K^{-1}(\textbf{f} - 0), K + \sigma_n^2 I - KK^{-1}K^T) \\ \textbf{y} | \textbf{f} \sim \mathcal{N}(\textbf{f}, \sigma_n^2 I) $$
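The algebraic simplification in the last step ($K K^{-1}\textbf{f} = \textbf{f}$ and $K + \sigma_n^2 I - K K^{-1} K^T = \sigma_n^2 I$) can be verified directly in code. Here is a small sketch with an arbitrary RBF kernel matrix standing in for $K$:

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary positive-definite kernel matrix K (RBF kernel on 5 inputs)
x = np.linspace(0.0, 1.0, 5)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3 ** 2)
sigma_n2 = 0.1  # noise variance sigma_n^2

f = rng.standard_normal(5)  # an arbitrary "known" latent vector f

# Apply identity (A.6) to the joint of (y, f):
#   mean_{y|f} = K K^{-1} f
#   cov_{y|f}  = (K + sigma_n^2 I) - K K^{-1} K^T
mean_cond = K @ np.linalg.solve(K, f)
cov_cond = K + sigma_n2 * np.eye(5) - K @ np.linalg.solve(K, K.T)

# These reduce to f and sigma_n^2 I, as the derivation shows.
print(np.allclose(mean_cond, f))                    # True
print(np.allclose(cov_cond, sigma_n2 * np.eye(5)))  # True
```

Using `np.linalg.solve` rather than forming $K^{-1}$ explicitly is the standard numerically stable way to compute these products.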
as expected.