This is from Section 1.2 of Christopher Bishop's Pattern Recognition and Machine Learning. For background, we fit data using a polynomial function of the form $$y(x, \mathbf w)=w_0+w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^M w_j x^j.$$
The goal in the curve fitting problem is to make predictions for the target variable $t$ given some new value of the input variable $x$, on the basis of a set of training data comprising $N$ input values $\mathbf x=(x_1,\dots , x_N)^T$ and their corresponding target values $\mathbf t=(t_1,\dots, t_N)^T$. We assume that, given the value of $x$, the corresponding value of $t$ has a Gaussian distribution with mean equal to the value $y(x,\mathbf w)$ of the polynomial curve given above and with precision (inverse variance) $\beta$, so that $$p(t|x,\mathbf w,\beta)=\mathscr N(t|y(x,\mathbf w),\beta^{-1}).$$
Then, given the training data $\{\mathbf x, \mathbf t\}$, the likelihood function is $$p(\mathbf t|\mathbf x, \mathbf w, \beta)=\prod_{n=1}^N \mathscr N(t_n|y(x_n,\mathbf w), \beta^{-1}).$$
Now we introduce a prior distribution over the polynomial coefficients $\mathbf w$, taken to be a Gaussian of the form $$p(\mathbf w|\alpha)=\mathscr N(\mathbf w|\mathbf 0, \alpha^{-1}\mathbf I)=\left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\left\{-\frac{\alpha}{2}\mathbf w^T \mathbf w\right\}.$$
Then using Bayes' theorem the posterior distribution for $\mathbf w$ is proportional to the product of the prior distribution and the likelihood function $$p(\mathbf w|\mathbf x, \mathbf t, \alpha, \beta) \propto p(\mathbf t|\mathbf x, \mathbf w, \beta)p(\mathbf w|\alpha).$$
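Although the question does not require it, it may help to note that because both factors on the right are Gaussian in $\mathbf w$, the posterior is itself Gaussian. This is a standard completing-the-square result (Bishop works it out for general basis functions later in the book); the design-matrix notation $\boldsymbol\Phi$ here is my own shorthand, not from Section 1.2: $$p(\mathbf w|\mathbf x, \mathbf t, \alpha, \beta)=\mathscr N(\mathbf w|\mathbf m_N, \mathbf S_N), \qquad \mathbf S_N^{-1}=\alpha\mathbf I+\beta\boldsymbol\Phi^T\boldsymbol\Phi, \qquad \mathbf m_N=\beta\,\mathbf S_N\boldsymbol\Phi^T\mathbf t,$$ where $\boldsymbol\Phi$ is the $N\times(M+1)$ matrix whose $n$-th row is $(1, x_n, x_n^2, \dots, x_n^M)$.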
In the curve fitting problem, we are given the training data $\mathbf x$ and $\mathbf t$, along with a new test point $x$, and our goal is to predict the value of $t$. We therefore wish to evaluate the predictive distribution $p(t|x,\mathbf x, \mathbf t)$. Here we shall assume that the parameters $\alpha$ and $\beta$ are fixed and known in advance, and so omit the dependence on them to simplify the notation. The text then states that the predictive distribution is $$p(t|x,\mathbf x,\mathbf t)=\int p(t|x,\mathbf w)p(\mathbf w|\mathbf x, \mathbf t)\,d\mathbf w,$$ where $p(t|x,\mathbf w)$ and $p(\mathbf w| \mathbf x, \mathbf t)$ are as given above. But I can't see how marginalizing over $\mathbf w$ on the RHS yields the LHS. Does this follow purely from the rules of probability, or does it require properties of the Gaussian densities?
I've been stuck on this for a while and would greatly appreciate some help.
\begin{align} p(t \mid x, \mathbf{x}, \mathbf{t}) &= \int p(t, \mathbf{w} \mid x, \mathbf{x}, \mathbf{t}) \, d\mathbf{w} \\ &= \int p(t\mid \mathbf{w}, x, \mathbf{x}, \mathbf{t}) p(\mathbf{w} \mid x, \mathbf{x}, \mathbf{t}) \, d\mathbf{w} \\ &= \int p(t \mid x, \mathbf{w}) p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}) \, d\mathbf{w}. \end{align}
The first two equalities follow from the axioms of probability (the sum and product rules) and are not specific to this setup.
For the last equality, it helps to remember that $x$ is a new input independent of the past data $(\mathbf{x}, \mathbf{t})$, and that the distribution of $t$ depends only on $x$ and $\mathbf{w}$. Specifically, $$p(t\mid \mathbf{w}, x, \mathbf{x}, \mathbf{t}) = p(t\mid x, \mathbf{w}) \qquad\text{and}\qquad p(\mathbf{w}\mid x, \mathbf{x}, \mathbf{t}) = p(\mathbf{w}\mid \mathbf{x}, \mathbf{t}).$$
It may help to draw a diagram of the dependencies.
Update:
It may help to distinguish between dependencies in the data-generating process and probabilistic dependence; my language above was a little loose.
You can think of the data-generating process as the following happening under the hood: first, a coefficient vector $\mathbf{w}$ is drawn once from the prior $p(\mathbf{w}\mid\alpha)$; then, for each training input $x_n$, the target $t_n$ is drawn independently from $\mathscr N(t_n\mid y(x_n,\mathbf{w}),\beta^{-1})$; finally, for the new input $x$, the target $t$ is drawn from $\mathscr N(t\mid y(x,\mathbf{w}),\beta^{-1})$, independently of the training targets once $\mathbf{w}$ is fixed.
From here, you can see that once $\mathbf{w}$ and $x$ are known, $t$ carries no further dependence on $(\mathbf{x}, \mathbf{t})$, and that $\mathbf{w}$ is independent of the new input $x$ on its own. These are exactly the two simplifications used in the last equality.
The "Graphical Models" chapter of your book discusses conditional independence relationships like this in more detail; it may help to jump ahead to that chapter.
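If it helps to see the identity numerically, here is a sketch in Python/NumPy: draw $\mathbf w$ from the Gaussian posterior many times and average, and the Monte Carlo predictive mean and variance match the closed-form Gaussian predictive. The toy data, the degree $M$, and the values of $\alpha$ and $\beta$ below are my own choices, not from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
M, alpha, beta = 3, 2.0, 25.0  # assumed degree and hyperparameters

# Toy training data: N = 10 inputs with Gaussian noise of precision beta.
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, beta ** -0.5, 10)

def phi(x):
    """Polynomial feature vector(s) (1, x, ..., x^M)."""
    return np.power.outer(np.atleast_1d(x), np.arange(M + 1))

Phi = phi(x_train)                                              # N x (M+1)
S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)   # posterior cov
m = beta * S @ Phi.T @ t_train                                  # posterior mean

x_new = 0.3
phi_new = phi(x_new)[0]

# Closed-form Gaussian predictive at the new input x.
mean_exact = m @ phi_new
var_exact = 1 / beta + phi_new @ S @ phi_new

# Monte Carlo version of the marginalization: sample w from the posterior,
# then average over p(t | x, w) = N(t | y(x, w), 1/beta).
w_samples = rng.multivariate_normal(m, S, size=200_000)
y_samples = w_samples @ phi_new
mean_mc = y_samples.mean()
var_mc = y_samples.var() + 1 / beta

print(mean_exact, mean_mc)  # agree up to Monte Carlo error
print(var_exact, var_mc)
```

The agreement is just the integral $\int p(t|x,\mathbf w)\,p(\mathbf w|\mathbf x,\mathbf t)\,d\mathbf w$ evaluated by sampling, so it illustrates that the identity itself needs only the sum and product rules; the Gaussians only make the integral tractable in closed form.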
Response to comments: