Distribution of the target given all data and respective targets and new input - Bishop's PRML book


In Section 1.2.6, Bishop gives equation (1.68):

$$p(t|x,\mathbb{x},\mathbb{t}) = \int p(t|x,\mathbb{w})p(\mathbb{w}|\mathbb{x},\mathbb{t}) d\mathbb{w} \tag{1.68}$$

This is just the distribution over $t$ given the new input $x$, the training inputs $\mathbb{x}$ and their respective targets $\mathbb{t}$. However, the author does not explain why the LHS is equal to the RHS. I would like to know precisely how to arrive at the expression on the RHS, starting from the expression on the LHS.

To give some more details: $$p(t|x,\mathbb{w})= \mathcal{N}(t|y(x,\mathbb{w}),\beta^{-1}) \tag{1.60}$$ and $$p(\mathbb{w}|\mathbb{x},\mathbb{t},\alpha,\beta) \propto p(\mathbb{t}|\mathbb{x},\mathbb{w},\beta)p(\mathbb{w}|\alpha)$$ i.e. the posterior distribution is proportional to the likelihood times the prior. So we would obtain the $p(\mathbb{w}|\mathbb{x},\mathbb{t})$ term in the integral by normalizing the right-hand side of this proportionality, where the likelihood is built from (1.60) and the dependence on $\alpha$ and $\beta$ is suppressed in the notation.

Best answer

The problem is one of predicting a new target $t$ at a given, non-random input point $x$, where $\mathbb{w}$ is distributed according to the posterior obtained after observing the targets $\mathbb{t}$ at the input points $\mathbb{x}$. The posterior depends only on those input points for which a target has also been observed, so the new input $x$ on its own carries no information about $\mathbb{w}$, that is
$$ p(\mathbb{w}|x,\mathbb{x},\mathbb{t})=p(\mathbb{w}|\mathbb{x},\mathbb{t}). $$
Marginalising over $\mathbb{w}$, then applying the product rule, and finally using the identity above, you have
$$ \begin{aligned} p(t|x,\mathbb{x},\mathbb{t}) &= \int p(t, \mathbb{w} | x, \mathbb{x}, \mathbb{t} ) d\mathbb{w} \\ &= \int p(t | \mathbb{w}, x, \mathbb{x}, \mathbb{t} ) p(\mathbb{w} | x, \mathbb{x}, \mathbb{t}) d\mathbb{w} \\ &= \int p(t|\mathbb{w},x,\mathbb{x},\mathbb{t})p(\mathbb{w}|\mathbb{x},\mathbb{t})d\mathbb{w}. \end{aligned} $$
Now it follows from $t|x,\mathbb{w} \sim \mathcal{N}\left(y(x,\mathbb{w}), \beta^{-1} \right)$ that the distribution of the target $t$ at a new point $x$ depends only on that point and the parameters $\mathbb{w}$ (and on the parameter $\beta$, which is suppressed here to keep the notation compact). Therefore
$$ p(t|\mathbb{w},x,\mathbb{x},\mathbb{t})= p(t|\mathbb{w},x), $$
which gives the quoted result.

In summary, one goes from the fully specified conditionals to their more compact representations by dropping information that becomes redundant due to the conditional independence properties of the model. For example, conditional on the specific parameters $\mathbb{w}$ of the model, knowing the training inputs $\mathbb{x}$ and targets $\mathbb{t}$ from which the posterior information about these parameters arose gives us no new information about the distribution of $t$ at a new point $x$. This is because, conditional on their respective input points and the model parameters, the targets are mutually independent:
$$ \begin{aligned} p(t, \mathbb{t}|x,\mathbb{x},\mathbb{w} ) &= \mathcal{N}(t|y(x,\mathbb{w}),\beta^{-1})\prod_{i \; : \; x_i \in \mathbb{x} }\mathcal{N}\left(t_i | y(x_i, \mathbb{w}), \beta^{-1} \right) \\ &= p(t|x,\mathbb{w})p(\mathbb{t}|\mathbb{x},\mathbb{w}). \end{aligned} $$
Dividing both sides by $p(\mathbb{t}|\mathbb{x},\mathbb{w})$ recovers $p(t|\mathbb{w},x,\mathbb{x},\mathbb{t}) = p(t|\mathbb{w},x)$.
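
As a numerical sanity check of (1.68), here is a minimal Python sketch. It is not taken from the book: it assumes a one-parameter model $y(x,w)=wx$ with a zero-mean Gaussian prior of precision $\alpha$ on $w$, for which the posterior over $w$ and the predictive distribution are both Gaussian with standard closed forms. The function names, the data values and the grid settings are all illustrative choices.

```python
import math


def normal_pdf(z, mean, var):
    """Density of N(z | mean, var)."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior(xs, ts, alpha, beta):
    """Mean and variance of p(w | xs, ts) for the assumed model y(x, w) = w * x."""
    precision = alpha + beta * sum(x * x for x in xs)
    var = 1.0 / precision
    mean = beta * var * sum(x * t for x, t in zip(xs, ts))
    return mean, var


def predictive_by_integration(t, x, xs, ts, alpha, beta, n_grid=4001, width=10.0):
    """Right-hand side of (1.68): trapezoid rule over w for p(t|x,w) p(w|xs,ts)."""
    m, s = posterior(xs, ts, alpha, beta)
    sd = math.sqrt(s)
    lo, hi = m - width * sd, m + width * sd  # grid covering the posterior mass
    h = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        w = lo + i * h
        weight = 0.5 if i in (0, n_grid - 1) else 1.0  # trapezoid end points
        total += weight * normal_pdf(t, w * x, 1.0 / beta) * normal_pdf(w, m, s)
    return total * h


def predictive_closed_form(t, x, xs, ts, alpha, beta):
    """Left-hand side of (1.68) in closed form: N(t | m x, 1/beta + x^2 s)."""
    m, s = posterior(xs, ts, alpha, beta)
    return normal_pdf(t, m * x, 1.0 / beta + x * x * s)


# Illustrative (made-up) training data and hyperparameters.
xs = [0.5, 1.0, 1.5, 2.0]
ts = [0.6, 0.9, 1.7, 1.9]
alpha, beta = 2.0, 25.0

numeric = predictive_by_integration(1.2, 1.2, xs, ts, alpha, beta)
closed = predictive_closed_form(1.2, 1.2, xs, ts, alpha, beta)
print(numeric, closed)
```

The two printed values should agree to within the quadrature error. Note that the training data enter `predictive_by_integration` only through the posterior over $w$, which mirrors the conditional independence argument above.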