Question on marginalizing the product of two Gaussian conditional distributions to get a predictive distribution given a new test point


This is from Section 1.2 of Christopher Bishop's Pattern Recognition and Machine Learning. To explain some background, we shall fit data using a polynomial function of the form $$y(x, \mathbf w)=w_0+w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^M w_j x^j.$$

The goal in the curve fitting problem is to be able to make predictions for the target variable $t$ given some new value of the input variable $x$ on the basis of a set of training data comprising $N$ input values $\mathbf x=(x_1,\dots , x_N)^T$ and their corresponding target values $\mathbf t=(t_1,\dots, t_N)^T$. Now we shall assume that given the value of $x$, the corresponding value of $t$ has a Gaussian distribution with a mean equal to the value $y(x,\mathbf w)$ of the polynomial curve given by above thus we have $$p(t|x,\mathbf w,\beta)=\mathscr N(t|y(x,\mathbf w),\beta^{-1}).$$

Then, given the training data $\{\mathbf x, \mathbf t\}$ and assuming the data points are drawn independently, the likelihood function is given by $$p(\mathbf t|\mathbf x, \mathbf w, \beta)=\prod_{n=1}^N \mathscr N(t_n|y(x_n,\mathbf w), \beta^{-1}).$$

Now we introduce a prior distribution over the polynomial coefficients $\mathbf w$, taken to be a Gaussian of the form $$p(\mathbf w|\alpha)=\mathscr N(\mathbf w|\mathbf 0, \alpha^{-1}\mathbf I)=\left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\left\{-\frac{\alpha}{2}\mathbf w^T \mathbf w\right\}.$$

Then using Bayes' theorem the posterior distribution for $\mathbf w$ is proportional to the product of the prior distribution and the likelihood function $$p(\mathbf w|\mathbf x, \mathbf t, \alpha, \beta) \propto p(\mathbf t|\mathbf x, \mathbf w, \beta)p(\mathbf w|\alpha).$$
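(For concreteness: in this conjugate Gaussian setting the posterior can be computed in closed form. With design matrix $\Phi$ where $\Phi_{nj}=x_n^j$, the posterior is $\mathscr N(\mathbf w|\mathbf m, \mathbf S)$ with $\mathbf S^{-1}=\alpha\mathbf I+\beta\Phi^T\Phi$ and $\mathbf m=\beta\mathbf S\Phi^T\mathbf t$, a standard result derived later in the book. A rough numerical sketch, where the data, $\alpha$, $\beta$, and $M$ are all arbitrary illustrative choices, not values from the text:

```python
import numpy as np

# Illustrative only: the data, alpha, beta, and M below are arbitrary choices.
rng = np.random.default_rng(0)
M, alpha, beta = 3, 2.0, 25.0
x_train = rng.uniform(0.0, 1.0, size=10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, beta ** -0.5, size=10)

# Design matrix: Phi[n, j] = x_n ** j, for j = 0..M
Phi = np.vander(x_train, M + 1, increasing=True)

# Posterior p(w | x, t) = N(w | m, S), the normalized product of
# the Gaussian likelihood and the Gaussian prior:
S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)  # covariance
m = beta * S @ Phi.T @ t_train                                 # mean
```
)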

In the curve fitting problem, we are given the training data $\mathbf x$ and $\mathbf t$, along with a new test point $x$ and our goal is to predict the value of $t$. We therefore wish to evaluate the predictive distribution $p(t|x,\mathbf x, \mathbf t)$. Here we shall assume that the parameters $\alpha$ and $\beta$ are fixed and known in advance and so omit the dependence on $\alpha$ and $\beta$ to simplify the notation. Now the text says that we have for the predictive distribution $$p(t|x,\mathbf x,\mathbf t)=\int p(t|x,\mathbf w)p(\mathbf w|\mathbf x, \mathbf t)d\mathbf w$$ where $p(t|x,\mathbf w)$ and $p(\mathbf w| \mathbf x, \mathbf t)$ are given as above. But I can't see how marginalizing $\mathbf w$ on the RHS gives the LHS. Is this a consequence of integration or does this require properties of the Gaussian densities?

I've been stuck on this for a while; I would greatly appreciate some help.

Accepted answer:

\begin{align} p(t \mid x, \mathbf{x}, \mathbf{t}) &= \int p(t, \mathbf{w} \mid x, \mathbf{x}, \mathbf{t}) \, d\mathbf{w} \\ &= \int p(t\mid \mathbf{w}, x, \mathbf{x}, \mathbf{t}) p(\mathbf{w} \mid x, \mathbf{x}, \mathbf{t}) \, d\mathbf{w} \\ &= \int p(t \mid x, \mathbf{w}) p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}) \, d\mathbf{w}. \end{align}

The first two equalities follow from the basic rules of probability (the sum and product rules) and are not specific to this setup.

For the last equality, it helps to remember that $x$ is a new input independent of the past data $(\mathbf{x}, \mathbf{t})$, and that the distribution of $t$ depends only on $x$ and $\mathbf{w}$. Specifically,

  • $p(\mathbf{w} \mid x, \mathbf{x}, \mathbf{t}) = p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})$ because $x$ is conditionally independent of $\mathbf{w}$ (given $\mathbf{x}$ and $\mathbf{t}$).
  • $p(t \mid \mathbf{w}, x, \mathbf{x}, \mathbf{t}) = p(t \mid x, \mathbf{w})$ because the distribution of $t$ depends only on $x$ and $\mathbf{w}$, and is conditionally independent of $\mathbf{x}$ and $\mathbf{t}$ (given $x$ and $\mathbf{w}$).

It may help to draw a diagram of the dependencies.
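One way to convince yourself of the identity numerically: since everything here is Gaussian, the integral has a closed form, $p(t|x,\mathbf x,\mathbf t)=\mathscr N\big(t \mid \boldsymbol\phi(x)^T\mathbf m,\ \beta^{-1}+\boldsymbol\phi(x)^T\mathbf S\,\boldsymbol\phi(x)\big)$ where $\mathscr N(\mathbf m,\mathbf S)$ is the posterior over $\mathbf w$, and you can check it against a Monte Carlo average of $p(t|x,\mathbf w)$ over posterior samples of $\mathbf w$. A rough sketch (all data and hyperparameter values below are arbitrary choices, not from the book):

```python
import numpy as np

# Illustrative sanity check; all values below are arbitrary, not from the book.
rng = np.random.default_rng(1)
M, alpha, beta = 3, 2.0, 25.0
x_train = rng.uniform(0.0, 1.0, size=10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, beta ** -0.5, size=10)
Phi = np.vander(x_train, M + 1, increasing=True)

# Gaussian posterior over w (standard conjugate-prior result).
S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)
m = beta * S @ Phi.T @ t_train

x_new, t_new = 0.3, 0.7            # the query point (x, t)
phi = x_new ** np.arange(M + 1)    # feature vector (1, x, ..., x^M)

def gauss_pdf(t, mu, var):
    return np.exp(-0.5 * (t - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Closed-form predictive: N(t | phi^T m, 1/beta + phi^T S phi).
closed_form = gauss_pdf(t_new, phi @ m, 1.0 / beta + phi @ S @ phi)

# Monte Carlo version of the integral: average p(t | x, w) over
# samples of w drawn from the posterior p(w | x, t).
w_samples = rng.multivariate_normal(m, S, size=200_000)
monte_carlo = gauss_pdf(t_new, w_samples @ phi, 1.0 / beta).mean()
# closed_form and monte_carlo agree up to Monte Carlo error.
```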


Update:

It may help to distinguish between the dependencies in the data generating process and probabilistic dependence. My language was a little loose.

You can think of the data-generating process as the following happening under the hood:

  • $\mathbf{w}$ is drawn from the prior distribution. (However, we never see $\mathbf{w}$.)
  • $x$ and $\mathbf{x}$ are given (or drawn from their own distributions, independent of $\mathbf{w}$).
  • Given $\mathbf{w}$ and $\mathbf{x}$, draw $\mathbf{t}$ from the conditional distribution $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w})$. In particular, $\mathbf{t}$ is conditionally independent of all other variables given $\mathbf{x}$ and $\mathbf{w}$.
  • Given $\mathbf{w}$ and $x$, draw $t$ from the conditional distribution $p(t \mid x, \mathbf{w})$. In particular, $t$ is conditionally independent of all other variables given $x$ and $\mathbf{w}$.

From here, you can see that

  • $t$ is determined only by $\mathbf{w}$ and $x$ in the data-generating process. (This is what I meant when I said "the distribution of $t$ depends only on $x$ and $\mathbf{w}$" above.)
  • However, $t$ and $\mathbf{t}$ are probabilistically dependent because they are related via $\mathbf{w}$.
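A quick simulation makes this concrete. Take the simplest case $M=0$, so that $t = w + \text{noise}$: samples of a training target and the new target are correlated marginally (through the shared $w$), but become uncorrelated once you condition on, i.e. subtract out, $w$. All parameter values here are arbitrary:

```python
import numpy as np

# Toy illustration with a constant model t = w + noise (i.e. M = 0);
# all parameter values are arbitrary.
rng = np.random.default_rng(2)
alpha, beta, n = 1.0, 4.0, 100_000

w = rng.normal(0.0, alpha ** -0.5, size=n)           # w drawn from the prior
t_train = w + rng.normal(0.0, beta ** -0.5, size=n)  # one training target
t_new = w + rng.normal(0.0, beta ** -0.5, size=n)    # the new target

# Marginally, t_new and t_train are dependent: they share the hidden w.
marginal_corr = np.corrcoef(t_new, t_train)[0, 1]             # clearly nonzero

# Conditioned on w (subtract it out), they become independent.
conditional_corr = np.corrcoef(t_new - w, t_train - w)[0, 1]  # near zero
```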

The "Graphical Models" chapter of your book discusses conditional independence relationships like this in more detail; it may help to jump ahead to that chapter.

Response to comments:

  • To show $p(\mathbf{w} \mid x, \mathbf{x}, \mathbf{t}) = p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})$, note that the data-generating process implies the decomposition $$p(\mathbf{w}, \mathbf{x}, \mathbf{t}, x) = p(\mathbf{w}) p(\mathbf{x}) p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}) p(x).$$ In particular $p(\mathbf{w}, \mathbf{x}, \mathbf{t}, x) = p(\mathbf{w}, \mathbf{x}, \mathbf{t}) p(x)$. So $$p(\mathbf{w} \mid x, \mathbf{x}, \mathbf{t}) = \frac{p(\mathbf{w}, \mathbf{x}, \mathbf{t}, x)}{p(\mathbf{x}, \mathbf{t}, x)} = \frac{p(\mathbf{w}, \mathbf{x}, \mathbf{t}, x)}{\int p(\mathbf{w}, \mathbf{x}, \mathbf{t}, x) \, d\mathbf{w}} = \frac{p(\mathbf{w}, \mathbf{x}, \mathbf{t})}{\int p(\mathbf{w}, \mathbf{x}, \mathbf{t}) \, d\mathbf{w}} = p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}).$$
  • In the predictive distribution of $t$, the influence of $\mathbf{x}$ and $\mathbf{t}$ is that they give you more "information" about the hidden variable $\mathbf{w}$, which would in turn influence $t$. The integral quantifies this idea by decomposing it into the generating distribution $p(t \mid x, \mathbf{w})$ and the posterior $p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})$, which captures our updated guess about the distribution of $\mathbf{w}$ after having observed $\mathbf{x}$ and $\mathbf{t}$.
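The factorization argument in the first bullet can also be sanity-checked with a tiny discrete analogue: build an arbitrary joint of the form $p(\mathbf w)p(\mathbf x)p(\mathbf t\mid \mathbf x,\mathbf w)p(x)$ out of made-up probability tables and verify that additionally conditioning on $x$ leaves the posterior over $\mathbf w$ unchanged. A sketch, where all the tables are random and purely illustrative:

```python
import numpy as np

# Tiny discrete analogue with made-up probability tables; it checks only
# the algebra of the factorization, not the regression model itself.
rng = np.random.default_rng(3)
p_w = rng.dirichlet(np.ones(3))   # p(w):  3 possible values of w
p_X = rng.dirichlet(np.ones(2))   # p(X):  2 possible training inputs
p_x = rng.dirichlet(np.ones(2))   # p(x):  2 possible new inputs
p_T_given_Xw = rng.dirichlet(np.ones(4), size=(2, 3))  # p(T | X, w): 4 values of T

# Joint p(w, X, T, x) = p(w) p(X) p(T | X, w) p(x); axes are (w, X, T, x).
joint = (p_w[:, None, None, None]
         * p_X[None, :, None, None]
         * p_T_given_Xw.transpose(1, 0, 2)[:, :, :, None]
         * p_x[None, None, None, :])

# p(w | X, T, x): normalize the joint over w for each (X, T, x).
post_with_x = joint / joint.sum(axis=0, keepdims=True)

# p(w | X, T): marginalize x out first, then normalize over w.
joint_no_x = joint.sum(axis=3)
post_without_x = joint_no_x / joint_no_x.sum(axis=0, keepdims=True)
# post_with_x equals post_without_x for every value of x, since p(x) cancels.
```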