Optimisation vs. Bayes' Theorem not coinciding


Suppose I have the following Bayesian Network:

[Figure: a chain-structured Bayesian network $X_1 \to X_2 \to \dots \to X_n$, with each hidden state $X_i$ emitting an observation $Z_i$.]

It's given by the following relations: $$\begin{aligned}X_1&\sim \mathcal N(\mu, 1/\sigma^2)\\ \forall k, 2\leq k\leq n: X_k|X_{k-1}&\sim \mathcal N(x_{k-1}, 1/\lambda^2)\\ \forall i, 1\leq i\leq n: Z_i|X_i&\sim\mathcal N(x_{i}, 1/\delta_i^2) \end{aligned}$$ I'm parameterising each $\mathcal N$ by its precision rather than its variance (so, e.g., $\sigma^2$ is the variance of $X_1$ and $1/\sigma^2$ its precision), for a reason that will become clear soon.

So, the above is a very simplified Kalman filter, in which there is no control-input model, both the state-transition model and the observation model are the identity, and the precision of the process noise is the same at every step.

Under that definition, $\boldsymbol X$ is the hidden variable I want to reason about, and $\boldsymbol Z$ is the set of (noisy) observations I've collected. I want to find the value $\hat{x}_n$ that maximises the distribution $p(X_n = x_n|\boldsymbol Z)$ implied by the above relations (i.e. the MAP estimate of the last value).

There are two ways I can go about doing this.


The first is with a straightforward application of Bayes' Theorem. Let $\boldsymbol X_{\{a, b, c, ...\}}$ stand for $\boldsymbol X\backslash\{X_a, X_b, X_c, ...\}$ and let $\boldsymbol Z_{\{a, b, c, ...\}}$ be defined similarly. Then:

$$\begin{aligned} &p(\boldsymbol X|\boldsymbol Z) \propto p(\boldsymbol Z, \boldsymbol X) \\ &= p(\boldsymbol Z|\boldsymbol X) * p(\boldsymbol X) \\ &=p(Z_n|X_n, \boldsymbol Z_{\{n\}}, \boldsymbol X_{\{n\}}) * p(\boldsymbol Z_{\{n\}} | \boldsymbol X) * p(X_n | X_{n-1}, \boldsymbol X_{\{n, n-1\}}) * p(\boldsymbol X_{\{n\}}) \\ &=p(Z_n|X_n) * p(X_n | X_{n-1}) * p(\boldsymbol Z_{\{n\}} | \boldsymbol X_{\{n\}}) * p(\boldsymbol X_{\{n\}}) \\ &= \left[\prod_{i=2}^np(Z_i|X_i)*p(X_i|X_{i-1})\right]*p(Z_1|X_1)*p(X_1) &(1)\\ &= \left[\prod_{i=2}^n\mathcal N(Z_i|X_i, 1/\delta_i^2)*\mathcal N(X_i|X_{i-1}, 1/\lambda^2)\right]*\mathcal N(Z_1|X_1, 1/\delta_1^2)*\mathcal N(X_1|\mu, 1/\sigma^2) &(2) \end{aligned}$$

After some manipulation of $(2)$, if we define $X_0=\mu$, it's not hard to see that:

$$\begin{aligned} p(\boldsymbol X|\boldsymbol Z) \propto \prod_{i=1}^n\mathcal N(X_i | \alpha_i * Z_i + (1 - \alpha_i) * X_{i - 1}, \beta_i) &&(3)\end{aligned}$$

Where we've defined:

$$\begin{aligned} \alpha_1 &= \frac{\sigma^2}{\sigma^2 + \delta_1^2} \\ \forall k, 2\leq k\leq n:\alpha_k&=\frac{\lambda^2}{\lambda^2+\delta_k^2} \\ \beta_1 &= \frac{1}{\sigma^2} + \frac{1}{\delta_1^2} \\ \forall k, 2\leq k\leq n:\beta_k&=\frac{1}{\lambda^2} + \frac{1}{\delta_k^2} \end{aligned}$$

Although the precisions $\beta_k$ are irrelevant for our purposes (I think?).

So, the MAP estimates for the values of $\boldsymbol X$ can be given recursively by:

$$\begin{aligned} \hat x_1 &= \alpha_1 * z_1 + (1 - \alpha_1) * \mu \\ \forall k, 2\leq k\leq n:\hat x_k &= \alpha_k * z_k + (1 - \alpha_k) * \hat x_{k-1} \end{aligned}$$
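This recursion is easy to sketch in code. Below is a minimal sketch (the function `recursive_map` and its argument names are my own, not from any library):

```python
def recursive_map(mu, zs, alphas):
    """Run the recursion  x_k = a_k * z_k + (1 - a_k) * x_{k-1},  with x_0 = mu.

    zs and alphas are equal-length sequences of observations z_k and
    mixing weights alpha_k; returns the list of estimates x_1 .. x_n.
    """
    estimates = []
    x_prev = mu
    for z, alpha in zip(zs, alphas):
        x_prev = alpha * z + (1 - alpha) * x_prev
        estimates.append(x_prev)
    return estimates
```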


The second way to solve this problem is to take $(2)$ above and turn it into an optimisation problem. In particular, since we want to maximise that quantity, we can instead treat its negative logarithm as an "error function" and minimise that.

$$\begin{aligned} &-\log(p(\boldsymbol Z, \boldsymbol X)) \\ &=-\log \left[\left[\prod_{k=2}^n\mathcal N(Z_k|X_k, 1/\delta_k^2)*\mathcal N(X_k|X_{k-1}, 1/\lambda^2)\right]*\mathcal N(Z_1|X_1, 1/\delta_1^2)*\mathcal N(X_1|\mu, 1/\sigma^2)\right] \\ &\propto \frac 1 {\sigma^2} (x_1-\mu)^2 + \frac 1 {\lambda^2} \sum_{k=2}^n(x_k-x_{k-1})^2 + \sum_{i=1}^n\frac 1 {\delta_i^2}(x_i-z_i)^2 &(4) \end{aligned}$$
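For concreteness, the error function $(4)$ can be coded up directly (a sketch; the function and argument names are my own, with the precisions passed in explicitly):

```python
def error(x, mu, zs, prec_prior, prec_proc, prec_meas):
    """Negative log joint, up to an additive constant and a factor of 1/2.

    x: candidate values x_1 .. x_n; zs: observations z_1 .. z_n;
    prec_prior = 1/sigma^2, prec_proc = 1/lambda^2,
    prec_meas[i] = 1/delta_{i+1}^2.
    """
    total = prec_prior * (x[0] - mu) ** 2
    total += prec_proc * sum((x[k] - x[k - 1]) ** 2 for k in range(1, len(x)))
    total += sum(p * (xi - zi) ** 2 for p, xi, zi in zip(prec_meas, x, zs))
    return total
```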


If the above two approaches can be used -- and it might turn out that I've made some assumption or mistake that means they can't -- then they should yield the same result.

Now, let's analyse the case I'm interested in. Suppose that there is some number $\delta$ and a sequence of numbers $\boldsymbol I = \{I_1, I_2, ..., I_n\}$ such that:

$$\forall i: \frac 1 {\delta_i^2} = \frac{I_i}{\delta^2}$$

That is, the precision of the measurement noise is proportional to these $I_i$ and the constant of proportionality is always the same.

Now, I can rewrite my objective function in $(4)$ as:

$$\frac {\delta^2} {\sigma^2} (x_1-\mu)^2 + \frac {\delta^2} {\lambda^2} \sum_{k=2}^n(x_k-x_{k-1})^2 + \sum_{i=1}^nI_i(x_i-z_i)^2$$

The reason this is not strictly the same as $(4)$ is that the $I_i$ can be any nonnegative number, including zero. These $I_i$ are effectively "relative weights" of the measurements, and a measurement can have weight zero (if, for example, I did not actually make that measurement).

With this substitution, I can rewrite the alphas:

$$\begin{aligned} \alpha_1 &= \frac{I_1}{I_1 + \delta^2 / \sigma^2} \\ \forall k, 2\leq k\leq n:\alpha_k&=\frac{I_k}{I_k+\delta^2/\lambda^2} \end{aligned}$$
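Computing the alphas from the weights $I_i$ is then a one-liner per term (again a sketch with my own names; `prior_ratio` is $\delta^2/\sigma^2$ and `proc_ratio` is $\delta^2/\lambda^2$):

```python
def alphas(Is, prior_ratio, proc_ratio):
    """alpha_1 uses prior_ratio = delta^2/sigma^2; subsequent alphas use
    proc_ratio = delta^2/lambda^2. A weight of 0 gives alpha = 0."""
    result = [Is[0] / (Is[0] + prior_ratio)]
    result += [I / (I + proc_ratio) for I in Is[1:]]
    return result
```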

Now let's take the following implementation of the BN:

[Figure: a three-node instance $X_1 \to X_2 \to X_3$ in which only $Z_1$ and $Z_3$ are observed, so $I_2 = 0$.]

According to the first approach, and since $I_2 = 0$ implies $\alpha_2 = 0$, the MAP estimates should be

$$\begin{aligned} \{&\alpha_1 * z_1 + (1 - \alpha_1) * \mu,\\ &\alpha_1 * z_1 + (1 - \alpha_1) * \mu,\\ &\alpha_3 * z_3 + (1 - \alpha_3) * \left(\alpha_1 * z_1 + (1 - \alpha_1) * \mu\right)\} \end{aligned}$$

However, if I try to optimise the following objective function:

$$\frac {\delta^2} {\sigma^2} (x_1-\mu)^2 + \frac {\delta^2} {\lambda^2} \left[(x_2-x_1)^2 + (x_3-x_2)^2\right] + I_1(x_1-z_1)^2 + I_3(x_3-z_3)^2$$

I don't get those MAP estimates. Take, for example, the case where:

$$\begin{aligned} \mu &= 1 \\ z_1 &= 0.8 \\ z_3 &= 1.2 \\ I_1 &= 400 \\ I_3 &= 500 \\ \frac {\delta^2}{\sigma^2} &= 300 \\ \frac {\delta^2}{\lambda^2} &= 4500 \end{aligned}$$

This would imply:

$$\begin{aligned} \alpha_1 &= \frac 4 7 \\ \alpha_3 &= \frac 1 {10} \\ \hat x_1 &= \frac 4 7 * 0.8 + \frac 3 7 * 1 \approx 0.89 \\ \hat x_3 &= \frac 1 {10} * 1.2 + \frac 9 {10} * 0.89 \approx 0.92 \\ \end{aligned} $$
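To see what the optimiser actually returns here: the objective is quadratic, so its minimiser solves the linear system obtained by setting the gradient to zero. A quick check with `numpy` (the system below is assembled by hand for this three-variable example):

```python
import numpy as np

# Objective: 300(x1-1)^2 + 4500[(x2-x1)^2 + (x3-x2)^2]
#            + 400(x1-0.8)^2 + 500(x3-1.2)^2
# Setting the gradient to zero gives A x = b (after dividing by 2):
A = np.array([
    [300 + 400 + 4500, -4500,          0],
    [-4500,             9000,      -4500],
    [0,                -4500, 4500 + 500],
], dtype=float)
b = np.array([300 * 1.0 + 400 * 0.8, 0.0, 500 * 1.2])

x_joint = np.linalg.solve(A, b)
# x_joint ≈ [1.0016, 1.0197, 1.0377]
```

So the joint optimum is roughly $(1.002,\ 1.020,\ 1.038)$, not $(0.89,\ 0.89,\ 0.92)$.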

But that's not actually the solution. What am I missing? Where in this derivation did I get it wrong?

Accepted answer:
After much thinking and writing things down I figured it out.

The recursive relations implied by $(3)$ are correct for the conditional MAP estimate. That is to say, the values of $\hat x_i$ I wrote out in the "first approach" section of my original post are the values that maximise their respective $p(X_i|Z_i, X_{i-1})$, conditioning on the present observation and the previous value.

The optimisation approach, however, is minimising the negative logarithm of $p(\boldsymbol X|\boldsymbol Z)$ -- the joint posterior over all variables $X_i$, conditional on the full set of measurements $\boldsymbol Z$ -- and there's no guarantee that the $i^{th}$ coordinate of the MAP vector $\hat{\boldsymbol X}$ will equal $\hat x_i$ (and, as the example above shows, in general it doesn't).

This is because, when I'm optimising the joint posterior, I'm using future values to update the past -- in my example, the fact that $z_3=1.2$ was as relevant to $X_1$ as the fact that $z_1=0.8$ was to $X_3$, and so all values in $\boldsymbol X$ are jointly optimised to maximise the posterior.


ETA:

And the actual thing that I want, the MAP/expected value (since they're the same here) of $p(X_i|\boldsymbol Z_{j\leq i})$, is:

$$\begin{aligned} \bar x_0&\equiv \mu\\ \forall i> 0:\bar x_i &\equiv \mathbb E(X_i|\boldsymbol Z_{j\leq i}) \\ &= \alpha_i * z_i + (1 - \alpha_i) * \bar x_{i- 1} \\ \forall i > 0:\alpha_i &=\frac {I_i}{I_i + \delta^2/ (\sigma_{i-1}^2 + \lambda_i^2)} \\ \sigma_0 &=\sigma \\ \lambda_1 &= 0 \\ \forall i > 0 : \sigma_i^2 &= \frac {\delta^2}{I_i + \delta^2/ (\sigma_{i-1}^2 + \lambda_i^2)} \\ \forall i > 1 : \lambda_i &= \lambda \end{aligned}$$

And with the above definitions in mind:

$$\begin{aligned} X_i|\boldsymbol Z_{j\leq i} &\sim \mathcal N(\bar x_i, 1 / \sigma_i^2) \\ \frac{1}{\sigma_i^2} &= \frac{1} {\delta_i^2} + \frac{1}{\sigma_{i-1}^2+\lambda_i^2} \end{aligned}$$

That is, the precision of the distribution of $X_i|\boldsymbol Z_{j\leq i}$ is the precision of $Z_i|X_i$ plus the inverse of the sum of the variances of $X_i|X_{i-1}$ and $X_{i-1}|\boldsymbol Z_{j < i}$.
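To tie everything together, here is a sketch of this corrected recursion (names are my own) run on the three-node example from the question, taking $\delta^2 = 1$ so that $\sigma^2 = 1/300$ and $\lambda^2 = 1/4500$. Its last estimate agrees with the last coordinate of the joint MAP, as it should: there are no observations after $X_3$, so conditioning on $\boldsymbol Z_{j\leq 3}$ is the same as conditioning on all of $\boldsymbol Z$.

```python
def filtered_estimates(mu, sigma2, lam2, delta2, zs, Is):
    """Run the corrected recursion:
        alpha_i   = I_i / (I_i + delta^2 / (sigma_{i-1}^2 + lambda_i^2))
        xbar_i    = alpha_i * z_i + (1 - alpha_i) * xbar_{i-1}
        sigma_i^2 = delta^2 / (I_i + delta^2 / (sigma_{i-1}^2 + lambda_i^2))
    with lambda_1 = 0 and lambda_i = lambda for i > 1.
    """
    xbar, var = mu, sigma2
    out = []
    for i, (z, I) in enumerate(zip(zs, Is)):
        lam2_i = 0.0 if i == 0 else lam2
        prior_prec = 1.0 / (var + lam2_i)      # precision of X_i | Z_{j<i}
        alpha = I / (I + delta2 * prior_prec)
        xbar = alpha * z + (1 - alpha) * xbar
        var = delta2 / (I + delta2 * prior_prec)
        out.append(xbar)
    return out

# Three-node example; z_2 is a placeholder since I_2 = 0 makes it irrelevant.
xbars = filtered_estimates(1.0, 1 / 300, 1 / 4500, 1.0,
                           zs=[0.8, 0.0, 1.2], Is=[400, 0, 500])
# xbars[2] ≈ 1.0377, matching the joint optimum's last coordinate.
```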