Simple example of "Maximum A Posteriori"


I've been immersing myself in Bayesian statistics at school, and I'm having a very difficult time grasping $\arg\max$ and maximum a posteriori (MAP) estimation. A quick explanation can be found here: https://www.cs.utah.edu/~suyash/Dissertation_html/node8.html

Basically, $\theta$ is a set of parameters and $x$ is the data: $P(\theta \mid x)$ (the posterior) equals $P(x \mid \theta)$ (the likelihood) multiplied by $P(\theta)$ (the prior), all divided by $P(x)$ (to normalize). I'm not sure exactly how dividing by $P(x)$ normalizes this, but that's not my main question.

$$P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}$$

Then, you maximize the posterior with argmax

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid x)$$

I believe you're maximizing over the set of parameters to find the values under which the posterior is largest ...

Can someone please give a simple example of this so I can visualize what is happening?

There are 2 answers below.

BEST ANSWER

Imagine you sent a message $S$ to your friend that is either $1$ or $0$ with probability $p$ and $1-p$, respectively. Unfortunately that message gets corrupted by Gaussian noise $N$ with zero mean and unit variance. Then what your friend would receive is a message $Y$ given by

$$Y = S + N$$

Given that your friend observed $Y$ taking a particular value $y$, that is $Y = y$, he wants to know which value of $S$ you most probably sent. In other words, he wants the value $s$ that maximizes the posterior probability

$$P(S = s \mid Y = y)$$

That last sentence can be written as

$$\hat{s} = \arg\max_s P(S = s \mid Y = y)$$

What follows is to compute $P(S = s \mid Y = y)$ for $S=1$ and $S=0$, and then pick the value of $S$ for which that probability is greater. We call that value $\hat{s}$.

It is sometimes easier to model the uncertainty about a consequence given its cause than the other way around; here, that means the distribution of $Y$ given $S$, $f_{Y \mid S}(y \mid s)$, is easier to write down than $P(S = s \mid Y = y)$. So let's first work out the former, and worry about the latter afterwards.

Given that $S=0$, $Y$ becomes equal to the noise $N$, and therefore

$$f_{Y \mid S}(y \mid 0) = \frac{1}{\sqrt{2\pi}}e^{-y^2/2}\tag{1}$$

Given that $S=1$, $Y$ becomes $Y = N + 1$ , which is just $N$ but "displaced" by $1$ unit, therefore it is also a Gaussian random variable with unit variance but with mean now equal to $1$, thus

$$f_{Y \mid S}(y \mid 1) = \frac{1}{\sqrt{2\pi}}e^{-(y-1)^2/2}\tag{2}$$
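These two densities are easy to evaluate numerically. A minimal Python sketch (the function name `likelihood` and the example value $y = 0.8$ are my own choices, not from the answer):

```python
import math

def likelihood(y, s):
    """Density of Y given S = s: a N(s, 1) Gaussian evaluated at y.

    s = 0 gives equation (1); s = 1 gives equation (2).
    """
    return math.exp(-(y - s) ** 2 / 2) / math.sqrt(2 * math.pi)

y = 0.8  # an example observed value (arbitrary)
print(likelihood(y, 0))  # f_{Y|S}(0.8 | 0)
print(likelihood(y, 1))  # f_{Y|S}(0.8 | 1), larger since 0.8 is closer to 1
```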

How do we now compute $P(S = s \mid Y = y)$? Using Bayes' rule, we have

\begin{align} P(S = 0 \mid Y = y) &= \frac{f_{Y\mid S}(y \mid 0)P(S = 0)}{f_Y(y)}\\ \end{align}

\begin{align} P(S = 1 \mid Y = y) &= \frac{f_{Y\mid S}(y \mid 1)P(S = 1)}{f_Y(y)}\\ \end{align}

We would get $\hat{s}=1$ if

$$P(S = 1 \mid Y = y) \gt P(S = 0 \mid Y = y)$$

or equivalently if

$$f_{Y\mid S}(y \mid 1)p \gt f_{Y\mid S}(y \mid 0)(1-p)\tag{3}$$

This last expression wouldn't help your friend much; what he really needs is a criterion based on the observed value of $Y$ and the known statistics. Getting there may make this example look less simple, but let's give it a shot.

Replacing $(1)$ and $(2)$ in $(3)$ and taking the natural logarithm at both sides, we get

$$-\frac{(y-1)^2}{2}+\text{log}(p) \gt -\frac{y^2}{2}+\text{log}(1-p)$$

which can be simplified to

$$y \gt \frac{1}{2} + \text{log}\left( \frac{1-p}{p} \right)\tag{4}$$

Now this is more helpful. Your friend just has to check if the observed value of $y$ satisfies that inequality to decide if $S=1$ was sent or not. In other words, if the observed value $y$ satisfies $(4)$, then the value that maximizes the posterior probability $P(S = s \mid Y = y)$ is $S=1$, and therefore $\hat{s} = 1$.
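As a sanity check, here is a small simulation that applies rule $(4)$ to noisy observations (my own sketch; the name `map_decide`, the prior $p = 0.7$, and the trial count are arbitrary choices):

```python
import math
import random

def map_decide(y, p):
    """Decide s_hat via rule (4): choose 1 iff y > 1/2 + log((1-p)/p)."""
    return 1 if y > 0.5 + math.log((1 - p) / p) else 0

random.seed(0)
p = 0.7              # prior probability that S = 1 (arbitrary choice)
trials = 100_000
correct = 0
for _ in range(trials):
    s = 1 if random.random() < p else 0   # draw the sent bit
    y = s + random.gauss(0, 1)            # corrupt it with N(0, 1) noise
    correct += map_decide(y, p) == s
print(correct / trials)  # empirical accuracy, roughly 0.75 for these settings
```

With $p = 0.7$ the threshold is $1/2 + \log(3/7) \approx -0.35$, so the decoder favors $S = 1$, exactly as the aside note below describes.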


Aside note:
The result given by $(4)$ is quite intuitive. If $0$ and $1$ are equiprobable, i.e. $p=1/2$, we choose $S=1$ when $y > 1/2$; that is, we put the threshold right in the middle of $0$ and $1$. If $1$ is more probable ($p \gt 1/2$), then $\log\left(\frac{1-p}{p}\right) \lt 0$ and the threshold moves closer to $0$, thus favoring $S=1$, which makes sense because it is the more probable value.
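One can also verify numerically that $(3)$ and $(4)$ always pick the same answer (a sketch; the common factor $\frac{1}{\sqrt{2\pi}}$ cancels on both sides of $(3)$, so it is omitted):

```python
import math
import random

def rule_3(y, p):
    """Inequality (3); the shared factor 1/sqrt(2*pi) cancels and is omitted."""
    return math.exp(-(y - 1) ** 2 / 2) * p > math.exp(-y ** 2 / 2) * (1 - p)

def rule_4(y, p):
    """Inequality (4): y > 1/2 + log((1-p)/p)."""
    return y > 0.5 + math.log((1 - p) / p)

random.seed(1)
for _ in range(10_000):
    y = random.uniform(-3.0, 4.0)
    p = random.uniform(0.01, 0.99)
    assert rule_3(y, p) == rule_4(y, p)
print("rules (3) and (4) agree on all sampled points")
```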

SECOND ANSWER

Here is a simple example that is actually useful. Suppose I am able to observe a random process whose outcomes I am modeling with a Bernoulli distribution; i.e., let $X_i \mid \theta \sim \operatorname{Bernoulli}(\theta)$ be IID with unknown parameter $\theta$, and I am interested in estimating this parameter, which I treat in the Bayesian framework as a random variable. The conjugate prior for $\theta$ is a beta distribution with hyperparameters $a, b$; i.e., we suppose that $\theta$ is a random variable with density $$\pi(\theta) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \theta^{a-1} (1-\theta)^{b-1}, \quad 0 < \theta < 1$$ representing the "true" probability of observing $X_i = 1$.

The likelihood of observing the sample $\boldsymbol x = (x_1, \ldots, x_n)$ is $$f(\boldsymbol x \mid \theta) = \prod_{i=1}^n \Pr[X_i = x_i \mid \theta] = \theta^{\sum x_i} (1 - \theta)^{n-\sum x_i},$$ or in terms of the sufficient statistic $\bar x = \frac{1}{n} \sum_{i=1}^n x_i$, we can write this as $$f(\bar x \mid \theta) = \theta^{n \bar x} (1-\theta)^{n(1-\bar x)}.$$

Our posterior distribution for $\theta$, given the sample $\boldsymbol x$ (or equivalently, the sample mean $\bar x$), is therefore proportional to $$f(\theta \mid \bar x) = \frac{f(\bar x \mid \theta)\pi(\theta)}{f(\bar x)} \propto f(\bar x \mid \theta) \pi (\theta) \propto \theta^{n\bar x + a-1}(1-\theta)^{n(1-\bar x) + b-1}.$$ This in fact shows that the choice of prior is indeed conjugate, since the posterior distribution is also beta, but the posterior hyperparameters are now $$a^* = n\bar x + a, \quad b^* = n(1-\bar x) + b.$$
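This conjugate update is a two-line computation. A minimal sketch in Python (the function name `beta_posterior` and the example sample are my own):

```python
def beta_posterior(xs, a, b):
    """Conjugate Beta-Bernoulli update: a Beta(a, b) prior plus the
    Bernoulli sample xs gives a Beta(a*, b*) posterior with
    a* = sum(xs) + a and b* = n - sum(xs) + b."""
    n, successes = len(xs), sum(xs)
    return successes + a, (n - successes) + b

# Example: 3 successes out of 4 observations with a Beta(2, 2) prior.
print(beta_posterior([1, 0, 1, 1], 2, 2))  # (5, 3)
```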

Here is an application of the above with actual numbers. Suppose I use a uniform prior, corresponding to $a = b = 1$. I observe the sample $\boldsymbol x = (1, 1, 0, 1, 0, 1)$ with $n = 6$. Then $\bar x = 4/6 = 2/3$ is the sample proportion of successes, consequently my posterior belief for the distribution of the parameter $\theta$ given the sample I observed is no longer uniform but Beta with hyperparameters $$a^* = 4+1 = 5, \quad b^* = 2+1 = 3.$$

The expectation is $$\operatorname{E}[\theta \mid \boldsymbol x] = \frac{a^*}{a^* + b^*} = \frac{5}{5+3} = \frac{5}{8},$$ which we could take as one point estimate of our belief of the true value of $\theta$.

Note, however, that this is not the same as the frequentist maximum likelihood estimate: $$\hat \theta_{\rm ML} = \bar x = \frac{2}{3}.$$ It is also not the same as the posterior mode, which is the mode of the beta distribution: $$\tilde \theta \mid \boldsymbol x = \frac{a^*-1}{a^*+b^*-2} = \frac{4}{6} = \frac{2}{3},$$ which happens to be equal to the ML estimate (and it is easy to see mathematically that the posterior mode equals the ML estimate whenever the prior is uniform).
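These numbers can be checked with exact rational arithmetic (a sketch using Python's standard-library `fractions` module):

```python
from fractions import Fraction

a, b = 1, 1                       # uniform prior Beta(1, 1)
xs = [1, 1, 0, 1, 0, 1]           # the observed sample, n = 6
a_post = sum(xs) + a              # a* = n*xbar + a = 5
b_post = len(xs) - sum(xs) + b    # b* = n*(1 - xbar) + b = 3

posterior_mean = Fraction(a_post, a_post + b_post)           # E[theta | x]
posterior_mode = Fraction(a_post - 1, a_post + b_post - 2)   # MAP estimate
mle = Fraction(sum(xs), len(xs))                             # frequentist MLE

print(posterior_mean)         # 5/8
print(posterior_mode)         # 2/3
print(posterior_mode == mle)  # True: uniform prior makes MAP equal the MLE
```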

Some comments: note that we never bothered to compute the marginal/unconditional distribution $f(\bar x)$ or $f(\boldsymbol x)$. This is the point of working only up to proportionality: if our goal is to find the parameter value that maximizes the posterior (i.e., the posterior mode), the normalizing constant does not affect the argmax. In our case, we have the added benefit of seeing that the prior is conjugate, so we know the posterior is beta, making the calculation of the marginal unnecessary; all it would do is tell us the factor by which to divide in order to turn the likelihood for $\theta$ into a proper density.
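To see concretely that the normalizing constant is irrelevant for finding the mode, one can maximize the unnormalized posterior on a grid (a sketch; the grid resolution is an arbitrary choice):

```python
# Unnormalized posterior for Beta(a* = 5, b* = 3): theta^4 * (1 - theta)^2.
# Dividing by the marginal f(xbar) would not move the argmax, so we skip it.
a_post, b_post = 5, 3

def unnormalized_posterior(theta):
    return theta ** (a_post - 1) * (1 - theta) ** (b_post - 1)

grid = [i / 10_000 for i in range(1, 10_000)]  # points in the open interval (0, 1)
theta_map = max(grid, key=unnormalized_posterior)
print(theta_map)  # close to the exact mode 2/3
```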