Suppose that we are interested in learning the proportion $\theta$ of a population with a particular property (for instance, the fraction of the population who are male). Suppose that we randomly sample $n$ members of this population (with replacement, to make things easier) and observe that $y$ of them have the property (so the fraction of the sample with the property is $y/n$). We start with a continuous prior $p(\theta)$ with full support on $[0, 1]$ and update it using Bayes' rule.
Question: does the expected value of the posterior always lie between the prior expectation and the sample fraction $y/n$?
Comments: I know this is true in the case where my prior takes the form of a beta distribution (with parameters $\alpha$, $\beta$). In that case, we know that the prior expectation is $$ \mathbb{E}[\theta] = \frac{\alpha}{\alpha + \beta} $$
Due to random sampling, the probability that $y$ of the $n$ draws have the property is $$ P(\text{data}\mid\theta) = \binom{n}{y} \theta^y (1 - \theta)^{n-y} $$ which means (see, e.g., here) that the posterior $p(\theta\mid\text{data})$ is also beta distributed, with expected value $$ \mathbb{E}[\theta\mid\text{data}] = \frac{\alpha + y}{\alpha + \beta + n} $$ Moreover, this is a convex combination of the sample fraction and the prior expectation, $$ \frac{\alpha + y}{\alpha + \beta + n} = \frac{n}{\alpha + \beta + n} \cdot \frac{y}{n} + \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \frac{\alpha}{\alpha + \beta}, $$ so it lies between $\frac{y}{n}$ and $\frac{\alpha}{\alpha + \beta}$, strictly so provided that $y/n \neq \alpha / (\alpha+\beta)$. So the posterior expectation does indeed lie between the prior expectation and the sample fraction in this instance.
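As a quick sanity check, the conjugate update is easy to verify with exact rational arithmetic; here's a small Python sketch (the particular prior parameters and data are arbitrary numbers of mine):

```python
from fractions import Fraction

def beta_posterior_mean(alpha, beta, y, n):
    """Posterior mean of a Beta(alpha, beta) prior after observing
    y successes in n Bernoulli trials (the conjugate update)."""
    return Fraction(alpha + y, alpha + beta + n)

# Hypothetical numbers: Beta(2, 3) prior, 7 successes in 10 trials.
prior_mean = Fraction(2, 2 + 3)               # 2/5
sample_frac = Fraction(7, 10)                 # 7/10
post_mean = beta_posterior_mean(2, 3, 7, 10)  # 9/15 = 3/5
assert prior_mean < post_mean < sample_frac   # sandwiched, as claimed
```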
To now examine the more general case, consider an arbitrary smooth prior $p(\theta)$ with full support on $[0, 1]$ and expected value $\mathbb{E}[\theta]$. By Bayes' theorem, the posterior is \begin{split} p(\theta\mid\text{data}) &= \frac{P(\text{data}\mid\theta)p(\theta)}{P(\text{data})} \\ &= \frac{\binom{n}{y} \theta^y (1 - \theta)^{n-y}p(\theta)}{\int_0^1 P(\text{data}\mid\theta)p(\theta) \, d\theta} \\ &= \frac{\binom{n}{y} \theta^y (1 - \theta)^{n-y}p(\theta)}{\int_0^1 \binom{n}{y} \theta^y (1 - \theta)^{n-y}p(\theta) \, d\theta} \\ &= \frac{\theta^y (1 - \theta)^{n-y}p(\theta)}{\int_0^1 \theta^y (1 - \theta)^{n-y}p(\theta) \, d\theta} \end{split}
and so the posterior expectation is \begin{split} \mathbb{E}[\theta\mid\text{data}] &= \int_0^1 \theta \, p(\theta\mid\text{data}) \, d\theta \\ &= \int_0^1 \frac{\theta^y (1 - \theta)^{n-y}p(\theta)}{\int_0^1 \theta^y (1 - \theta)^{n-y}p(\theta) \, d\theta} \, \theta \, d\theta \\ &= \frac{\int_0^1 \theta^{y+1} (1 - \theta)^{n-y}p(\theta) \, d\theta}{\int_0^1 \theta^y (1 - \theta)^{n-y}p(\theta) \, d\theta} \\ &= \frac{\mathbb{E}[\theta^{y+1} (1 - \theta)^{n-y}]}{\mathbb{E}[\theta^y (1 - \theta)^{n-y}]} \end{split} where the expectations in the last line are taken with respect to the prior.
It remains to show that $$ \frac{\mathbb{E}[\theta^{y+1} (1 - \theta)^{n-y}]}{\mathbb{E}[\theta^y (1 - \theta)^{n-y}]} $$ lies between $\mathbb{E}[\theta]$ and $\frac{y}{n}$; but this is where things get tricky. Very heuristically (apologies for what is to come!), one might try to obtain one bound using an argument like \begin{split} \frac{\mathbb{E}[\theta^{y+1} (1 - \theta)^{n-y}]}{\mathbb{E}[\theta^y (1 - \theta)^{n-y}]} &\geq \frac{\mathbb{E}[\theta^{y+1}] \mathbb{E}[(1 - \theta)^{n-y}]}{\mathbb{E}[\theta^y] \mathbb{E}[(1 - \theta)^{n-y}]} \\ &\geq \frac{\mathbb{E}[\theta]^{y+1} \mathbb{E}[(1 - \theta)]^{n-y}}{\mathbb{E}[\theta]^y \mathbb{E}[(1 - \theta)]^{n-y}} \\ &= \mathbb{E}[\theta] \end{split} ...although, quite aside from the very dubious feel of this argument ($\theta^{y+1}$ and $(1 - \theta)^{n-y}$ are certainly not independent, so the first step is unjustified), it doesn't give us the $y/n$ bound. Any help would be much appreciated!
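In case it's useful, the ratio-of-moments formula above is easy to evaluate numerically for an arbitrary prior; here is a small Python sketch (the midpoint rule and grid size are arbitrary choices of mine) that reproduces the conjugate beta answer as a sanity check:

```python
def posterior_mean(prior_pdf, y, n, grid=200_000):
    """E[theta | y successes in n trials] for an arbitrary prior density
    on [0, 1], using the ratio of prior moments derived above and a
    midpoint-rule approximation to the two integrals."""
    num = den = 0.0
    for i in range(grid):
        t = (i + 0.5) / grid
        weight = t**y * (1 - t)**(n - y) * prior_pdf(t)
        num += t * weight
        den += weight
    return num / den

# Sanity check against the conjugate formula: for a Beta(2, 3) prior
# (density 12*t*(1-t)^2) the posterior mean should be (2+7)/(2+3+10) = 0.6.
print(posterior_mean(lambda t: 12 * t * (1 - t)**2, y=7, n=10))  # ≈ 0.6
```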
Nice question! Sadly it is not true. I'll talk in terms of using coin flips to determine the bias $\theta$ of a weighted coin, where $\theta$ is the probability that the coin flips heads. Here's the idea: consider a prior in which we assign very high probability to values of $\theta$ very close to either $0$ or $1$ and very low probability otherwise, such that the prior expectation is $\frac{1}{2}$. Now suppose we see, say, $\frac{2}{3}$ heads in the sample. This is very unlikely if $\theta$ is close to $0$ and much more likely if it's close to $1$, so the posterior distribution is concentrated on values of $\theta$ close to $1$; in particular the posterior mean is close to $1$, potentially larger than both the prior expectation $\frac{1}{2}$ and the sample fraction $\frac{2}{3}$.
You can see hints of this already in your beta distribution calculation: the inequalities you want assume that $\alpha$ and $\beta$ are positive, and are false for, say, $\alpha = 0, \beta = - \frac{1}{2}$. Of course the beta integral does not converge in this case but this gives us an idea of what to look for.
Formally, take $H$ (for "height") to be a large positive constant and $w$ (for "width") and $\epsilon$ to be small positive constants, and consider the "triangular" prior with probability density function
$$p(\theta) = \begin{cases} H \left( 1 - \frac{\theta}{w} \right) + \epsilon & \text{ if } 0 \le \theta \le w \\ \epsilon & \text{ if } w \le \theta \le 1 - w \\ H \left( 1 - \frac{1-\theta}{w} \right) + \epsilon & \text{ if } 1 - w \le \theta \le 1. \end{cases}$$
We have $\int_0^1 p(\theta) \, d \theta = Hw + \epsilon$ so this is a pdf as long as $Hw + \epsilon = 1$. It's symmetric about $\frac{1}{2}$, so the prior expectation $\mathbb{E}(\theta)$ is $\frac{1}{2}$.
Now suppose we flip $3$ coins and $2$ of them are heads. Then the posterior density is the normalization of $\theta^2 (1 - \theta) p(\theta)$, and so the posterior expectation is $\frac{\mathbb{E}(\theta^3(1 - \theta))}{\mathbb{E}(\theta^2(1 - \theta))}$. This is a slightly tedious but doable calculation which I will punt to WolframAlpha; we get
$$\mathbb{E}(\theta^2(1 - \theta)) = \left( -\frac{w^3}{12} + \frac{w^2}{6} \right) H + \frac{\epsilon}{12}$$ $$\mathbb{E}(\theta^3(1 - \theta)) = \left( - \frac{w^5}{15} + \frac{w^4}{5} - \frac{w^3}{4} + \frac{w^2}{6} \right) H + \frac{\epsilon}{20}$$
so the posterior expectation is their quotient. This is a bit annoying to write out in full, so let's just talk about how it behaves asymptotically. If $w$ is small then the polynomials in $w$ above are dominated by their terms with smallest exponent, namely $\frac{w^2}{6}$; in both cases this term comes from the portion of the integral corresponding to $\theta \approx 1$, where $\theta^n(1 - \theta) \approx 1 - \theta$, so, importantly, this portion of the integral approximately does not depend on the exponent $n$ of $\theta$. Substituting $H = \frac{1 - \epsilon}{w}$ gives, for both $w$ and $\epsilon$ small,
$$\mathbb{E}(\theta^2(1 - \theta)) = \frac{w}{6} + \frac{\epsilon}{12} + O(w^2 + \epsilon w)$$ $$\mathbb{E}(\theta^3(1 - \theta)) = \frac{w}{6} + \frac{\epsilon}{20} + O(w^2 + \epsilon w)$$
so we see that by taking $w$ to be small but $\epsilon$ to be much smaller we can arrange for the posterior expectation to be arbitrarily close to $1$, and in particular not in the interval $\left[ \frac{1}{2}, \frac{2}{3} \right]$, as expected. To be concrete we can take, say, $w = 0.01, \epsilon = 0.0001$.
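To double-check all of this numerically, here is a Python sketch (a simple midpoint rule; the grid size is an arbitrary choice) that integrates against the triangular prior directly and confirms both the moment formulas above and the behavior at the concrete values $w = 0.01$, $\epsilon = 0.0001$:

```python
def triangular_prior(theta, H, w, eps):
    """The 'triangular' prior density above (a valid pdf when H*w + eps = 1)."""
    if theta <= w:
        return H * (1 - theta / w) + eps
    if theta <= 1 - w:
        return eps
    return H * (1 - (1 - theta) / w) + eps

def prior_moment(k, H, w, eps, grid=200_000):
    """E[theta^k * (1 - theta)] under the triangular prior (midpoint rule)."""
    total = 0.0
    for i in range(grid):
        t = (i + 0.5) / grid
        total += t**k * (1 - t) * triangular_prior(t, H, w, eps)
    return total / grid

w, eps = 0.01, 0.0001
H = (1 - eps) / w

m2 = prior_moment(2, H, w, eps)
m3 = prior_moment(3, H, w, eps)

# The numerical moments agree with the closed forms quoted above...
assert abs(m2 - ((-w**3 / 12 + w**2 / 6) * H + eps / 12)) < 1e-7
assert abs(m3 - ((-w**5 / 15 + w**4 / 5 - w**3 / 4 + w**2 / 6) * H + eps / 20)) < 1e-7

# ...and the posterior mean indeed escapes [1/2, 2/3].
print(m3 / m2)  # ≈ 0.988
```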
One conceptual upshot here is that when the prior is very "lopsided" like this, the prior mean is not a good summary of it; and when the prior is concentrated on two very different hypotheses, apparently small amounts of evidence can tilt the balance between them dramatically.
Edit: We can write down a simpler counterexample if we don't require the prior to have a continuous pdf or full support. Namely, we can take a discrete prior which assigns probability $\frac{1}{2}$ to each of the two points $\theta = w$ and $\theta = 1 - w$, where again $w$ is a small positive constant, and probability $0$ elsewhere. The prior mean is still $\frac{1}{2}$ by symmetry. Now if we flip $3$ coins and $2$ of them land heads the posterior mean is
$$\frac{w^3 (1 - w) + w (1 - w)^3}{w^2 (1 - w) + w (1 - w)^2} = \frac{w^2 + (1 - w)^2}{w + (1 - w)} = 1 - 2w + 2w^2$$
and we again see, very straightforwardly this time, that as $w \to 0$ the posterior mean gets arbitrarily close to $1$. Here we can actually determine explicitly the values of $w$ for which the posterior mean is strictly larger than $\frac{2}{3}$: solving $1 - 2w + 2w^2 = \frac{2}{3}$, it occurs for all $0 < w < \frac{3 - \sqrt{3}}{6} \approx 0.211$.
Even without that explicit calculation, we can at least check that the numerator and denominator are both $w + O(w^2)$, and so have the same growth rate as $w \to 0$; hence their quotient tends to $1$.
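And here is a quick numerical check of the discrete counterexample in Python, recomputing the exact cutoff below which the posterior mean exceeds the sample fraction (the test points are arbitrary choices of mine):

```python
import math

def discrete_posterior_mean(w):
    """Posterior mean for the two-point prior on {w, 1 - w} (probability 1/2
    each) after observing 2 heads in 3 flips."""
    num = w**3 * (1 - w) + (1 - w)**3 * w
    den = w**2 * (1 - w) + (1 - w)**2 * w
    return num / den

# Agrees with the closed form 1 - 2w + 2w^2, e.g. at w = 0.1:
assert abs(discrete_posterior_mean(0.1) - 0.82) < 1e-12

# Setting 1 - 2w + 2w^2 = 2/3 gives w = (3 ± sqrt(3))/6, so for small w the
# posterior mean exceeds 2/3 exactly when w is below the smaller root.
threshold = (3 - math.sqrt(3)) / 6
print(round(threshold, 3))  # 0.211
assert discrete_posterior_mean(threshold - 1e-6) > 2 / 3
assert discrete_posterior_mean(threshold + 1e-6) < 2 / 3
```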