How fast does the beta distribution converge to mean of a Bernoulli random variable?


(not sure if the title is mathematically correct, but hopefully the following description is clear)

Assume we are drawing samples from a Bernoulli distribution with an unknown parameter $p$, $X_n\sim \text{Bernoulli}(p)$, and using the samples to update a beta distribution on the true value of $p$.

The quantity $\hat p=\frac{\alpha}{\alpha+\beta}$ should approach $p$ as the number of samples $n$ increases.

Question: How fast does $\hat p$ converge to $p$ as a function of $n$ (may also depend on other quantities: prior, true parameter)?

On BEST ANSWER

We have a sequence of independent variables $X_1, X_2, X_3 \ldots \sim \text{Bernoulli}(p),$ and are interested in using the sequence of random variables $ Y_n = \frac{X_1 + X_2 + \ldots + X_n}{n}$ (which is essentially the quantity $\hat{p}$ in your post, once the fixed prior contribution to $\alpha$ and $\beta$ becomes negligible) as an estimator of the true parameter $p.$ The Law of Large Numbers indicates that $Y_n$ converges almost surely to $p,$ and your question asks for quantitative bounds on this.

In other words, you know that the distribution of $Y_n$ becomes very concentrated near the mean $\mathbb{E}[Y_n]=p$ but you are looking for precise bounds on this concentration given by inequalities of the form $\mathbb{P}(|Y_n-p| < \epsilon) > t$ where $\epsilon>0$ represents an error interval on your knowledge of $p$ and $t$ is a threshold probability. These are known as Concentration Inequalities.

The most commonly known of these is Chebyshev's inequality, which applied to this situation gives $$ \mathbb{P} \left( |Y_n - p| > k \sqrt{ \frac{p(1-p)}{n}} \right) \leq 1/k^2.$$ Noting that $p(1-p)$ takes a maximum value of $1/4$ at $p=1/2$ we can say

$$ \mathbb{P} \left( |Y_n - p| > \frac{k}{2\sqrt{n}} \right) \leq 1/k^2.$$

For a threshold probability of $t$ we can take $k=\frac{1}{\sqrt{1-t}}.$ To bound $p$ to an error within $\epsilon$ we must have $\frac{k}{2\sqrt{n}} < \epsilon$ or $n > \frac{1}{4\epsilon^2(1-t)}.$ For example, if we take $n=2500$ samples then $Y_{2500}$ has probability at least $t=0.96$ of being within $\epsilon = 0.05$ of the true parameter $p.$
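The Chebyshev-based sample size above can be sketched as a small helper (the function name `chebyshev_n` is mine, not from the answer):

```python
import math

def chebyshev_n(eps, t):
    """Smallest n for which Chebyshev's inequality guarantees
    P(|Y_n - p| < eps) > t, using the worst case p(1-p) <= 1/4."""
    # From n > 1 / (4 * eps^2 * (1 - t))
    return math.ceil(1 / (4 * eps ** 2 * (1 - t)))

print(chebyshev_n(0.05, 0.96))  # 2500
```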

Much better estimates are given by Hoeffding's inequality, which in our situation gives the inequality $$\mathbb{P}(|Y_n - p| > \epsilon) < 2 \exp( - 2n \epsilon^2).$$ From this we deduce that $$\mathbb{P}(|Y_n - p| < \epsilon) > t$$ for $n > \frac{1}{2\epsilon^2} \log(2/(1-t)).$ So actually we only need $n=783$ samples before we know $p$ to within $0.05$ with probability at least $0.96$. Also, note that this bound on $n$ does not depend on $p.$
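The Hoeffding sample size can be computed the same way (again, the helper name is mine):

```python
import math

def hoeffding_n(eps, t):
    """Smallest n for which Hoeffding's inequality guarantees
    P(|Y_n - p| < eps) > t, independent of the true p."""
    # From n > log(2 / (1 - t)) / (2 * eps^2)
    return math.ceil(math.log(2 / (1 - t)) / (2 * eps ** 2))

print(hoeffding_n(0.05, 0.96))  # 783
```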

Some extra comments:

  1. $ p \approx 1 - \epsilon'$ : If we make an additional assumption that $p$ is quite close to $1$ then we can use Chebyshev's inequality as we did above but with a better bound for $p(1-p)$ to get a bound on $n$ tighter than the one given by Hoeffding's inequality. The crossover point where this is fruitful is something around $p > 0.925.$
  2. $ p \approx 1/2 + \epsilon'$ : The Central Limit Theorem states that $\sqrt{\frac{n}{p(1-p)}} ( Y_n - p)$ converges in distribution to a standard normal. A bound on how quickly this occurs is given by the Berry-Esseen Theorem. If we make an additional assumption that $p$ is not too far from $1/2$ then the convergence is rapid enough that the Berry-Esseen theorem can give a tighter bound on $n$ for a given $t, \epsilon$ than the one given by Hoeffding's inequality. The reason we need $p$ close to $1/2$ is to control the term $$\rho/\sigma^3 = \frac{p^2 + (1-p)^2}{\sqrt{p(1-p)}}.$$ This grows quite large as we go away from $1/2.$ Some back of the envelope calculations suggest that if we assume that $p<0.6$ then for $t= 0.96, \ \epsilon = 0.05$ we get $n\approx 500$ with this method, a little better than $n=783$ that we had from Hoeffding's inequality.
  3. It can be shown that it is not possible to improve these bounds too much further. Statements quantifying this would be of the form $\mathbb{P}(|Y_n-p| > \epsilon) > t,$ which is an example of an "Anti-concentration inequality".
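Since both bounds are worst-case, a quick Monte Carlo sanity check shows the actual tail probability at $n=783$ sits well below the Hoeffding guarantee of $1-t=0.04$ (the choice $p=0.3$ and the helper name are mine, purely for illustration):

```python
import random

def tail_prob(p, n, eps, trials=2000, seed=0):
    """Monte Carlo estimate of P(|Y_n - p| > eps) for i.i.d. Bernoulli(p) draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y_n = sum(rng.random() < p for _ in range(n)) / n
        if abs(y_n - p) > eps:
            hits += 1
    return hits / trials

# Hoeffding guarantees this is at most 0.04 at n = 783;
# empirically it is far smaller, since the bound ignores the true p.
print(tail_prob(p=0.3, n=783, eps=0.05))
```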

Response to comments below:

There are two factors at play here which affect how tight our bounds are. The first is what is actually true of the distribution - i.e. the inherent minimum number of samples we must collect before we can say we know $p$ to within $\epsilon$ with probability at least $t.$ We are trying to find this number, but if the true number for a certain $t,\epsilon$ is $400,$ then no matter what estimates we make, we cannot improve past that point. If the distribution of $Y_n$ is very spread out (in this case, $p$ near $1/2$), we inherently require more samples to determine $p$ to a certain accuracy than if $Y_n$ were very concentrated ($p$ near $1$ or $0$).

The second factor affecting how tight our estimate is, is our skill at estimating the distribution of $Y_n$ (in particular, how much of the distribution is concentrated near its mean). Using a more sophisticated analysis will let you make better estimates. Chebyshev's inequality is a very simple bound on how concentrated a distribution must be around its mean, but because of that it is not always the strongest statement. The situations where it becomes a stronger statement are precisely when $\sigma$ is very small, which you can see from the inequality itself : $$\mathbb{P}( |X - \mu | > k \sigma) \leq 1/k^2.$$ All other things being equal, if $\sigma$ becomes smaller then the inequality becomes a stronger statement. That is why we should expect this inequality to become a strong statement when $p$ is near $0$ or $1$ (i.e. when the variance $p(1-p)$ is small).

To summarize a bit:

  • Hoeffding's inequality, by carrying out a more detailed analysis, is able to give a better estimate than Chebyshev's inequality for most values of $p.$

  • When $p$ is near $0$ or $1$ (the first factor above), the variance is very small, so fewer samples are inherently required, and Chebyshev's inequality is able to capture this to give better bounds than Hoeffding's.

  • When $p$ is near $1/2,$ the variance is high and the true minimum number of samples is higher, but related to the second factor above, we can get a better bound using the Berry-Esseen theorem because our ability to estimate the distribution becomes much better (in this case, $Y_n$ rapidly starts to behave like a normal distribution by the Central Limit Theorem).

PS - As stated on the Wiki page, the constant $C$ in the Berry-Esseen inequality can be taken to be $C=1/2.$ And yes, the statements of $p$ being near $1/2$ should be two sided. I thought that you were assuming that $p\geq 1/2$ as you stated in another related question of yours, but I just realised that you didn't make that assumption here.


Response to final comment:

I assume you have the viewpoint that there is some true parameter $p$ and we wish to determine this in some way. One approach, the one discussed in this answer, is running enough trials $n$ until we know that $p$ lies within $\epsilon$ distance from $Y_n$ with at least probability $t.$

Another approach is modelling the potential $p$ with a Beta distribution and then computing the posterior distribution after viewing some samples. With a specific such posterior $f_n$, we could calculate the area between $\mu - \epsilon$ and $\mu+\epsilon$ : $A = \int^{\mu + \epsilon}_{\mu-\epsilon} f_n(x)\, dx$ to make the statement "The probability that $p$ lies within $\epsilon$ distance of $\mu$ is $A$ ".
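This posterior credible mass $A$ can be estimated by sampling from the Beta posterior; a minimal sketch, where the uniform Beta(1,1) prior and the counts 300/1000 are my own illustrative choices:

```python
import random

def credible_mass(alpha, beta, eps, samples=50000, seed=0):
    """Monte Carlo estimate of A = P(|p - mu| < eps) under a Beta(alpha, beta)
    posterior, where mu = alpha / (alpha + beta) is the posterior mean."""
    rng = random.Random(seed)
    mu = alpha / (alpha + beta)
    inside = sum(abs(rng.betavariate(alpha, beta) - mu) < eps
                 for _ in range(samples))
    return inside / samples

# Uniform Beta(1, 1) prior updated with 300 successes out of 1000 draws:
# posterior is Beta(301, 701), which is tightly concentrated near 0.3.
print(credible_mass(1 + 300, 1 + 700, eps=0.05))
```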

The first approach gives statements that are true no matter what. The second approach assumes an additional belief that the possible $p$ is distributed like a Beta distribution. An additional complication is that you don't actually know in advance what the posterior after $n$ samples will be. If you assume, as in your other question, that you yourself know the true $p$ and you are playing some game against an adversary, then you would know the range of your adversary's possible posterior distributions, and with what probability they hold each of those distributions. So for example, if you know they will make some decision/computation after $n$ samples based on their best estimate of $p$ at the time, then you can compute the expected outcome of their decision. If you could describe in maximum detail what scenario you are imagining in your two-person game in the other question, I can try to elaborate there.