Binomial theorem estimate for very large samples


I have around $2^{105}$ balls, of which 1 in 20 is white. I expect that when I draw a random sample of them, roughly 5% of the balls drawn will be white.

What is the probability that, if I draw $2^{95}$ balls out of that set, my percentage of white balls drawn will deviate from the expected percentage by more than, say, a thousandth of a percent, i.e.

$\frac{\lvert \text{actual fraction} - 0.05 \rvert}{0.05} > 0.00001$

Calculating this directly with the binomial distribution seems impossible: a thousandth of a percent of $2^{95}$ is far too many terms to iterate through. I don't need an exact answer; an upper bound would be fine.


There are 2 best solutions below

BEST ANSWER

I think using the central limit theorem is a good approach, but you can also use Hoeffding's inequalities to get a bound that's easier to compute. As mentioned in the other answer, it's wise to approximate with a Binomial distribution since $2^{105}$ is much larger than $2^{95}$. I'm going to start out without this assumption though to illustrate how good the approximation is. The inequality one gets without this approximation gives almost exactly the same result.

If we have $N \approx 2^{105}$ balls, of which one in twenty is white, and we draw $n = 2^{95}$ balls without replacement, then the number of white balls drawn will follow a Hypergeometric distribution with parameters $N$, $K = \frac{N}{20}$, $n$. Let's call the number of white balls drawn $W$.

Let $p = \frac{K}{N} = \frac{1}{20}$.

There are some nice tail bounds for the Hypergeometric distribution that we can use. For any $t > 0$ we have

$$P(W \leq (p-t)n) \leq \exp(-n\operatorname{D}(p-t||p)) \leq \exp(-nt^2)$$ $$P(W \geq (p+t)n) \leq \exp(-n\operatorname{D}(p+t||p)) \leq \exp(-nt^2)$$

where $$D(a||b) = a\log\left(\frac{a}{b}\right) + (1-a)\log\left(\frac{1-a}{1-b}\right)$$

is the Kullback-Leibler divergence. The latter inequality in each expression follows from the inequality $D(a||b) \geq 2(a-b)^2$ (Pinsker's inequality for Bernoulli distributions), which gives $\exp(-n\operatorname{D}(p \pm t||p)) \leq \exp(-2nt^2) \leq \exp(-nt^2)$. It's nice to note that the weaker inequalities are essentially Hoeffding's inequalities for the tails of binomial random variables; these are the bounds one would derive by using a binomial approximation to the hypergeometric distribution.
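As a quick numerical sanity check (a minimal Python sketch; the helper `kl` is just for illustration), one can verify that $D(a||b) \geq 2(a-b)^2$ for the values relevant here, which is the direction needed for the tail bounds:

```python
from math import log

def kl(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

p = 1 / 20            # fraction of white balls
t = 0.00001 / 20      # absolute deviation for a 0.001% relative deviation

# D(p±t || p) should dominate 2*t^2, which is what drives the tail bounds
for a in (p - t, p + t):
    print(kl(a, p) >= 2 * (a - p) ** 2)  # -> True, True
```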

Through some algebraic manipulation, we can derive the inequality

$$P\left(\left|\frac{W}{n} - p\right| > t\right) \leq \exp(-n\operatorname{D}(p-t||p)) + \exp(-n\operatorname{D}(p+t||p)) \leq 2\exp(-nt^2)$$

In the specific case you mentioned, $n = 2^{95}$, $p = \frac{1}{20}$, and $t = \frac{0.00001}{20} = 5 \times 10^{-7}$.

As mentioned in the other answer, the probability is very near zero: $$2\exp\left(-2^{95}\left(\frac{0.00001}{20}\right)^2\right) = 2\exp\left(-9.9 \times 10^{15}\right) \approx 0$$
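To see just how small this is, here's a short Python sketch (standard library only) evaluating the exponent in the bound; the bound itself underflows to zero in double precision:

```python
from math import exp

n = 2**95
t = 0.00001 / 20   # = 5e-7

exponent = n * t * t           # about 9.9e15
bound = 2 * exp(-exponent)     # underflows to exactly 0.0 in IEEE-754 doubles

print(exponent > 1e15)  # -> True
print(bound)            # -> 0.0
```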


Since $2^{95}$ is small compared to $2^{105}$, we can make the approximation that we are sampling with replacement: let $$X_1,...,X_n \overset{iid}{\sim}\text{Bernoulli}(p)$$ where $n=2^{95}$, $p=\frac{1}{20}$.
Then their sample mean $\bar{X}:=\frac{1}{n} \sum_{i=1}^n X_i$ has mean and variance: $$E(\bar{X})=p, \quad \sigma^2 := Var(\bar{X})=\frac{p(1-p)}{n}$$ Since the sample size $n$ is so large, we may safely approximate the random variable $\bar{X}$ with a normal random variable with the same mean and variance.
The probability you are interested in approximating is $$P(|\bar{X}-p|>t)$$where in your case $t=5 \cdot 10^{-7}$.
Converting to standard units, we're interested in $$P\left(|Z|>\frac{t}{\sigma}\right)=2 \cdot P\left(Z>\frac{t}{\sigma}\right)=2\left(1-P\left(Z \leq \frac{t}{\sigma}\right)\right)$$ where $Z$ is standard normal; the first equality follows from the symmetry of the normal curve.
The quantity $P(Z \leq s)$ is easily found for any value of $s$, either from a normal table or with, say, Wolfram Alpha. Here $\frac{t}{\sigma} = t\sqrt{\frac{n}{p(1-p)}} \approx 4.6 \times 10^{8}$, so we are asking for a deviation of hundreds of millions of standard deviations; the final answer is $\approx 0$.
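The calculation above can be carried out with the standard library's `erfc`, using the identity $P(|Z| > z) = \operatorname{erfc}(z/\sqrt{2})$; a minimal sketch:

```python
from math import sqrt, erfc

n = 2**95
p = 1 / 20
t = 5e-7

sigma = sqrt(p * (1 - p) / n)  # standard deviation of the sample mean
z = t / sigma                  # roughly 4.6e8 standard deviations

# P(|Z| > z) = erfc(z / sqrt(2)); underflows to 0.0 for z this large
print(z > 1e8)            # -> True
print(erfc(z / sqrt(2)))  # -> 0.0
```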