Using Central Limit Theorem to estimate deviation of mean estimator of sampled Bernoulli r.v.


I am answering the question below, from recitation 20 of this course.

The question is:

"In your summer internship, you are working for the world’s largest producer of lightbulbs. Your manager asks you to estimate the quality of production, that is, to estimate the probability $p$ that a bulb produced by the factory is defectless. You are told to assume that all lightbulbs have the same probability of having a defect, and that defects in different lightbulbs are independent.

ii) If you test 50 light bulbs, what is the probability that your estimate is in the range $p \pm 0.1$?"

I went about it using the central limit theorem (CLT):

Let $D$ be the event that a light bulb is defectless (matching the question's definition of $p$), with $P(D) = p$. Each lightbulb can then be modelled as a r.v. $X \sim \text{Bern}(p)$, where $X = 1$ if the bulb is defectless. We sample $n = 50$ i.i.d. draws of $X$: $X_1, \ldots, X_{50}$.

Let $M_n = \frac{1}{n}\sum_i X_i$ be the sample mean. Then $E[M_n] = p$ and $\text{var}(M_n) = \frac{\sigma^2}{n}$, where $\sigma^2 = \text{var}(X_i) = p(1-p)$ is the Bernoulli variance. Since we don't know $p$, we take the worst case $\sigma^2 = \frac{1}{4}$ (attained at $p = \frac{1}{2}$), so $\sigma = \frac{1}{2}$.
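As a quick numeric check (a small sketch, not part of the original derivation), the worst-case standard deviation of $M_n$ at $n = 50$ works out to about $0.0707$:

```python
from math import sqrt

n = 50
sigma = 0.5                # worst-case Bernoulli std dev, at p = 1/2
sd_mean = sigma / sqrt(n)  # std dev of the sample mean M_n

print(sd_mean)  # ~0.0707
```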

Therefore let $Z_n = \frac{M_n - p}{\sigma / \sqrt{n}}$; by the CLT, $Z_n$ is approximately $N(0, 1)$ for large $n$.

According to the question, we want to find $P(p - 0.1 \leq M_n \leq p + 0.1) = P(M_n \leq p + 0.1) - P(M_n \leq p - 0.1)$

Standardising $M_n$ as above:

$ = P(Z_n \leq \frac{0.1}{\sigma / \sqrt{n}}) - P(Z_n \leq \frac{-0.1}{\sigma / \sqrt{n}})$

$ = P(Z_n \leq 0.2\sqrt{50}) - P(Z_n \leq -0.2\sqrt{50})$

($n = 50$ as per the question, so $0.2\sqrt{50} = \sqrt{2} \approx 1.414$). Therefore this probability should be:

$\Phi(1.414) - \Phi(-1.414) \approx 0.9214 - 0.0786 = 0.843$
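To double-check the arithmetic: since $0.2\sqrt{50} = \sqrt{2}$, the probability is $2\Phi(\sqrt{2}) - 1 = \operatorname{erf}(1)$, which the standard library can evaluate directly (a quick sketch):

```python
from math import erf, sqrt

z = 0.2 * sqrt(50)              # standardised half-width, equals sqrt(2)
phi_z = 0.5 * (1 + erf(z / sqrt(2)))  # Phi(z), standard normal CDF
prob = 2 * phi_z - 1            # Phi(z) - Phi(-z)

print(prob)  # ~0.8427
```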

I checked the solutions and it seems we were supposed to answer it with the Chebyshev inequality, which only gives the bound $P(|M_n - p| \leq 0.1) \geq 0.5$. My answer is consistent with that bound, but I'm not sure it's correct.
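For reference, the Chebyshev bound follows from the same worst-case variance: $P(|M_n - p| \geq 0.1) \leq \frac{\sigma^2}{n \cdot 0.1^2} = \frac{0.25}{50 \cdot 0.01} = 0.5$, hence $P(|M_n - p| \leq 0.1) \geq 0.5$. A one-liner to verify:

```python
n = 50
sigma2 = 0.25  # worst-case Bernoulli variance
eps = 0.1

# Chebyshev: P(|M_n - p| >= eps) <= sigma2 / (n * eps^2)
bound = sigma2 / (n * eps**2)
print(1 - bound)  # lower bound on P(|M_n - p| <= eps): 0.5
```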

I simulated with the code below, which gives a probability of ~0.79, so I think this answer is at least slightly off.

My questions are:

  1. is the above approach correct?
  2. If it is, why is there a discrepancy between the theoretical estimate and the simulation? [EDIT] I guess the distribution of $M_n$ is only well approximated by the normal as $n \rightarrow \infty$ which may explain it. Is there a way to quantify this error?
import numpy as np

n = 50       # sample size
p = 0.50     # true probability (the worst case for the variance)
MC = 100000  # number of Monte Carlo trials

rng = np.random.default_rng()

# Each row is one trial of n Bernoulli(p) draws; take the row means.
samples = rng.random((MC, n)) < p
means = samples.mean(axis=1)

# Note the strict inequalities: means exactly equal to p - 0.1 or
# p + 0.1 are excluded.
in_range = np.logical_and(means > p - 0.1, means < p + 0.1)

prob = np.mean(in_range)
print(prob)
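One way to quantify the discrepancy (a sketch, assuming $p = 0.5$ as in the simulation): here $S = \sum_i X_i \sim \text{Bin}(50, 0.5)$ is discrete, and the strict inequalities in the simulation make the event $p - 0.1 < M_n < p + 0.1$ equal to $21 \leq S \leq 29$, excluding the atoms at $S = 20$ and $S = 30$. The exact probability can be summed directly:

```python
from math import comb

n, p = 50, 0.5

def binom_pmf(k):
    # exact Binomial(n, p) probability mass at k
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Strict inequalities (as in the simulation): 0.4 < M_n < 0.6, i.e. 21 <= S <= 29
exact_strict = sum(binom_pmf(k) for k in range(21, 30))

# Inclusive endpoints: 20 <= S <= 30
exact_incl = sum(binom_pmf(k) for k in range(20, 31))

print(exact_strict)  # ~0.797, matching the simulation
print(exact_incl)    # ~0.881
```

The continuity-free CLT estimate of 0.843 falls between these two values, so most of the gap comes from the discreteness of $S$ (a continuity correction brings the normal approximation much closer); the remaining error of the normal approximation itself can be bounded with the Berry–Esseen theorem.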