I am answering the below question from recitation 20, this course.
The question is:
"In your summer internship, you are working for the world’s largest producer of lightbulbs. Your manager asks you to estimate the quality of production, that is, to estimate the probability $p$ that a bulb produced by the factory is defectless. You are told to assume that all lightbulbs have the same probability of having a defect, and that defects in different lightbulbs are independent.
ii) If you test 50 light bulbs, what is the probability that your estimate is in the range $p ± 0.1$?
I went about it using the central limit theorem (CLT):
Let $D$ be the event a light bulb is defective, and $P(D) = p$. Therefore defective lightbulbs can be modelled as as r.v. $X$ where $X \sim \text{Bern}(p)$. We sample $n=50$ i.i.d draws of the r.v. $X$, $X_1, ..., X_{50}$.
Let $M_n = \frac{\sum_i X_i}{n}$ be the mean of the r.v. $X$. $E[M_n] = p$ and $\text{var}(M_n) = \frac{\text{var}(X_i)}{n}$ with $\text{var}(X_i) = \sigma^2$. $\sigma^2 = p(1-p)$ as the variance of the Bernoulli, as we don't know $p$ we take the maximum possible $\sigma^2 = \frac{1}{4}$ (when $p = \frac{1}{2}$) with $\sigma = \frac{1}{2}$.
Therefore $Z_n = \frac{M_n - p}{\sigma / \sqrt{n}}$, with $Z_N \sim N(0, 1)$ according to the CLT.
According to the question, we want to find $P(p - 0.1 \leq M_n \leq p + 0.1) = P(M_n \leq p + 0.1) - P(M_n \leq p - 0.1)$
Standardising $M_n$ as above:
$ = P(Z_n \leq \frac{0.1}{\sigma / \sqrt{n}}) - P(Z_n \leq \frac{-0.1}{\sigma / \sqrt{n}})$
$ = P(Z_n \leq 0.2\sqrt{50}) - P(Z_n \leq -0.2\sqrt{50})$
($n = 50$ as per the question). Therefore this probability should be:
$\Phi(1.414) - \Phi(-1.414) = 0.921 - 0.078 = 0.843$
I checked the question and it seems we were supposed to answer it with Chebyshev inequality, which gives a bound $\geq 0.5$, which this answer is within but I'm not sure it's correct.
I simulated with the below code, which gives a probability of ~0.79, so I think this answer is at least slightly off.
My questions are:
- is the above approach correct?
- If it is, why is there a discrepancy between the theoretical estimate and the simulation? [EDIT] I guess the distribution of $M_n$ is only well approximated by the normal as $n \rightarrow \infty$ which may explain it. Is there a way to quantify this error?
import numpy as np
n = 50
p = 0.50
MC = 100000
means = np.empty(MC)
for i in range(MC):
X = np.random.rand(n)
X = np.array(X < p, dtype=np.int32)
means[i] = np.mean(X)
in_range = np.logical_and(means > p - 0.1, means < p + 0.1)
prob = np.sum(in_range) / MC
print(prob)