Finding sample size using Central Limit Theorem

Q. A researcher wants a single estimate of the probability of monkeys having a disease. Using a state-of-the-art blood scanner, she determines if the monkey has a disease or not. Based on her data, she obtains an estimate, $\widehat{p}$, for the actual probability, $p$, of monkeys having the disease. What is the minimum sample size needed to ensure a $\geq 99\%$ certainty that the difference between $\widehat{p}$ and $p$ is $\leq 10\%$?

Now, I have seen a similar question with a worked solution:

In August 2013, the New York Times reported that a recent poll indicated that 52 percent of the population was in favor of the job performance of President Obama, with a margin of error of $\pm 4$ percent. What does this mean? Can we infer how many people were questioned?

Solution. It has become common practice for the news media to present 95 percent confidence intervals. Since $z_{.025} = 1.96$, a 95 percent confidence interval for $p$, the percentage of the population that is in favor of President Obama’s job performance, is given by:

$$\widehat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n} = .52 \pm 1.96\sqrt{.52(.48)/n} \tag{1}$$ where $n$ is the size of the sample. Since the “margin of error” is $\pm 4$ percent, it follows that $$1.96\sqrt{.52(.48)/n} = .04 \tag{2}$$ or $$ n = \frac{1.96^2(0.52)(0.48)}{(0.04)^2}=599.29 \tag{3}$$

My questions:

  1. It seems like the only difference in this question is that I don’t actually have the estimate $\hat{p}$ to plug into $(1)$, so can I still solve this question using the approach above?

  2. What is the intuition behind equation $(1)$ and how does this tie into C.L.T.? Yes, I can see that $z_{0.025} = 1.96$ is being used, but where are the remaining pieces of the puzzle, namely $\sqrt{\hat{p}(1-\hat{p})/n}$, coming from?

Best answer:

To answer your first question: yes, you can still use the approach you quoted, with one modification: you solve for $n$ in terms of $\hat p$, then maximize this as a function of $\hat p \in (0,1)$. In particular, $$n = \frac{(2.57583)^2 \hat p (1 - \hat p)}{(0.10)^2},$$ since we are using a $99\%$ confidence interval and a desired margin of error of at most $10\%$. Since $f(x) = x(1-x)$ is maximized at $x = 1/2$, it follows that $n$ is maximized when $\hat p = 1/2$, which gives $n = (2.57583)^2(0.25)/(0.10)^2 \approx 165.87$. So when $n \ge 166$, the sample size will be sufficient to meet the desired criteria no matter what value $\hat p$ ends up being.
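The worst-case calculation above can be sketched in a few lines of Python. This is only an illustration under the normal approximation; the function name `min_sample_size` and its parameters are my own choices, not anything from the question.

```python
import math
from statistics import NormalDist

def min_sample_size(confidence: float, margin: float, p: float = 0.5) -> int:
    """Smallest n with z * sqrt(p(1-p)/n) <= margin (normal approximation).

    p defaults to 1/2, the worst case, because p(1-p) is largest there.
    """
    # Two-sided critical value, e.g. about 2.5758 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(min_sample_size(0.99, 0.10))  # worst case over p, prints 166
```

Calling `min_sample_size(0.95, 0.04, p=0.52)` reproduces the poll example from the question, with the result rounded up to 600.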

For the second question, the central limit theorem is what lets us approximate the confidence interval for a binomial proportion using a normal distribution. If $X \sim \operatorname{Binomial}(n,p)$ counts the number of events of interest (in your case, monkeys with disease), then $\hat p = X/n$ is the sample proportion of monkeys with disease, and $$\operatorname{E}[\hat p] = \operatorname{E}[X/n] = np/n = p,$$ meaning that $X/n$ is an unbiased estimator of the true proportion of monkeys with disease. The variance of this estimator is $$\operatorname{Var}[\hat p] = \operatorname{Var}[X/n] = \operatorname{Var}[X]/n^2 = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}.$$ Thus the CLT lets us approximate the sampling distribution of the sample proportion as a normal distribution with mean $\mu = p$ and standard deviation $$\sigma = \sqrt{\frac{p(1-p)}{n}}.$$ Since $p$ is unknown, we estimate $\sigma$ by substituting $\hat p$ for $p$, which gives the standard error $\sqrt{\hat p(1-\hat p)/n}$ that appears in your formula. Then we compute the confidence interval and sample size in accordance with this approximation, leading to the formulas you provided.
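If you want to convince yourself of the mean and variance formulas numerically, a small simulation is enough. The sketch below is only a sanity check; the helper name `simulate_p_hat` and the values $n = 166$, $p = 0.3$ are illustrative assumptions, not part of the question.

```python
import random
import statistics

def simulate_p_hat(n, p, trials, seed=0):
    """Draw `trials` sample proportions, each from n Bernoulli(p) observations."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(trials)]

n, p = 166, 0.3  # illustrative values, not taken from the question
samples = simulate_p_hat(n, p, trials=5000)

# The empirical mean should sit near p, and the empirical variance near p(1-p)/n.
print("mean of p-hat:    ", statistics.fmean(samples), "vs p =", p)
print("variance of p-hat:", statistics.pvariance(samples), "vs p(1-p)/n =", p * (1 - p) / n)
```

A histogram of `samples` should also look roughly bell-shaped, which is the CLT part of the argument.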

That said, when $n$ is small or $p$ is close to $0$ or $1$, this approach will have problems. In the small-$n$ case, the normal approximation from the CLT is not reliable; instead, you will want to use a searching method ("plug-and-chug") to compute the minimum $n$ that has the desired properties.
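One way to read the "plug-and-chug" suggestion is to compute the coverage probability $\Pr(|\hat p - p| \le 0.10)$ directly from the binomial distribution and search over $n$. The sketch below is my interpretation, not a standard recipe: the function names, the grid of $p$ values, and the floating-point tolerance are all assumptions.

```python
import math

def exact_coverage(n, p, margin):
    """P(|X/n - p| <= margin) for X ~ Binomial(n, p), summed from the exact pmf."""
    eps = 1e-12  # tolerance for floating-point ties at the boundary
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k / n - p) <= margin + eps
    )

def min_n_exact(confidence, margin, p_grid, n_max):
    """First n <= n_max whose exact coverage meets `confidence` at every p in p_grid.

    Returns None if no such n is found. The grid only approximates "for all p".
    """
    for n in range(1, n_max + 1):
        if all(exact_coverage(n, p, margin) >= confidence for p in p_grid):
            return n
    return None

# Exact coverage at the normal-approximation answer, for the worst case p = 1/2.
print(exact_coverage(166, 0.5, 0.10))
```

Two caveats: exact coverage is not monotone in $n$, so the first $n$ that passes is not guaranteed to keep passing for every larger $n$, and the direct pmf formula is only practical for moderate $n$ (up to roughly a thousand) before the binomial coefficients overflow a float.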