How can we stay confident when replacing the population standard deviation by its estimate?


So imagine we take $n$ random samples from a Bernoulli trial. Thus our sample is composed of binary random variables $X_1, X_2, \ldots, X_n$. By the central limit theorem we know that the distribution of $Z=\frac{\overline{X}-p}{\sigma/\sqrt{n}}$, where $\overline{X}=\frac{X_1+X_2+\cdots+X_n}{n}$, approximates a standard normal pdf when $n$ is big enough. So the probability that $Z$ lies between $-1.96$ and $1.96$ is:

$$P(-1.96\le Z\le 1.96)=P(-1.96\le \frac{\overline{X}-p}{\sigma/\sqrt{n}} \le 1.96) = 0.95$$

We also know that the standard deviation of our Binary Random Variable is $\sigma=\sqrt{p(1-p)}$. Thus:

$$P(-1.96\le \frac{\overline{X}-p}{\sqrt\frac{p(1-p)}{n}} \le 1.96) = 0.95$$

The book I'm using just replaces $p$ by its estimate $\overline{X}$ without further explanation. Why can we do that? Thus:

$$P(-1.96\le \frac{\overline{X}-p}{\sqrt\frac{\overline{X}(1-\overline{X})}{n}} \le 1.96) = 0.95$$

Rearranging the inequality, we get:

$$P(\overline{X} -1.96\sqrt\frac{\overline{X}(1-\overline{X})}{n} \le p \le \overline{X} + 1.96 \sqrt\frac{\overline{X}(1-\overline{X})}{n}) = 0.95$$

How can we still say that this holds with 95% confidence? What justifies replacing $p$ by $\overline{X}$? I mean, the 95% confidence statement is true when we use the population standard deviation, not some estimate of it. Using an unbiased estimator of the population variance would only tell us that we get 95% coverage in the long run. So, is the estimator $\overline{X}(1-\overline{X})$ for $\sigma^2$ even unbiased?
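To see what actually happens, here is a quick Monte Carlo sketch (my own illustration, not from the book): it estimates the actual coverage of the interval $\overline{X} \pm 1.96\sqrt{\overline{X}(1-\overline{X})/n}$ by repeated sampling. The function names are mine.

```python
import math
import random

def wald_covers(p, n, z=1.96):
    """Draw one Bernoulli(p) sample of size n; check whether the interval
    xbar +/- z*sqrt(xbar*(1-xbar)/n) contains the true p."""
    xbar = sum(random.random() < p for _ in range(n)) / n
    half = z * math.sqrt(xbar * (1 - xbar) / n)
    return xbar - half <= p <= xbar + half

def coverage(p, n, trials=20000, seed=0):
    """Fraction of simulated intervals that contain p."""
    random.seed(seed)
    return sum(wald_covers(p, n) for _ in range(trials)) / trials

print(coverage(0.25, 50))  # often a bit below the nominal 0.95
```

Running this for $p=0.25$, $n=50$ gives coverage noticeably below the nominal 95%, which is exactly the concern raised above.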


There are 3 best solutions below


Yes, you can do that, but with observations rather than random variables. So you don't have $\overline X$, but $\overline x$ (small $x_i$'s). To estimate $p$ you use $\hat p=\frac1{n}\cdot \sum\limits_{i=1}^n x_i=\overline x$. We start with $$P\left(-1.96\le \frac{\overline{X}-\mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$$

Then, since the inequality is symmetric in $\overline{X}-\mu$, we may put $p-\hat p$ in the numerator: we replace $\overline{X}$ by $p$, $\mu$ by $\hat p$, and $\sigma$ by its estimate $\sqrt{\hat p\cdot (1-\hat p)}$:

$$P\left(-1.96\le \frac{p-\hat p}{\sqrt{\frac{ \hat p\cdot (1-\hat p)}{ n}}} \le 1.96\right) = 0.95$$

$$P\left(\hat p-1.96\cdot \sqrt{\frac{ \hat p\cdot (1-\hat p)}{ n}}\le p \le \hat p+1.96\cdot \sqrt{\frac{ \hat p\cdot (1-\hat p)}{ n}}\right) = 0.95$$

As written above, you can denote the estimator of $p$ by the mean of the observations $\overline x$, although it is not the usual notation.

$$P\left(\overline x-1.96\cdot \sqrt{\frac{ \overline x\cdot (1-\overline x)}{ n}}\le p \le \overline x+1.96\cdot \sqrt{\frac{ \overline x\cdot (1-\overline x)}{ n}}\right) = 0.95$$
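The final formula can be wrapped in a small helper that takes the binary observations directly (a sketch; the function name is mine):

```python
import math

def wald_interval(xs, z=1.96):
    """Wald interval for p from binary observations xs,
    i.e. xbar +/- z*sqrt(xbar*(1-xbar)/n) as derived above."""
    n = len(xs)
    p_hat = sum(xs) / n                       # \hat p = \bar x
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

lo, hi = wald_interval([1, 0, 0, 1, 0, 1, 0, 0, 1, 0] * 5)  # n = 50, p_hat = 0.4
```

For these 50 observations with $\overline x = 0.4$, the interval is roughly $(0.264,\ 0.536)$.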


In fact, using the estimator $\hat p = \bar x$ in the standard error is not mandatory; it is a consequence of the normal approximation to the binomial distribution, as explained in the Wikipedia article on the binomial proportion confidence interval. More importantly, the fact that this approximation is only asymptotically valid speaks to the issues with the nominal coverage probability that your question alludes to: the actual coverage probability can be quite poor.

If we do not plug $\hat p$ into the standard error, solving the resulting inequality for $p$ leads to the Wilson score interval, as explained in the same Wikipedia article, and the result performs better than the Wald (normal) interval.

Finally, the Clopper–Pearson interval, which is constructed from the exact (scaled) binomial distribution of the sample proportion, guarantees at least the nominal coverage probability, but in doing so may produce intervals whose actual coverage is much higher than nominal.


I will divide this answer into three parts: (1) Intuition, (2) Mathematical Proof, and (3) Solution.

(1) INTUITION. So, let's start with an example.

Suppose we have $p=0.25$ and $n=50$. By definition we have that $\sigma_\overline{X}=\sqrt{\frac{0.25 \cdot 0.75}{50}} \approx 0.0612$. So $\overline{X}$ will be approximately normally distributed with mean $0.25$ and standard deviation $0.0612$. Graphically:

[figure: normal density of $\overline{X}$ centered at $0.25$]

So we have a 95% probability that $\overline{X}$ falls between $(0.25-1.96\cdot 0.0612,\ 0.25+1.96\cdot 0.0612) \approx (0.130, 0.370)$.

Now suppose we don't know the population parameters. The best we can do is estimate them. So, using the estimator $\hat{\sigma}_\overline{X}=\sqrt{\frac{\overline{X}(1-\overline{X})}{50}}$ for $\sigma_\overline{X}$, we would have 2 cases:

  1. $\overline{X} \gt 0.25$ (but below $0.75$, so that $\overline{X}(1-\overline{X}) \gt 0.25\cdot 0.75$), thus $\hat{\sigma}_\overline{X} \gt \sigma_\overline{X}$. Therefore, the interval is wider than it should be, and covers $p$ with more than 95% probability.

  2. $\overline{X} \lt 0.25$, thus $\hat{\sigma}_\overline{X} \lt \sigma_\overline{X}$. Therefore, the interval is narrower than it should be, and covers $p$ with less than 95% probability.

So answering the question: no, we no longer have an exact 95% confidence interval when we use the estimator for $\sigma_\overline{X}$. The coverage is more or less than 95% depending on the value of $\overline{X}$.

Do these two cases balance out in the long run? No. Let's show mathematically why not:

(2) MATHEMATICAL PROOF. Indeed, $\overline{X}(1-\overline{X})$ is biased, and here is the proof. First, suppose $\overline{X}$ is normally distributed with center $p$ and variance $\sigma_\overline{X}^2$. Recall that:

$$Var(\overline{X})=E[\overline{X}^2]-E[\overline{X}]^2$$ $$\sigma_\overline{X}^2=E[\overline{X}^2]-p^2$$ $$E[\overline{X}^2]=\sigma_\overline{X}^2+p^2$$

Let's take the expected value $E[\overline{X}(1-\overline{X})]$ and see whether it is biased:

$$=E[\overline{X}-\overline{X}^2]$$ $$=E[\overline{X}]-E[\overline{X}^2]$$ $$=p-E[\overline{X}^2]$$ $$=p-\sigma_\overline{X}^2-p^2$$ $$=p(1-p)-\sigma_\overline{X}^2$$

So $\overline{X}(1-\overline{X})$ is biased, with bias $-\sigma_\overline{X}^2$. With this I can answer my own question: we are not 95% confident that our interval contains $p$; on average we have less than 95% confidence. The only thing we can do is reduce the bias $-\sigma_\overline{X}^2$ by increasing the sample size, which shrinks $\sigma_\overline{X}^2=\frac{\sigma^2}{n}$. The bias also matters less when $p(1-p)$ is large, i.e. when $p$ is close to $0.5$.
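The identity $E[\overline{X}(1-\overline{X})] = p(1-p) - \sigma_\overline{X}^2$ can be checked numerically with a small simulation (my own sketch; function name is mine, and for a Bernoulli sample $\sigma_\overline{X}^2 = p(1-p)/n$):

```python
import random

def mean_of_xbar_term(p, n, trials=50000, seed=0):
    """Monte Carlo estimate of E[xbar*(1-xbar)] for Bernoulli(p) samples of size n."""
    random.seed(seed)
    total = 0.0
    for _ in range(trials):
        xbar = sum(random.random() < p for _ in range(n)) / n
        total += xbar * (1 - xbar)
    return total / trials

p, n = 0.25, 50
theory = p * (1 - p) - p * (1 - p) / n  # p(1-p) - sigma^2 of xbar
print(mean_of_xbar_term(p, n), theory)
```

The simulated mean lands close to $0.18375$ rather than $p(1-p)=0.1875$, confirming the downward bias derived above.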

(3) SOLUTION. We could use other estimators, but the standard fix for this one is what we call the margin of error. Let's analyze the function $\sqrt{\frac{\overline{X}(1-\overline{X})}{n}}$. What is the maximum value it can take? Taking its derivative with respect to $\overline{X}$, we find the maximum is attained at $\overline{X}=0.5$, so $\hat{\sigma}_\overline{X}(\overline{X}=0.5)=\frac{1}{2\sqrt{n}}$. Using this worst-case value (the margin of error) to build the interval always gives a probability of at least 95% of containing the parameter $p$, so we can safely say we are 95% confident that the interval contains $p$.
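This conservative interval is trivial to compute; here is a minimal sketch (function name mine), which by construction is always at least as wide as the plug-in interval:

```python
import math

def conservative_interval(xbar, n, z=1.96):
    """Interval using the worst-case standard error 1/(2*sqrt(n)),
    attained at xbar = 0.5; coverage is then at least the nominal level."""
    half = z / (2 * math.sqrt(n))
    return xbar - half, xbar + half

lo, hi = conservative_interval(0.3, 100)  # half-width 1.96/20 = 0.098
```

For $\overline x = 0.3$ and $n=100$ this gives $(0.202,\ 0.398)$, slightly wider than the plug-in interval but with guaranteed coverage.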