Calculation of confidence interval for population mean if population not normally distributed


When calculating a confidence interval for a population mean, it is said that, even if the population isn't normally distributed, the central limit theorem still allows us to treat $\overline{X}_n$ as approximately $\mathcal{N}(\mu,\sigma/\sqrt{n})$ (mean $\mu$, standard deviation $\sigma/\sqrt{n}$), where $\overline{X}_n=\tfrac 1n\sum_{i=1}^nX_i$ and $\mu,\sigma$ are the mean and standard deviation of the population. But how does the CLT allow that?

Indeed, if we have a sequence $X_1,\dotsc,X_n$ of i.i.d. random variables with $\operatorname{E}(X_i) = \mu$ and $\operatorname{Var}(X_i) = \sigma^2$, the central limit theorem yields $$ \lim_{n\to\infty}P\left(\frac{\overline{X}_n-\mu}{\sigma/\sqrt{n}}\leq z\right)=\Phi(z)\quad\quad(1) $$ for all $z$. Now I would like to use this result to approximate the cumulative distribution function $\Psi_n(z)$ of $\overline{X}_n$ by $\Phi(z)$ or $\Phi_{\mu,\sigma}(z)$ and thus calculate the confidence interval. Since $$ \Psi_n(\tfrac{\sigma}{\sqrt{n}}t+\mu)=P(\overline{X}_n\leq\tfrac{\sigma}{\sqrt{n}}t+\mu)=P\left(\frac{\overline{X}_n-\mu}{\sigma/\sqrt{n}}\leq t\right), $$ setting $z=t$ in (1) would yield the desired approximation $\Psi_n(\tfrac{\sigma}{\sqrt{n}}t+\mu)\approx\Phi(t)$, were it not for the fact that the point at which $\Psi_n$ is evaluated depends on $n$, which would actually require uniform convergence in (1).
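The convergence in (1) can at least be checked numerically. A minimal Python sketch (the exponential population, $n = 50$, and $z = 1$ are illustrative choices, not taken from the question):

```python
import numpy as np
from math import erf, sqrt

# Empirically check (1): for a skewed (exponential) population, the CDF of the
# standardized sample mean is already close to the standard normal CDF Phi.
rng = np.random.default_rng(0)
n, reps = 50, 100_000
mu, sigma = 1.0, 1.0                     # Exp(1) has mean 1 and SD 1
xbar = rng.exponential(mu, size=(reps, n)).mean(axis=1)
z = 1.0
lhs = np.mean((xbar - mu) / (sigma / np.sqrt(n)) <= z)   # P(Z_n <= z), estimated
phi = 0.5 * (1 + erf(z / sqrt(2)))                       # Phi(1) = 0.8413...
print(lhs, phi)
```

The two printed values agree to a couple of decimal places even at $n = 50$, which is the practical content of (1); it does not, of course, answer the uniformity question.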

So what's the trick, or did I get something wrong?

Best Answer

The best way to get a good confidence interval is to use everything you know about the population distribution. In the first three examples below, we pretend to have successively less information about the population, so that we can see the effect of knowledge and assumptions on confidence intervals.

Suppose I have $n = 50$ observations from a population safely assumed to be normal, sorted in increasing order, as displayed and summarized below:

sort(x)
 [1]  71  72  73  74  80  82  82  83  84  86  86  87  88
[14]  90  91  91  92  93  94  94  95  95  96  96  97  97
[27]  98  99 100 101 101 101 102 103 103 105 105 106 108
[40] 108 110 110 111 111 112 112 118 121 129 131

summary(x);  sd(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  71.00   88.50   97.00   97.48  105.75  131.00 
 [1] 13.62656


Estimating mean, knowing data are normal and knowing population SD: If I know these data are from a normal distribution with standard deviation $\sigma = 15,$ then a 95% confidence interval for the population mean $\mu$ is $\bar X \pm 1.96\frac{\sigma}{\sqrt{n}},$ as in your Question. This computes to $(93.32, 101.64).$

pm = c(-1,1);  mean(x) + pm*1.96*15/sqrt(50)
[1]  93.32221 101.63779
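The same known-$\sigma$ z-interval can be recomputed from the printed summary statistics; a sketch in Python (values $\bar X = 97.48$, $\sigma = 15$, $n = 50$ taken from above):

```python
from math import sqrt

# z-interval for the mean with known sigma: xbar +/- 1.96 * sigma / sqrt(n)
xbar, sigma, n = 97.48, 15, 50
half = 1.96 * sigma / sqrt(n)            # half-width, about 4.16
lo, hi = xbar - half, xbar + half
print(round(lo, 2), round(hi, 2))        # -> 93.32 101.64
```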

Estimating mean, assuming normality, but not knowing SD: If I know the data are normal but don't know $\sigma,$ then I can estimate it by the sample standard deviation $S = 13.63,$ and a confidence interval based on Student's t distribution with $\nu = 49$ degrees of freedom is $\bar X \pm t^*\frac{S}{\sqrt{n}}.$ For our data this is the interval $(93.61, 101.35).$

mean(x) + pm*qt(.975,49)*sd(x)/sqrt(50)
[1]  93.60737 101.35263
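The t-interval can likewise be sketched in Python from the printed statistics ($\bar X = 97.48$, $S = 13.62656$, $n = 50$); to keep the sketch dependency-free, the quantile qt(.975, 49) is hardcoded at its known value rather than recomputed:

```python
from math import sqrt

# t-interval for the mean with estimated SD: xbar +/- t* S / sqrt(n)
xbar, s, n = 97.48, 13.62656, 50
tstar = 2.009575                         # qt(.975, 49) from R, hardcoded here
half = tstar * s / sqrt(n)
lo, hi = xbar - half, xbar + half
print(round(lo, 2), round(hi, 2))        # -> 93.61 101.35
```

Note the t-interval is a bit wider than the z-interval, reflecting the extra uncertainty from estimating $\sigma$ by $S$.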

Estimating mean, assuming randomly sampled data: If I am not willing to assume the data are normal, I can use a nonparametric bootstrap procedure. If I knew the distribution of $D = \bar X - \mu,$ then I could find cut-off points $L$ and $U$ such that $$.95 = P(L \le D = \bar X - \mu \le U) = P(\bar X - U \le \mu \le \bar X - L),$$ so that a 95% CI for $\mu$ would be of the form $(\bar X - U, \bar X - L).$

Not knowing the distribution of $D,$ I can use a bootstrap procedure to get approximate values $L^*$ and $U^*$ of $L$ and $U,$ respectively. Temporarily, I use $\mu^*= \bar X_{\text{obs}}= 97.48$ as a proxy for $\mu.$ I take a large number $B$ of re-samples of size $n = 50$ with replacement from the data, find $\bar X^*$ for each re-sample, and use the .025 and .975 percentiles of the $B$ values $D^* = \bar X^* - \mu^*$ as $L^*$ and $U^*,$ respectively. Then, returning $\bar X_{\text{obs}}$ to its role as the mean of the data, a 95% nonparametric bootstrap CI for $\mu$ is of the form $(\bar X_{\text{obs}} - U^*, \bar X_{\text{obs}} - L^*).$

In the R code below, the starred bootstrap quantities are marked by the suffix .re. The bootstrap CI from the run shown is $(93.78,\, 101.18).$ Because this is a random process, subsequent runs (with different seeds) will give slightly different CIs.

set.seed(819);  B = 10000;  n = 50;  a.obs = mean(x)
d.re = replicate( B, mean(sample(x, n, replace=TRUE)) - a.obs )
UL = quantile(d.re, c(.975,.025))
a.obs - UL
 97.5%   2.5% 
 93.78 101.18 
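For comparison, here is a sketch of the same bootstrap in Python (numpy's with-replacement choice mirrors R's sample(..., replace=TRUE); the seed and B are illustrative, so the endpoints will differ slightly from the R run):

```python
import numpy as np

# Nonparametric bootstrap CI for the mean, mirroring the R code above.
x = np.array([71,72,73,74,80,82,82,83,84,86,86,87,88,
              90,91,91,92,93,94,94,95,95,96,96,97,97,
              98,99,100,101,101,101,102,103,103,105,105,106,108,
              108,110,110,111,111,112,112,118,121,129,131])
rng = np.random.default_rng(819)
B, n = 10_000, len(x)
a_obs = x.mean()                                    # 97.48
d_re = rng.choice(x, size=(B, n)).mean(axis=1) - a_obs   # B values of D* = Xbar* - mu*
L, U = np.quantile(d_re, [0.025, 0.975])
print(a_obs - U, a_obs - L)    # CI roughly (93.8, 101.2); varies with the seed
```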

It is typical for nonparametric bootstrap confidence intervals to be slightly shorter than t confidence intervals. The bootstrap CIs are based only on the data, which need not capture information about the tails of the normal distribution; those tails are not 'heavy' but do extend to $\pm\infty.$ [If the data really are normal, we should use a t CI, not a nonparametric bootstrap CI.]

The data were simulated using set.seed(818); x = round(rnorm(50, 100, 15)), so we know $\mu = 100,$ and thus that all three of the CIs above happen to cover the true value of $\mu.$ [Of course, $\mu$ would not be known in a real-world situation.]

Estimating mean of exponential data: Suppose we have $n = 50$ observations from an exponential population with unknown mean $\mu.$

sort(y)
 [1]   1   2   7   7   8   9  10  13  14  21  21  22  23
[14]  24  28  36  45  46  49  52  54  55  60  61  64  65
[27]  66  71  89  91  96 100 128 132 146 152 152 153 159
[40] 162 167 169 174 191 203 236 286 301 456 480
summary(y);  sd(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   23.25   64.50  103.14  152.75  480.00 
[1] 106.5341


For exponential data, it is known that $\bar Y/\mu \sim \mathsf{Gamma}(\text{shape}=n,\,\text{rate}=n).$ [This result can be proved using moment generating functions.] Thus $$.95 = P\left(L=0.742 \le \frac{\bar Y}{\mu} \le U = 1.296\right) = P\left(\frac{\bar Y}{U} < \mu < \frac{\bar Y}{L}\right),$$ so that for our data a 95% CI for $\mu$ is $(\bar Y/U,\, \bar Y/L) = (79.61, 138.96).$ [The vector y was simulated according to $\mathsf{Exp}(\mu=100).$]
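The gamma claim follows from standard facts about sums of exponentials: if $Y_1,\dotsc,Y_n$ are i.i.d. exponential with mean $\mu$ (rate $1/\mu$), then $$\sum_{i=1}^n Y_i \sim \mathsf{Gamma}(n,\,\text{rate}=1/\mu) \;\implies\; \bar Y = \tfrac 1n\sum_{i=1}^n Y_i \sim \mathsf{Gamma}(n,\,\text{rate}=n/\mu) \;\implies\; \frac{\bar Y}{\mu} \sim \mathsf{Gamma}(n,\,\text{rate}=n),$$ using the fact that dividing a gamma random variable by $c$ multiplies its rate by $c.$ The last distribution is free of $\mu,$ which is what makes $\bar Y/\mu$ a pivot.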

qgamma(c(.025,.975), 50, 50)
[1] 0.7422193 1.2956120
mean(y)/qgamma(c(.975,.025), 50, 50)
[1]  79.60717 138.96163
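Inverting the pivot in Python is one line of arithmetic once the gamma quantiles are known; here they are hardcoded from the qgamma output above so the sketch needs no stats library:

```python
# Gamma-pivot CI for an exponential mean: (ybar/U, ybar/L), where L and U are
# the .025 and .975 quantiles of Gamma(shape=50, rate=50), from qgamma above.
ybar = 103.14                    # sample mean of y, from the summary
L, U = 0.7422193, 1.2956120
lo, hi = ybar / U, ybar / L
print(round(lo, 2), round(hi, 2))    # -> 79.61 138.96
```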

By contrast, if we had taken the data to be normally distributed and used a t confidence interval, the incorrect result would have been $(72.86, 133.42).$ With a sample as large as $n = 50,$ this t CI is perhaps not catastrophically wrong, but it is wrong nevertheless.


Addendum per Comment: The following simulation shows that a "95%" t CI for $\mu$ based on $n = 50$ exponential observations covers less than 93% of the time. (With $n = 25,$ coverage is only about 91%.) It is in that long-run sense that such t CIs are 'incorrect'.

set.seed(819); m = 10^5; n = 50; LCL = UCL = numeric(m)
for (i in 1:m) {
  x = rexp(n, .01); a = mean(x);  s = sd(x)
  LCL[i] = a - 1.96*s/sqrt(n);  UCL[i] = a + 1.96*s/sqrt(n) }
mean((LCL < 100) & (UCL > 100))
[1] 0.92803
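The same coverage experiment can be sketched in vectorized Python (seed and m are illustrative; the coverage estimate will be close to, but not identical to, the R figure):

```python
import numpy as np

# Coverage of nominal "95%" normal-theory CIs for the mean of Exp(mean=100)
# samples of size n = 50: simulate m intervals and count how often they
# cover the true mean.
rng = np.random.default_rng(819)
m, n, mu = 100_000, 50, 100.0
x = rng.exponential(mu, size=(m, n))
a = x.mean(axis=1)
s = x.std(axis=1, ddof=1)            # sample SD, like R's sd()
half = 1.96 * s / np.sqrt(n)
cover = np.mean((a - half < mu) & (a + half > mu))
print(cover)                         # roughly 0.93, below the nominal 0.95
```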