When calculating the confidence interval for a population mean, it is said that, if the population isn't normally distributed, the central limit theorem still allows us to assume that $\overline{X}_n\sim\mathcal{N}(\mu,\sigma/\sqrt{n})$, where $\overline{X}_n=\tfrac 1n\sum_{i=1}^nX_i$ and $\mu,\sigma$ are the mean and standard deviation of the population. But how does the CLT allow that?
Indeed, if we have a sequence $X_1,\dotsc,X_n$ of i.i.d. random variables with $\operatorname{E}(X_i) = \mu$ and $\operatorname{Var}(X_i) = \sigma^2$, the central limit theorem yields $$ \lim_{n\to\infty}P\left(\frac{\overline{X}_n-\mu}{\sigma/\sqrt{n}}\leq z\right)=\Phi(z)\quad\quad(1) $$ for all $z$. Now I would like to use that result to approximate the cumulative distribution function $\Psi_n(z)$ of $\overline{X}_n$ by $\Phi(z)$ or $\Phi_{\mu,\sigma}(z)$ and thus calculate the confidence interval. As $$ \Psi_n(\tfrac{\sigma}{\sqrt{n}}t+\mu)=P(\overline{X}_n\leq\tfrac{\sigma}{\sqrt{n}}t+\mu), $$ setting $z=\tfrac{\sigma}{\sqrt{n}}t+\mu$ in (1) would yield the desired approximation, if it weren't for the $n$-dependency of $z$, which actually would require uniform convergence in (1).
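(For concreteness, here is a small Python simulation of the convergence in (1) that I'm referring to; the choice of exponential $X_i$ is arbitrary, just some non-normal population with known $\mu$ and $\sigma$:)

```python
import numpy as np
from scipy.stats import norm

# Compare the empirical CDF of (Xbar - mu)/(sigma/sqrt(n)) with Phi,
# using exponential X_i with mu = sigma = 1 (an arbitrary non-normal choice).
rng = np.random.default_rng(1)
mu, sigma = 1.0, 1.0

def max_cdf_gap(n, reps=100_000):
    """Largest gap, over a grid of z values, between the empirical CDF of
    the standardized mean of n exponential draws and the standard normal CDF."""
    xbar = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
    z = (xbar - mu) / (sigma / np.sqrt(n))
    grid = np.linspace(-3, 3, 121)
    ecdf = (z[:, None] <= grid).mean(axis=0)
    return np.max(np.abs(ecdf - norm.cdf(grid)))

print(max_cdf_gap(5), max_cdf_gap(50))  # the gap shrinks as n grows
```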
So what's the trick, or did I get something wrong?
The best way to get a good confidence interval is to use everything you know about the population distribution. In the first three examples below, we pretend to have successively less information about the population, so that we can see the effect of knowledge and assumptions on confidence intervals.
Suppose I have $n = 50$ observations from a population safely assumed to be normal, sorted in increasing order, as displayed and summarized below:
Estimating mean, knowing data are normal and knowing population SD: If I know these data are from a normal distribution with standard deviation $\sigma = 15,$ then a 95% confidence interval for the population mean $\mu$ is $\bar X \pm 1.96\frac{\sigma}{\sqrt{n}},$ as in your Question. This computes to $(94.63, 100.33).$
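As a quick sketch of that formula (in Python rather than R, and with a hypothetical sample mean of 100, since only the computation itself is being illustrated; $\sigma = 15$ and $n = 50$ as stated):

```python
import math

# Known-sigma z interval: xbar +/- 1.96 * sigma / sqrt(n).
# sigma = 15 and n = 50 as in the text; xbar = 100 is a hypothetical value.
def z_interval(xbar, sigma, n, z=1.96):
    half = z * sigma / math.sqrt(n)
    return (xbar - half, xbar + half)

lo, hi = z_interval(100, 15, 50)
```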
Estimating mean, assuming normality, but not knowing SD: If I know the data are normal but don't know $\sigma,$ then I can estimate it by the sample standard deviation $S = 13.63,$ and a confidence interval based on Student's t distribution with $\nu = 49$ degrees of freedom is $\bar X \pm t^*\frac{S}{\sqrt{n}},$ where $t^* \approx 2.01$ is the 0.975 quantile of that distribution. For our data this is the interval $(93.61, 101.35).$
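A short Python check of this interval, using the sample mean $\bar X_{\text{obs}} = 97.48$ reported for these data later in the answer and letting `scipy` supply the t quantile:

```python
import math
from scipy.stats import t

# t interval: xbar +/- t* S/sqrt(n), with n = 50, S = 13.63, and the
# sample mean 97.48 reported for these data elsewhere in the answer.
n, xbar, S = 50, 97.48, 13.63
tstar = t.ppf(0.975, df=n - 1)          # ~2.0096 for 49 degrees of freedom
half = tstar * S / math.sqrt(n)
ci = (round(xbar - half, 2), round(xbar + half, 2))
print(ci)  # matches the interval reported in the text: (93.61, 101.35)
```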
Estimating mean, assuming randomly sampled data: If I am not willing to assume the data are normal, I can use a nonparametric bootstrap procedure. If I knew the distribution of $D = \bar X - \mu,$ then I could find cut-off points $L$ and $U$ such that $$.95 = P(L \le \bar X - \mu \le U) = P(\bar X - U \le \mu \le \bar X - L),$$ so that a 95% CI for $\mu$ would be of the form $(\bar X - U, \bar X - L).$
Not knowing the distribution of $D,$ I can use a bootstrap procedure to get approximate values $L^*$ and $U^*$ of $L$ and $U,$ respectively. Temporarily, I use $\mu^*= \bar X_{\text{obs}}= 97.48$ as a proxy for $\mu.$ Then I take a large number $B$ of re-samples of size $n = 50$ with replacement from the data, find $\bar X^*$ for each re-sample, and take percentiles .025 and .975 of the $B$ values $D^* = \bar X^* - \mu^*$ as $L^*$ and $U^*,$ respectively. Then, returning $\bar X_{\text{obs}}$ to its role as the mean of the data, a 95% nonparametric bootstrap CI for $\mu$ is of the form $(\bar X_{\text{obs}} - U^*, \bar X_{\text{obs}} - L^*).$
In the R code below, $*$'s representing bootstrap quantities are shown by `.re`. The bootstrap CI from the run shown is $(93.78,\, 101.18).$ Because this is a random process, subsequent runs (with different seeds) will give slightly different CIs. It is typical for nonparametric bootstrap confidence intervals to be slightly shorter than t confidence intervals: the bootstrap CI is based only on the data, which do not necessarily capture information about the tails of the normal distribution, which are not 'heavy' but do extend to $\pm\infty.$ [If the data really are normal, we should use a t CI, not a nonparametric bootstrap CI.]
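Since the original sample values are not listed here, the procedure can be sketched in Python with stand-in data simulated to match the description of the sample (50 rounded normal values with mean 100 and SD 15); the resulting interval will therefore differ somewhat from the $(93.78, 101.18)$ above:

```python
import numpy as np

# Percentile bootstrap for D = Xbar - mu, as described in the text.
# The stand-in data below are NOT the original sample, just simulated
# to match its description (50 rounded normal values, mean 100, SD 15).
rng = np.random.default_rng(818)
x = np.round(rng.normal(100, 15, size=50))
B = 10_000

xbar_obs = x.mean()                          # plays the role of mu* first
boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(B)])
d_star = boot_means - xbar_obs               # bootstrap values of D
L_star, U_star = np.quantile(d_star, [0.025, 0.975])
ci = (xbar_obs - U_star, xbar_obs - L_star)  # 95% bootstrap CI for mu
```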
The data were simulated using `set.seed(818); x = round(rnorm(50, 100, 15))`, so we know $\mu = 100,$ and thus that all three of the CIs above happen to cover the true value of $\mu.$ [Of course, $\mu$ would not be known in a real-world situation.]

Estimating mean of exponential data: Suppose we have $n = 50$ observations from an exponential population with unknown mean $\mu.$
For exponential data, it is known that $\bar Y/\mu \sim \mathsf{Gamma}(\text{shape}=n,\,\text{rate}=n).$ [This result can be proved using moment generating functions.] Thus $$.95 = P\left(L=0.742 \le \frac{\bar Y}{\mu} \le U = 1.296\right) = P\left(\frac{\bar Y}{U} < \mu < \frac{\bar Y}{L}\right),$$ so that for our data a 95% CI for $\mu$ is $(\bar Y/U,\, \bar Y/L) = (79.60, 138.96).$ [The vector `y` was simulated according to $\mathsf{Exp}(\mu=100).$] By contrast, if we had taken the data to be normally distributed and used a t confidence interval, the incorrect result would have been $(72.86, 133.42).$ With a sample as large as $n = 50,$ this t CI is perhaps not catastrophically wrong -- but wrong nevertheless.
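A sketch of this computation in Python (`scipy` parameterizes the gamma distribution by shape and scale $= 1/\text{rate}$; the sample mean of `y` is not stated above, so the value $\bar Y \approx 103.13$ used here is back-computed from the reported interval and should be treated as hypothetical):

```python
from scipy.stats import gamma

# Gamma-based CI for exponential data: Ybar/mu ~ Gamma(shape=n, rate=n).
n = 50
L = gamma.ppf(0.025, a=n, scale=1/n)   # ~0.742
U = gamma.ppf(0.975, a=n, scale=1/n)   # ~1.296

# ybar is back-computed from the reported interval (79.60, 138.96),
# since the sample mean itself isn't stated in the text (hypothetical).
ybar = 103.13
ci = (ybar / U, ybar / L)              # 95% CI for mu
```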
Addendum per Comment: The following simulation shows that a "95%" t CI for $\mu$ based on $n = 50$ exponential observations covers less than 93% of the time. (With $n = 25,$ coverage is only about 91%.) It is in that long-run sense that such t CIs are 'incorrect'.
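A simulation along these lines can be sketched in Python (assumed setup: $\mu = 100$ and 100,000 replications; the exact coverage figure will vary slightly with the seed):

```python
import numpy as np
from scipy.stats import t

# Estimate the true coverage of a nominal 95% t CI for mu when the
# data are exponential with mean mu = 100 and n = 50.
rng = np.random.default_rng(2024)
n, mu, reps = 50, 100.0, 100_000
tstar = t.ppf(0.975, df=n - 1)

y = rng.exponential(scale=mu, size=(reps, n))
ybar = y.mean(axis=1)
s = y.std(axis=1, ddof=1)
half = tstar * s / np.sqrt(n)
coverage = np.mean((ybar - half <= mu) & (mu <= ybar + half))
print(coverage)  # roughly 0.93, below the nominal 0.95
```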