Confidence Interval Intuition Conflict

I am trying to understand confidence intervals (CIs) from the simplest article I could find about them. I followed it up to a point and then got stuck at the crucial step. Suppose we decide the confidence level we want is 95%; then

95% of all "95% Confidence Intervals" will include the true mean.

This is what many people infer incorrectly (assuming a confidence interval says that there is a 95% chance the true mean lies within it). I can avoid this trap by focusing on the highlighted sentence above. However, when I work through it step by step, I get stuck on how this could be the case.

  1. Suppose I have a population distribution $Y$ with mean $\mu$ and standard deviation $\sigma$. For brevity, let it already be normal.
  2. I take a 1st sample set of size $n_1$, described by random variable $X_1$, with values $\{x_1, x_2, \cdots, x_{n_1}\}$ picked from the population. I find its mean $\overline{X_1} = \frac{x_1 + x_2 + \cdots + x_{n_1}}{n_1}$ and standard deviation $S_1$. For the moment, let us say it is normal.
  3. Similarly, a 2nd sample set of size $n_2$, described by random variable $X_2$, with values $\{x_1, x_2, \cdots, x_{n_2}\}$ picked from the population. I find its mean $\overline{X_2} = \frac{x_1 + x_2 + \cdots + x_{n_2}}{n_2}$ and standard deviation $S_2$. Again we assume it is normal.

I decide I want confidence level of 95%.

  1. If I transform my population distribution to the standard normal $Z$, then 95% of the area lies within $Z = \pm 1.96$. Since $Z = \dfrac {Y-\mu}{\sigma}$, in the original population distribution 95% of data points fall within $Y = \mu \pm 1.96\sigma$: $$ \color{blue}{\Pr(\mu-1.96\sigma < Y < \mu+1.96\sigma) = 0.95} \tag{1} $$
  2. If I transform my sample set $n_1$ to standard $Z$ (because we assume it is normal), again 95% of the $n_1$ data points fall within $\overline{X_1} \pm 1.96S_1$: $$ \color{blue}{\Pr(\overline{X_1}-1.96S_1 < X_1 < \overline{X_1}+1.96S_1) = 0.95} \tag{2} $$
  3. If I transform my sample set $n_2$ to standard $Z$, again 95% of the $n_2$ data points fall within $\overline{X_2} \pm 1.96S_2$: $$ \color{blue}{\Pr(\overline{X_2}-1.96S_2 < X_2 < \overline{X_2}+1.96S_2) = 0.95} \tag{3} $$
  4. Obviously, I would take many sample sets $n_3, n_4, n_5, \cdots, n_k$, so my eventual sampling distribution of sample means, described by random variable $X$, would be normal, with mean $\overline{X} \rightarrow \mu$ and standard deviation $S \rightarrow \dfrac{\sigma}{\sqrt{n}}$: $$ \color{blue}{\Pr(\overline{X}-1.96S < X < \overline{X}+1.96S) = 0.95} \tag{4} $$ $$ \color{blue}{\Pr\left(\mu-1.96\dfrac{\sigma}{\sqrt{n}} < X < \mu+1.96\dfrac{\sigma}{\sqrt{n}}\right) = 0.95} \tag{5} $$
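
The sampling-distribution claim in step 4 can be checked numerically. Below is a minimal sketch (assuming NumPy is available; the values of $\mu$, $\sigma$, $n$ and the number of sample sets are purely illustrative) that draws many sample sets and confirms the mean and spread of the sample means:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, k = 10.0, 2.0, 25, 100_000  # illustrative values (assumption)

# Draw k sample sets of size n from Normal(mu, sigma) and take each set's mean.
means = rng.normal(mu, sigma, size=(k, n)).mean(axis=1)

print(means.mean())       # close to mu = 10
print(means.std(ddof=0))  # close to sigma / sqrt(n) = 0.4
```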

My questions:

  1. Each sample set $n_k$ has its own interval derived from its mean $\overline{X_k}$ and standard deviation $S_k$. How come, when I take many of them, we can suddenly say that 95% of all those individual confidence intervals will contain the true population mean $\mu$? What is the missing link here? Below is my derivation; is it correct, and can we say that because of it, it is thus proved that 95% of the CIs will contain $\mu$?

From eq. $5$,
$\Pr(\mu-1.96\dfrac{\sigma}{\sqrt{n}} < X < \mu+1.96\dfrac{\sigma}{\sqrt{n}}) = 0.95$

Adding $-\mu$ throughout the inequality,
$\Pr(-\mu + \mu-1.96\dfrac{\sigma}{\sqrt{n}} < -\mu + X < -\mu + \mu+1.96\dfrac{\sigma}{\sqrt{n}}) = 0.95$
$\Pr(-1.96\dfrac{\sigma}{\sqrt{n}} < X - \mu < 1.96\dfrac{\sigma}{\sqrt{n}}) = 0.95$

Adding $-X$ throughout the inequality, $\Pr(-X-1.96\dfrac{\sigma}{\sqrt{n}} < -X+X - \mu < -X+1.96\dfrac{\sigma}{\sqrt{n}}) = 0.95$
$\Pr(-X-1.96\dfrac{\sigma}{\sqrt{n}} < - \mu < -X+1.96\dfrac{\sigma}{\sqrt{n}}) = 0.95$

Multiplying throughout by $-1$ (which reverses the inequalities), $\Pr(X+1.96\dfrac{\sigma}{\sqrt{n}} > \mu > X-1.96\dfrac{\sigma}{\sqrt{n}}) = 0.95$

This is the same as
$$\color{blue}{ \Pr(X-1.96\dfrac{\sigma}{\sqrt{n}} < \mu < X+1.96\dfrac{\sigma}{\sqrt{n}}) = 0.95 \tag{6} } $$

Eq. $6$ simply means: when we take an enormous number of samples to arrive at the sampling distribution of sample means described by $X$, the probability that $\mu$ lies within the interval $X \pm 1.96\dfrac{\sigma}{\sqrt{n}}$ is 95%.
Also, 95% of the sample mean values $\overline{X_k}$ fall within this same interval $X \pm 1.96\dfrac{\sigma}{\sqrt{n}}$.
Because of this, can we also say that 95% of the CIs associated with the $\overline{X_k}$ fall within this same interval $X \pm 1.96\dfrac{\sigma}{\sqrt{n}}$?
I think I am closing in on the missing link, but it still eludes me. Kindly help here.
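
The "95% of all CIs" statement can also be seen empirically: compute one interval $\overline{X_k} \pm 1.96\frac{\sigma}{\sqrt{n}}$ per sample set and count how often it contains $\mu$. A minimal sketch (assuming NumPy and a known $\sigma$; the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, k = 10.0, 2.0, 25, 20_000  # illustrative values (assumption)
half = 1.96 * sigma / np.sqrt(n)         # half-width with known sigma

xbars = rng.normal(mu, sigma, size=(k, n)).mean(axis=1)

# Each sample set gets its own interval; count how many contain mu.
covered = (xbars - half < mu) & (mu < xbars + half)
print(covered.mean())  # close to 0.95
```

Each interval is centered at a different $\overline{X_k}$; it is the collection of intervals, not any single one, that has the 95% property.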

  2. Since there are many sample sets to be calculated to arrive at the sampling distribution, do we divide by $n$ or by $n-1$ (unbiased) for each sample set? (They will influence the CI calculation.)
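
On the $n$ versus $n-1$ point: dividing by $n-1$ (Bessel's correction) makes the sample variance an unbiased estimator of $\sigma^2$. A minimal sketch (assuming NumPy; the population and sizes are illustrative) showing the bias of the $n$ divisor at small $n$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 200_000                            # small n makes the bias visible
samples = rng.normal(0.0, 2.0, size=(k, n))  # true variance sigma^2 = 4

var_n = samples.var(axis=1, ddof=0)   # divide by n   (biased low)
var_n1 = samples.var(axis=1, ddof=1)  # divide by n-1 (unbiased)

print(var_n.mean())   # close to sigma^2 * (n-1)/n = 3.2
print(var_n1.mean())  # close to sigma^2 = 4.0
```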

  3. What happens to the above questions when I do not start with a normal population distribution (say, uniform or Bernoulli instead)? The eventual sampling distribution might be normal, but we are talking about the few sample sets at the beginning, for which we calculate the confidence intervals. I ask this because the intermediate $Z$ transformation I described earlier would not be possible, as those sample sets may not be normally distributed.
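
For the non-normal case, the CLT suggests the interval for the sample mean still works approximately even when the population is far from normal. A minimal sketch (assuming NumPy) using a Bernoulli population, with illustrative $p$ and $n$:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, k = 0.3, 50, 20_000            # illustrative values (assumption)
mu, sigma = p, np.sqrt(p * (1 - p))  # Bernoulli mean and sd

xbars = rng.binomial(1, p, size=(k, n)).mean(axis=1)
half = 1.96 * sigma / np.sqrt(n)

# Coverage of the normal-theory interval despite the non-normal population.
covered = ((xbars - half < mu) & (mu < xbars + half)).mean()
print(covered)  # close to 0.95 by the CLT
```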

There are 4 answers below.

Best answer:

While the article you refer to correctly defines the concept of a confidence interval (your highlighted text), it does not correctly treat the case of a normal distribution with unknown standard deviation. You may want to search for "Neyman confidence interval" to see an approach that produces confidence intervals with the property you highlighted.

For each true value of the parameter of interest, the Neyman procedure selects a region containing 95% of outcomes. The confidence interval is then the union of all parameter values for which the observation lies within the selected region. The probability that the observation falls within the region selected for the true parameter value is 95%, and precisely for those observations will the confidence interval contain the true value. The procedure therefore guarantees the property you highlight.

If the standard deviation is known and not a function of the mean, the Neyman central confidence intervals turn out to be identical to those described in the article.


Thank you for the link to Neyman's book - interesting to read from the original source! You ask for a simple description, but that is what my second paragraph was meant to be. Perhaps a few examples will help illustrate: Examples 1 and 1b could be considered trivial, whereas Example 2 would not be handled correctly by the article you refer to.

Example 1. Uniform random variable. Let X follow a uniform distribution, $$f(x)=1/2 {\mathrm{\ \ for\ \ }}\theta-1\le x\le \theta+1 $$ and zero otherwise. We can make a 100% confidence interval for $\theta$ by considering all possible outcomes $x$, given $\theta$, ie. $x \in [\theta-1,\theta+1]$. Now consider an observed value, $x_0$. The union of all possible values of $\theta$ for which $x_0$ is a possible outcome is $[x_0-1,x_0+1]$. That is the 100% confidence interval for $\theta$ for this problem.

Example 1b. Uniform random variable. Let X follow the same uniform distribution. We can make a 95% central confidence interval for $\theta$ by selecting the 95% central outcomes $x$, given $\theta$, ie. $x \in [\theta-0.95,\theta+0.95]$. Now consider an observed value, $x_0$. The union of all possible values of $\theta$ for which $x_0$ is within the selected range is $[x_0-0.95,x_0+0.95]$. That is the 95% confidence interval for $\theta$ for this problem.
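
Example 1b's guarantee can be checked by simulation: draw $x_0$ given a fixed true $\theta$, form $[x_0-0.95, x_0+0.95]$, and count how often it contains $\theta$. A minimal sketch (assuming NumPy; the value of $\theta$ is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
theta, k = 3.0, 100_000  # illustrative true parameter (assumption)

# X is uniform on [theta - 1, theta + 1].
x0 = rng.uniform(theta - 1, theta + 1, size=k)

# The 95% interval from Example 1b: [x0 - 0.95, x0 + 0.95].
covered = ((x0 - 0.95 < theta) & (theta < x0 + 0.95)).mean()
print(covered)  # close to 0.95
```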

Example 2. Uniform random variable. Let X follow a uniform distribution, $$f(x)=1/\theta {\mathrm{\ \ for\ \ }}{1\over2}\theta \le x \le {3\over2}\theta $$ and zero otherwise. We can make a 100% confidence interval for $\theta$ by considering all possible outcomes $x$, given $\theta$, ie. $x \in [{1\over2}\theta,{3\over2}\theta]$. Now consider an observed value, $x_0$. The union of all possible values of $\theta$ for which $x_0$ is a possible outcome is $[{2\over3}x_0,2x_0]$. That is the 100% confidence interval for $\theta$ for this problem. (You can confirm this by inserting the endpoints of the confidence interval into the pdf and see they are at the boundaries of the pdf). Note that the central confidence interval is not centered on the point estimate for $\theta$, $\hat\theta = x_0$.
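
Example 2's inversion can be verified numerically: scan candidate $\theta$ values and keep those for which the observed $x_0$ lies in the support $[\theta/2, 3\theta/2]$. A minimal sketch (assuming NumPy; the observed value $x_0$ is illustrative):

```python
import numpy as np

x0 = 6.0  # illustrative observation (assumption)

# Keep every candidate theta for which x0 is a possible outcome,
# i.e. theta/2 <= x0 <= 3*theta/2.
thetas = np.linspace(0.1, 4 * x0, 100_000)
accepted = thetas[(thetas / 2 <= x0) & (x0 <= 3 * thetas / 2)]

# Example 2 says the 100% CI is [2/3 * x0, 2 * x0] = [4.0, 12.0] here.
print(accepted.min(), accepted.max())
```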

Example 3. Normal distribution with mean $\theta$ and standard deviation $1$. The 68% central confidence interval would be constructed identically to example 1, that is the selected region for $X$ would be $[\theta-1,\theta+1]$. The 68% central confidence interval is therefore the same as in Example 1, $[x_0-1,x_0+1]$. You can extend this to 95% and arbitrary KNOWN standard deviation $\sigma$ to be $[x_0-1.96\sigma,x_0+1.96\sigma]$.

Example 4. Normal distribution with mean $\theta$ and standard deviation $\theta/2$. The 68% central confidence interval would be constructed identically to example 2. The 68% central confidence interval for $\theta$ is therefore the same as in Example 2, $[{2\over3}x_0,2x_0]$.

The authors of the article you refer to and the other commenters to your question would not get Example 2 or 4 right. Only following a procedure like Neyman's will the confidence interval have the property that you highlighted in your post. The other methods are approximations for the general problem of building confidence intervals.

The exact solution to the problem with a normal distribution and UNKNOWN standard deviation is more difficult to work out than the examples above.

Another answer:

So I'll answer your third question first. You are looking for the distribution of observations (or test statistics) that are likely to occur given the true parameter, $\mu$. So even if the data do not follow a normal distribution, you still calculate the test statistic, and by the Central Limit Theorem the distribution of test statistics converges (in distribution) to a normal distribution centered at the true parameter $\mu$ with standard deviation equal to the true $\frac{\sigma}{\sqrt{n}}$. So, by the theorem, we have a normal distribution with those parameters, and we observe how much our test statistic deviates from the true parameter.

Since we don't know the true parameters, we create a normal distribution centered at our test statistic, with standard deviation equal to that of the data divided by $\sqrt{n}$. We do this with the hope that, at significance level $\alpha$, such an interval captures the true parameter with probability $1 - \alpha$. In this case $\alpha = 0.05$ and $1 - \alpha = 0.95$; our test statistic is $\bar X$ (or $\hat \mu$), and the true parameters are $\mu$ and $\sigma$.

So in our first test we collect data, find $\bar X_1$ with $\hat \sigma$, and construct a 95% CI for it; in our second test we collect new data, find $\bar X_2$ with $\hat \sigma$, and construct another 95% CI; in the third test we collect new data and find $\bar X_3$ with $\hat \sigma$; and so on. If we conduct infinitely many tests, always collecting new data, 95% of such intervals should contain the true $\mu$. This is because of the CLT and the choice to cast a "net" of 95% around each $\bar X$ (so 5% of the nets will miss).

That should also answer question 1. I'm not sure what you mean by question 2, unless you mean dividing $\sigma$ by $\sqrt{n-1}$ rather than $\sqrt{n}$. But I'm fairly sure that has less to do with degrees of freedom than with solving for the Fisher information $n$ times.

Ultimately, there are several ways to understand confidence intervals (and, related to them, p-values). You should keep reading about them, but learn to distinguish when one source takes a different approach from another. My approach is based on my stats professor, Anthony Donoghue (though I may have misunderstood him, so if anything is wrong, it's the fault of my inability to pay attention), and the textbook *All of Statistics* by Larry Wasserman.

Another answer:

Let me address your question item by item:

  1. If I transfer my population distribution to Z standard deviation, then 95% area occurs at $Z= \pm 1.96$. Since $Z = \dfrac {Y-\mu}{\sigma}$, in original population distribution, 95% data points fall within $Y = \mu \pm 1.96\sigma$. $$ \color{blue}{\Pr(\mu-1.96\sigma < Y < \mu+1.96\sigma) = 0.95} \tag{1} $$

This is correct.

  1. If I transform my sample set $n_1$ to standard $Z$ (because we assume it is normal), again 95% of the $n_1$ data points fall within $\overline{X_1} \pm 1.96S_1$: $$ \color{blue}{\Pr(\overline{X_1}-1.96S_1 < X_1 < \overline{X_1}+1.96S_1) = 0.95} \tag{2} $$

This is problematic because you have not defined the meaning of $X_1$. You have defined a sample mean $\bar X_1 = (Y_1 + \cdots + Y_{n_1})/n_1$, but it is not clear what you mean by $X_1$. Moreover, $\bar X_1$ and $S_1$ are statistics, and as such are random variables, not parameters.

A correct statement would be something like $$\Pr\left[\mu - 1.96 \frac{\sigma}{\sqrt{n_1}} < \bar X_1 < \mu + 1.96 \frac{\sigma}{\sqrt{n_1}}\right] = 0.95, \tag{2a}$$ where here we have used the fact that $$\bar X_1 \sim \operatorname{Normal}(\mu, \sigma/\sqrt{n_1}),$$ being the sample mean of $n_1$ independent and identically distributed normal random variables with mean $\mu$ and standard deviation $\sigma$.

Another correct statement is $$\Pr\left[\bar X_1 - t^*_{n_1-1,0.975} \frac{S_1}{\sqrt{n_1}} < \mu < \bar X_1 + t^*_{n_1-1,0.975} \frac{S_1}{\sqrt{n_1}}\right] = 0.95, \tag{2b}$$ where $t^*_{n_1-1,0.975}$ is the critical value of the Student's $t$ distribution with $n_1 - 1$ degrees of freedom; i.e., it is the $97.5\%$ quantile satisfying $$\Pr[T \le t^*_{n_1-1,0.975}] = 0.975.$$ The first statement (2a) is about the two-sided probability of the sampling distribution. The second statement (2b) pertains to the coverage probability of a confidence interval constructed from the estimates of the mean and standard deviation of the data.
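
The difference between (2a) and (2b) shows up numerically: with $\sigma$ unknown and estimated by $S$, using 1.96 under-covers for small $n$, while the $t$ critical value restores 95% coverage. A minimal sketch (assuming NumPy; the critical value $t^*_{9,0.975} \approx 2.262$ is taken from standard tables, and the parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, k = 10.0, 2.0, 10, 20_000  # illustrative values (assumption)
t_crit = 2.262                           # t quantile, 97.5%, n-1 = 9 df

samples = rng.normal(mu, sigma, size=(k, n))
xbars = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)          # estimated sigma, per sample set

cov_t = (np.abs(xbars - mu) < t_crit * s / np.sqrt(n)).mean()
cov_z = (np.abs(xbars - mu) < 1.96 * s / np.sqrt(n)).mean()
print(cov_t)  # close to 0.95 (interval 2b)
print(cov_z)  # noticeably below 0.95 for n = 10
```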

  1. If I transform my sample set $n_2$ to standard $Z$, again 95% of the $n_2$ data points fall within $\overline{X_2} \pm 1.96S_2$: $$ \color{blue}{\Pr(\overline{X_2}-1.96S_2 < X_2 < \overline{X_2}+1.96S_2) = 0.95} \tag{3} $$

See above.

  1. Obviously, I would take many sample sets $n_3, n_4, n_5, \cdots, n_k$, so my eventual sampling distribution has $\overline{X} \rightarrow \mu$ and $S \rightarrow \sigma$ $$ \color{blue}{\Pr(\overline{X}-1.96S < X < \overline{X}+1.96S) = 0.95} \tag{4} $$ $$ \color{blue}{\Pr(\mu-1.96\sigma < X < \mu+1.96\sigma) = 0.95} \tag{5} $$

Again, your notation is unclear because you have not precisely defined what you mean by $\bar X$, $X$, and $S$.

The rest of your questions should not be addressed until you have understood the meaning of, and difference between, equations (2a) and (2b) as I have written them, and after you have defined your notation in terms of the underlying population distribution $Y$.


In so far as inverting the test statistic to obtain a $100(1-\alpha)\%$ confidence interval, suppose the population standard deviation $\sigma$ is known. Then $$Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim \operatorname{Normal}(0,1)$$ is a pivotal quantity. Consequently, $$\Pr\left[|Z| < z^*_{\alpha/2} \right] = 1 - \alpha,$$ where $z^*_{\alpha/2}$ is the critical value. Hence $$\begin{align*} 1 - \alpha &= \Pr\left[-z^*_{\alpha/2} < Z < z^*_{\alpha/2} \right] \\ &= \Pr\left[z^*_{\alpha/2} \frac{\sigma}{\sqrt{n}} > -\bar X + \mu > -z^*_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right] \\ &= \Pr\left[\bar X - z^*_{\alpha/2} \frac{\sigma}{\sqrt{n}} < \mu < \bar X +z^*_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right].\end{align*}$$ What you wrote in your question is pretty much the same thing. When the data are known to be drawn from a normal distribution, then the pivotal quantity is exactly standard normal. If not, for sufficiently large $n$, it is asymptotically normal.
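
The algebra above just rearranges one event into another; the two descriptions pick out the same set of outcomes. A minimal sketch (assuming NumPy, with illustrative parameters) checking the equivalence draw by draw:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n, z = 3.0, 1.5, 16, 1.96  # illustrative values (assumption)
se = sigma / np.sqrt(n)

xbars = rng.normal(mu, se, size=100_000)  # sampling distribution of the mean
Z = (xbars - mu) / se

pivot_event = np.abs(Z) < z                               # |Z| < z*
ci_event = (xbars - z * se < mu) & (mu < xbars + z * se)  # mu in the CI

# Identical events, up to floating-point rounding at the boundary.
print((pivot_event == ci_event).mean())
print(ci_event.mean())  # close to 0.95
```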

Another answer:

The theory works as follows:

  • you know the theoretical distribution of the samples, but there are unknown parameters which you want to estimate;

  • by probability computation, given a particular sample observation $\{ x_1, \cdots, x_n\}$, you can estimate the conditional distribution of, say, the mean $\mu$: $f_M(\mu \mid x_1, \cdots, x_n)$.

Now knowing this distribution, you can compute the probability that $\mu$ lies in a certain interval, or conversely in what interval $\mu$ lies with a given probability, say $95\%$.

As we are in a probabilistic world, if we repeat this experiment, on average the mean will truly be in the confidence interval $95\%$ of the time.

Note that for every experiment the sample will be different, and so will the estimated distribution of $\mu$, as well as the confidence interval. But as the confidence level is always $95\%$, the success rate remains $95\%$.