Creating a confidence interval when sigma is unknown

This is something that has plagued me for months now, and I am not really able to explain it to anyone. I know how to CALCULATE the confidence interval when sigma is unknown, but I don't get how it works. My problem is a conceptual one: I cannot get an intuitive understanding of how it works. I have been reading and watching videos that tell you how to calculate a confidence interval, but I still fail to understand how it works. What piece of information am I missing?

What I fail to understand is this: I have a SAMPLE. How can I figure out where the population mean is likely to lie just by using a sample?

I was doing a question today which said that in a test we had 16 independent observations with a mean of 100 and an SD of 24, and I had to find the confidence interval at the 0.05 level. I followed the steps below like a robot and found the confidence interval, but I don't really know how it worked.

I found SEm = s/√N, which is 24/√16 = 6

df = N - 1 = 16 - 1 = 15

Then I read a t table, which gave the values 2.13 (0.05 level) and 2.95 (0.01 level)

Then,

M ± 2.13 × SEm = 100 ± 2.13 × 6 = 100 ± 12.78

This means that there are only 5 chances out of 100 that the population mean lies outside the limits 87.22 and 112.78.

I mean, I can calculate this, but I feel like I just followed some steps to figure it out. I don't know how or why they work. To me it seems impossible that with just a SAMPLE I can figure out where the population mean is likely to lie. Can anyone explain to me conceptually how this happens? It seems like magic to me. Why is a t distribution able to find the likelihood of sigma? How can we find out something about a population about which we have no information?

There is 1 best solution below

If you know the population standard deviation $\sigma,$ then you can do a z confidence interval as follows: Suppose you have $\bar X = 102$ based on $n = 16$ randomly chosen subjects from a normal population $\mathsf{Norm}(\mu, \sigma).$

The distance between $\bar X$ and $\mu$. The observed value for each individual subject has distribution $\mathsf{Norm}(\mu, \sigma).$ Also, if you have $n = 16$ subjects, then their average has distribution $\bar X \sim \mathsf{Norm}(\mu, \sigma/\sqrt{n}) \equiv \mathsf{Norm}(\mu, \sigma/4).$
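
A quick simulation can make the claim about $\bar X$ more concrete. The sketch below uses Python's standard library (not R); the population values $\mu = 100$ and $\sigma = 22$, the seed, and the helper name `simulate_sample_means` are assumptions chosen only for illustration. It draws many samples of size $n = 16$ and compares the spread of their means with $\sigma/\sqrt{n}.$

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` samples of size n from Norm(mu, sigma); return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
            for _ in range(reps)]

# Assumed (illustrative) population: mu = 100, sigma = 22; n = 16 as in the text.
means = simulate_sample_means(mu=100, sigma=22, n=16, reps=20000)

sd_of_means = statistics.stdev(means)   # empirical spread of the sample means
theory = 22 / 16 ** 0.5                 # sigma / sqrt(n) = 5.5

print(f"SD of simulated means: {sd_of_means:.2f} (theory: {theory:.2f})")
```

The individual observations vary with standard deviation 22, but the sample means should cluster around $\mu$ with a spread close to $22/4 = 5.5.$ That tighter clustering is what lets a single sample say something about $\mu.$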

Moreover, if you know that $\sigma = 22,$ then the distribution of $\bar X$ becomes $\mathsf{Norm}(\mu, \sigma/\sqrt{n} = 22/4 = 5.5).$ In order to get a 95% confidence interval for $\mu,$ you can standardize to get
$$\begin{aligned}
0.95 &= P\left(-1.96 \le Z = \frac{\bar X - \mu}{5.5} \le 1.96\right)\\
&= P\left(-1.96(5.5) \le \bar X - \mu \le 1.96(5.5)\right)\\
&= P\left(1.96(5.5) \ge \mu - \bar X \ge -1.96(5.5)\right)\\
&= P\left(-1.96(5.5) \le \mu - \bar X \le 1.96(5.5)\right)\\
&= P\left(\bar X - 1.96(5.5) \le \mu \le \bar X + 1.96(5.5)\right).
\end{aligned}$$

The general idea of the last few displayed lines and of the formula for the confidence interval below is that, with high probability, the sample mean $\bar X$ and the population mean $\mu$ are not farther apart than $1.96\sigma/\sqrt{n},$ which is known as the 'margin of error'.

From the last version of the event with 95% probability, we get a 95% confidence interval for $\mu$ of the form: $\bar X \pm 1.96(5.5)$ or $\bar X \pm 10.78.$ Finally, if you get data and find $\bar X = 102$ then the CI is $(91.22, 112.78).$ Recall that the general formula, without the numbers plugged in, is $\bar X \pm 1.96\sigma/\sqrt{n}.$
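
The z interval can also be computed directly. This is a minimal Python sketch (standard library only); the function name `z_confidence_interval` is hypothetical, and the inputs are the values used in this answer ($\bar X = 102,$ $\sigma = 22,$ $n = 16$).

```python
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, level=0.95):
    """Confidence interval for mu when the population SD sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for a 95% level
    margin = z * sigma / n ** 0.5                   # margin of error
    return xbar - margin, xbar + margin

low, high = z_confidence_interval(xbar=102, sigma=22, n=16)
print(f"95% z interval: ({low:.2f}, {high:.2f})")
```

Rounded to two decimals, this should agree with the interval $(91.22, 112.78)$ found above.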

When $\sigma$ is unknown and estimated by $S$. If the population standard deviation $\sigma$ is not known, then we use the sample standard deviation $S$ as an estimate. Say $S = 24.1.$ However, you can't just plug $S$ for $\sigma$ into the formula above. Your confidence interval needs to be a little longer now, expressing a little less certainty, in order to make up for using the estimate $S$ instead of $\sigma.$

This adjustment is made by using Student's t distribution. If you have $n = 16$ observations, then the t distribution has $\nu = n-1 = 16 - 1 = 15$ degrees of freedom. [That number comes from the $n-1$ in the denominator when you find $S = \sqrt{\frac{\sum_i (X_i - \bar X)^2}{n-1}}$ from data.] Looking in row $\nu = 15$ of a printed table of t distributions, you find that the substitute value for $1.96$ is the slightly larger number $2.131,$ which cuts probability $0.025 = 2.5\%$ from the upper tail of the t distribution.

Thus, instead of using the z confidence interval $\bar X \pm 1.96\sigma/\sqrt{16},$ where $\sigma$ is known, you use the t confidence interval $\bar X \pm 2.131\,S/\sqrt{16}.$
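
One way to see why the larger multiplier is needed is to check how often each interval actually captures $\mu$ in repeated sampling. The following Python sketch is a simulation under assumed values: the population $\mathsf{Norm}(100, 22),$ the seed, and the function name `coverage` are illustrative choices, not part of the original problem. Both runs use the same simulated samples and differ only in the multiplier.

```python
import math
import random

def coverage(multiplier, mu=100.0, sigma=22.0, n=16, reps=20000, seed=1):
    """Fraction of intervals xbar +/- multiplier * S / sqrt(n) that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        # Sample SD with n - 1 in the denominator (15 degrees of freedom here).
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        margin = multiplier * s / math.sqrt(n)
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / reps

cover_z = coverage(1.96)     # z multiplier with S plugged in for sigma
cover_t = coverage(2.131)    # t multiplier for 15 degrees of freedom
print(f"1.96  * S/sqrt(n) covers mu in {cover_z:.3f} of samples")
print(f"2.131 * S/sqrt(n) covers mu in {cover_t:.3f} of samples")

# The t interval for the numbers used in this answer (xbar = 102, S = 24.1).
margin = 2.131 * 24.1 / math.sqrt(16)
print(f"t interval: ({102 - margin:.2f}, {102 + margin:.2f})")
```

Under these assumptions, the interval built with $1.96$ and $S$ should capture $\mu$ noticeably less often than 95% of the time, while the interval built with $2.131$ should come close to 95%. The procedure never learns $\mu$ from one sample; it is calibrated so that, across many samples, about 95% of the intervals it produces contain $\mu.$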

Historical note. This adjustment to make the confidence interval a little longer when $\sigma$ is replaced by $S$ was proposed by W. S. Gosset. He wrote under the pseudonym "Student," because his employer (a brewery) did not want its competition to start using the improved confidence interval right away. (There are several versions of this story, as can happen with things that happened more than a century ago.)

It took several years to find the exact mathematical form of t distributions and then to compute the numbers found in printed t tables. If you continue on to theoretical courses in statistics, you may see the formula for the t distribution and perhaps see how it was developed. For now, using the printed t table does just what you need. Many statistical software programs compute the same information as in t tables. I got my value above from R as follows:

qt(.975, 15)   # 0.975 quantile of the t distribution with 15 degrees of freedom
[1] 2.13145
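
For readers curious where such a number comes from, the table value can be recovered from the density of the t distribution by numerical integration. This is only a sketch in Python (standard library), with hypothetical helper names `t_pdf`, `t_cdf`, and `t_quantile`; it uses Simpson's rule and bisection for illustration and is not a description of how R's `qt` is implemented.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: Simpson's rule on [0, x], plus 0.5 for the lower half."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(p, df, lo=0.0, hi=50.0):
    """Find q with P(T <= q) = p by bisection; assumes 0.5 <= p < 1."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

q = t_quantile(0.975, 15)
print(round(q, 5))
```

The result should agree with the R output above to several decimal places.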