I would like to understand a statement I found in a book about confidence intervals that is not clear to me.
We have two different situations, and I would like to understand whether the standard deviation differs between them and, if so, whether it is larger in case a or in case b.
Case a) We have 50 elements and I calculate the average ($\bar x$) and standard deviation (σ).
Case b) We take 5 samples of 10 elements each from the same population. We calculate $\bar x$ and σ for each sample, and then from these averages and standard deviations we calculate the global $\bar x$ and σ.
The average should be the same, but how does the standard deviation in Case b compare with that in Case a? And why?
Thank you in advance for any reply!
Giacomo
(a) One sample of size $n = 50$ from $\mathsf{Norm}(\mu = 60, \sigma = 7):$ We estimate $\mu$ by the sample mean $A_{50} =\bar X$ and $\sigma^2$ by the sample variance $S_{50}^2.$
Also, $49S_{50}^2/\sigma^2 \sim \mathsf{Chisq}(\nu=49).$ Then a 95% CI for $\sigma^2$ is of the form $\left(\frac{49S_{50}^2}{U},\, \frac{49S_{50}^2}{L}\right),$ where $L=31.55$ and $U=70.22$ cut probability $0.025$ from the lower and upper tails, respectively, of $\mathsf{Chisq}(\nu = 49).$ Take square roots of the endpoints to get a 95% CI for $\sigma.$
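As a concrete illustration, here is a minimal R sketch of this interval for one simulated sample. The seed and the variable names are arbitrary choices for illustration only.

```r
set.seed(2024)                           # arbitrary seed, only for reproducibility
x  <- rnorm(50, mean = 60, sd = 7)       # one sample of size n = 50
s2 <- var(x)                             # S_50^2, the sample variance

q <- qchisq(c(0.025, 0.975), df = 49)    # q[1] = L, q[2] = U
ci.var <- 49 * s2 / rev(q)               # (49 S^2 / U, 49 S^2 / L): 95% CI for sigma^2
ci.sd  <- sqrt(ci.var)                   # 95% CI for sigma

q; s2; ci.var; ci.sd
```

Note that the larger quantile $U$ goes in the denominator of the lower endpoint, so the interval always contains the point estimate $S_{50}^2.$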
(b) If you take five samples of size $n=10$ from this normal distribution you will have five variance estimates $S_1^2, S_2^2, \dots, S_5^2.$ Then you can "pool" these five variance estimates to get an estimate $S_p^2$ as below:
$$S_p^2 = \frac{9S_1^2 + 9S_2^2 + \cdots + 9S_5^2}{5(9)}.$$ Then $45S_p^2/\sigma^2\sim\mathsf{Chisq}(\nu=45).$ Because the sub-sample sizes $(10)$ are all equal, $S_p^2$ is just the average of the five $S_j^2,\ j=1,2,3,4,5.$
Then a 95% CI for $\sigma^2$ is of the form $\left(\frac{45S_p^2}{U},\,\frac{45S_p^2}{L}\right),$ where $L = 28.37$ and $U = 65.41$ cut probability $0.025$ from the lower and upper tails, respectively, of $\mathsf{Chisq}(\nu=45).$ Again, you can take square roots of the endpoints to get a CI for $\sigma.$
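The pooling step and the resulting interval can be sketched in R as follows. This is only an illustration on one simulated data set; the seed and the variable names are arbitrary.

```r
set.seed(2024)                                         # arbitrary seed
x <- matrix(rnorm(50, mean = 60, sd = 7), nrow = 10)   # columns = five samples of size 10

s2.j <- apply(x, 2, var)                 # the five sample variances S_1^2, ..., S_5^2
s2.p <- sum(9 * s2.j) / (5 * 9)          # pooled variance S_p^2 (45 degrees of freedom)

q <- qchisq(c(0.025, 0.975), df = 45)    # q[1] = L, q[2] = U
ci.var <- 45 * s2.p / rev(q)             # (45 S_p^2 / U, 45 S_p^2 / L): 95% CI for sigma^2
ci.sd  <- sqrt(ci.var)                   # 95% CI for sigma

s2.j; s2.p; ci.var; ci.sd
```

With equal sub-sample sizes, `s2.p` coincides with `mean(s2.j)`.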
In a one-factor ANOVA with five levels of the factor and ten replications at each level, this is the standard method of estimating the (assumed identical) $\sigma^2$ within each level. In an ANOVA, the means of the five levels have other uses, so this method is reasonable.
However, if you are just estimating $\sigma^2$ (or $\sigma),$ then pooling is not an efficient procedure. You have "lost" four degrees of freedom by making five estimates of $\mu$ (the five small-sample means) instead of only one (the mean of the one large sample). With fewer degrees of freedom, the CIs will tend to be longer if you use $S_p^2$ than if you use $S_{50}^2.$
Because $S_{50}^2$ and $S_p^2$ are both unbiased estimators of $\sigma^2 = 7^2 = 49,$ the two estimates will tend to be about the same size. However, $S_p^2$ (based on fewer degrees of freedom) will tend to be more variable.
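To put rough numbers on "more variable" (for normal data): if $\nu S^2/\sigma^2 \sim \mathsf{Chisq}(\nu),$ then, because a chi-squared random variable with $\nu$ degrees of freedom has variance $2\nu,$

$$\mathrm{Var}(S^2) = \frac{\sigma^4}{\nu^2}\,(2\nu) = \frac{2\sigma^4}{\nu}.$$

With $\sigma^2 = 49,$ this gives

$$\mathrm{SD}(S_{50}^2) = \sqrt{\frac{2(49)^2}{49}} = \sqrt{98} \approx 9.90, \qquad \mathrm{SD}(S_p^2) = \sqrt{\frac{2(49)^2}{45}} \approx 10.33.$$

So the difference is real but modest.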
This is illustrated by the two simulations in R below:
One large sample:
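A minimal sketch of such a simulation, assuming 20,000 replications and an arbitrary seed (both are illustrative choices, not values taken from the original listing):

```r
set.seed(1234)                 # arbitrary seed
m <- 20000                     # number of simulated experiments (assumed)

# For each experiment: one sample of size 50, then its sample variance
s2.50 <- replicate(m, var(rnorm(50, mean = 60, sd = 7)))

q <- qchisq(c(0.025, 0.975), df = 49)
len.50 <- 49 * s2.50 * (1 / q[1] - 1 / q[2])   # lengths of the 95% CIs for sigma^2

mean(s2.50)     # should be near sigma^2 = 49
sd(s2.50)       # theory for normal data: sqrt(2 * 49^2 / 49), about 9.90
mean(len.50)    # average CI length
```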
Five small samples:
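A minimal sketch of such a simulation, again assuming 20,000 replications and an arbitrary seed (illustrative choices, not values taken from the original listing):

```r
set.seed(1234)                 # arbitrary seed
m <- 20000                     # number of simulated experiments (assumed)

# For each experiment: five samples of size 10 (the columns of a 10 x 5 matrix),
# then the pooled variance, which here is the plain average of the five variances
s2.p <- replicate(m, {
  x <- matrix(rnorm(50, mean = 60, sd = 7), nrow = 10)
  mean(apply(x, 2, var))
})

q <- qchisq(c(0.025, 0.975), df = 45)
len.p <- 45 * s2.p * (1 / q[1] - 1 / q[2])     # lengths of the 95% CIs for sigma^2

mean(s2.p)      # should be near sigma^2 = 49
sd(s2.p)        # theory for normal data: sqrt(2 * 49^2 / 45), about 10.33
mean(len.p)     # average CI length
```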
Finally, the title of your question may contemplate looking at the variance of the five means of 10 observations. If the purpose is to estimate the population variance, this might be a catastrophically bad idea: these means have variance $\sigma^2/10,$ so you'd have to multiply their sample variance by $10$ to estimate the population variance. Such an estimate would be extremely variable (based on only 4 degrees of freedom).
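How variable this estimate is can be checked with a short simulation. The sketch below assumes 20,000 replications and an arbitrary seed, both chosen only for illustration.

```r
set.seed(1234)                 # arbitrary seed
m <- 20000                     # number of simulated experiments (assumed)

# For each experiment: five samples of size 10, their five means,
# and 10 times the sample variance of those means as an estimate of sigma^2
est <- replicate(m, {
  x <- matrix(rnorm(50, mean = 60, sd = 7), nrow = 10)
  10 * var(colMeans(x))
})

mean(est)       # should be near sigma^2 = 49 (the estimate is unbiased)
sd(est)         # theory for normal data: sqrt(2 * 49^2 / 4), about 34.65
```

The estimate is unbiased, but its standard deviation is more than three times that of either $S_{50}^2$ or $S_p^2.$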