Sigma of 50 element sample vs sigma of the average of 5 samples of 10 elements

I would like to understand a statement I found in a book, in its explanation of confidence intervals, which is not clear to me.

We have two different situations, and I would like to understand whether the standard deviation differs between them and, if so, whether it is bigger in case a or case b.

Case a) We have 50 elements and I calculate the average ($\bar x$) and standard deviation (σ).

Case b) We take 5 samples of 10 elements each from the same population. We calculate $\bar x$ and σ for each sample, and then from these averages and standard deviations we calculate the global $\bar x$ and σ.

The average should be the same, but how does the standard deviation in Case b compare with that in Case a? And why?

Thank you in advance for any reply!

Giacomo

Best answer

(a) Samples of size $n = 50$ from $\mathsf{Norm}(\mu = 60, \sigma = 7)$: We estimate $\mu$ by $A_{50} =\bar X$ and $\sigma^2$ by $S_{50}^2.$

Also, $49S_{50}^2/\sigma^2 \sim \mathsf{Chisq}(\nu=49).$ Then a 95% CI for $\sigma^2$ is of the form $\left(\frac{49S_{50}^2}{U},\, \frac{49S_{50}^2}{L}\right),$ where $L=31.55$ and $U=70.22$ cut probabilities $0.025$ from the lower and upper tails, respectively, of $\mathsf{Chisq}(\nu = 49).$ Take square roots of endpoints to get a 95% CI for $\sigma.$

qchisq(c(.025,.975), 49)
[1] 31.55492 70.22241
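As a minimal sketch of how this interval might be computed for one concrete sample, the R code below simulates 50 observations and applies the formula above. The seed and the variable names (`x`, `s2`, `ci.var`, `ci.sd`) are illustrative assumptions, not part of the original answer.

```r
# Sketch: 95% CI for sigma^2 and sigma from one sample of n = 50.
set.seed(2024)                              # arbitrary seed, chosen for illustration
x  <- rnorm(50, mean = 60, sd = 7)          # one simulated sample
n  <- length(x)
s2 <- var(x)                                # sample variance S_50^2
q  <- qchisq(c(0.025, 0.975), df = n - 1)   # L and U for Chisq(49)
ci.var <- (n - 1) * s2 / rev(q)             # divide by U for the lower end, by L for the upper end
ci.sd  <- sqrt(ci.var)                      # CI for sigma
ci.var
ci.sd
```

Because $L < 49 < U,$ the interval for $\sigma^2$ always contains the point estimate $S_{50}^2,$ but it is not symmetric around it.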

(b) If you take five samples of size $n=10$ from this normal distribution you will have five variance estimates $S_1^2, S_2^2, \dots, S_5^2.$ Then you can "pool" these five variance estimates to get an estimate $S_p^2$ as below:

$$S_p^2 = \frac{9S_1^2 + 9S_2^2 + \cdots + 9S_5^2}{5(9)}.$$

Then $45S_p^2/\sigma^2\sim\mathsf{Chisq}(\nu=45).$ Because the sub-sample sizes $(10)$ are all equal, $S_p^2$ is just the average of the five $S_j^2,$ $j=1,2,3,4,5.$

Then a 95% CI for $\sigma^2$ is of the form $\left(\frac{45S_p^2}{U},\,\frac{45S_p^2}{L}\right),$ where $L$ and $U$ cut probabilities $0.025$ from the lower and upper tails, respectively, of $\mathsf{Chisq}(\nu=45).$ Again here, you can take square roots of endpoints to get a CI for $\sigma.$

qchisq(c(.025,.975), 45)
[1] 28.36615 65.41016
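The pooled procedure can be sketched in the same way. The code below stores five simulated samples of size 10 as the columns of a matrix, pools their variances, and forms the interval with 45 degrees of freedom. The seed, the matrix layout, and the variable names are assumptions made for illustration.

```r
# Sketch: pooled variance and 95% CI from five samples of size 10.
set.seed(2024)                                    # arbitrary seed, chosen for illustration
x <- matrix(rnorm(50, mean = 60, sd = 7),
            nrow = 10, ncol = 5)                  # each column is one sample of 10
s2.j <- apply(x, 2, var)                          # the five sample variances S_j^2
df.p <- ncol(x) * (nrow(x) - 1)                   # 5 * 9 = 45 degrees of freedom
s2.p <- sum((nrow(x) - 1) * s2.j) / df.p          # pooled variance S_p^2
q    <- qchisq(c(0.025, 0.975), df = df.p)        # L and U for Chisq(45)
ci.var <- df.p * s2.p / rev(q)                    # CI for sigma^2
ci.sd  <- sqrt(ci.var)                            # CI for sigma
c(pooled = s2.p, average = mean(s2.j))            # equal, because the sample sizes are equal
ci.sd
```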

In a one-factor ANOVA with five levels of the factor and ten replications at each level, this is the standard method of estimating the (assumed identical) $\sigma^2$ within each level. In an ANOVA, the means of the five levels have other uses, so this method is reasonable.

However, if you are just estimating $\sigma^2$ (or $\sigma),$ then pooling is not an efficient procedure. You have "lost" four degrees of freedom by making five estimates of $\mu$ (from the five small-sample means) instead of only one (from the one large sample). With fewer degrees of freedom, the CIs will tend to be longer if you use $S_p^2$ than if you use $S_{50}^2.$
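One way to put a number on "longer" is to compare the CI lengths for $\sigma^2$ relative to the point estimate, which is $\nu(1/L - 1/U)$ for $\nu$ degrees of freedom. The helper `rel.len` below is a hypothetical name introduced only for this sketch.

```r
# Sketch: length of the 95% CI for sigma^2, in units of the variance estimate,
# for nu = 49 (one large sample) and nu = 45 (pooled estimate).
rel.len <- function(df, level = 0.95) {
  a <- (1 - level) / 2
  q <- qchisq(c(a, 1 - a), df = df)   # lower and upper chi-squared quantiles
  df * (1 / q[1] - 1 / q[2])          # (upper endpoint - lower endpoint) / estimate
}
len.49 <- rel.len(49)
len.45 <- rel.len(45)
c(df49 = len.49, df45 = len.45, ratio = len.45 / len.49)
```

The ratio comes out only a few percent above 1, so the penalty for pooling is real but modest.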

Because $S_{50}^2$ and $S_p^2$ are both unbiased estimators of $\sigma^2 = 7^2 = 49,$ the two estimates will tend to be about the same size. However, $S_p^2$ (based on fewer degrees of freedom) will tend to be more variable.

This is illustrated by the two simulations in R below:

One large sample:

set.seed(221)
m = 10^5;  v.50 = numeric(m)
for(i in 1:m) {
 v.50[i] = var(rnorm(50, 60, 7)) 
}
mean(v.50)
[1] 49.00551
var(v.50)
[1] 96.81574
sd(v.50)
[1] 9.839499

Five small samples:

set.seed(221)
m = 10^5;  v.p = numeric(m)
for(i in 1:m) {
 v.p[i] = mean(replicate(5, var(rnorm(10,60, 7)))) 
}
mean(v.p)
[1] 49.00486
var(v.p)
[1] 105.2702
sd(v.p)
[1] 10.26013
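These simulated values can be checked against theory. If $\nu S^2/\sigma^2 \sim \mathsf{Chisq}(\nu),$ then, because a chi-squared random variable with $\nu$ degrees of freedom has variance $2\nu,$

$$\mathrm{Var}(S^2) = \frac{\sigma^4}{\nu^2}\,(2\nu) = \frac{2\sigma^4}{\nu}.$$

With $\sigma = 7$ (so $\sigma^4 = 2401),$ this gives

$$\mathrm{Var}(S_{50}^2) = \frac{2(2401)}{49} = 98, \qquad \mathrm{Var}(S_p^2) = \frac{2(2401)}{45} \approx 106.7,$$

with standard deviations of about $9.90$ and $10.33,$ in rough agreement with the simulated values above.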

Finally, the title of your question may refer to the variance of the five means of 10 observations. If the purpose is to estimate the population variance, this might be a catastrophically bad idea: these means have variance $\sigma^2/10,$ so you'd have to multiply by $10$ to estimate the population variance. Such an estimate would be extremely variable (based on only 4 degrees of freedom).

set.seed(221)
m = 10^5;  v.a = numeric(m)
for(i in 1:m) {
 v.a[i] = var(replicate(5, mean(rnorm(10,60, 7)))) 
}
mean(10*v.a)
[1] 49.01282
var(10*v.a)
[1] 1208.783
sd(10*v.a)
[1] 34.76755
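The size of this variance is what theory predicts. Write $S_A^2$ for the sample variance of the five means (a symbol introduced here for convenience). Each mean is $\mathsf{Norm}(\mu, \sigma^2/10),$ so

$$\frac{4\,S_A^2}{\sigma^2/10} = \frac{4\,(10 S_A^2)}{\sigma^2} \sim \mathsf{Chisq}(\nu = 4),$$

and therefore

$$\mathrm{Var}(10 S_A^2) = \frac{2\sigma^4}{4} = \frac{2(2401)}{4} = 1200.5, \qquad \mathrm{SD}(10 S_A^2) \approx 34.65.$$

For comparison, the single-sample estimate $S_{50}^2$ has variance $2\sigma^4/49 = 98,$ so the estimate based on the five means is about $49/4 \approx 12$ times as variable.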