How do I intuitively understand that the variance of the population mean will always be greater than the variance of the sample mean?


I am currently trying to figure out why “n-1” is used in statistics. The first step, supposedly, is that the sample mean “minimizes” the variance of the observations, so that the variance of the observations about the sample mean can never be greater than their variance about the population mean. Apparently this is because the positive and negative distances between the observations and the sample mean add up to zero. The population mean, unless it is the same value as the sample mean, will be either to the left or to the right of the sample mean, thus causing an imbalance with a positive or negative sum. However, I’ve also been told that the reason the result for the population mean can’t be below the result for the sample mean is that the differences have to be squared anyway, which prevents any negative differences. But if this is true, how can the differences from the sample mean cancel out to zero? The only reason they cancel out is that, once you add them together, you are including both negative and positive values.

For example, let’s say I have the observations 2 and 6. The mean of this sample is 4. If I add up the deviations from the sample mean (-2 and +2), they cancel out to zero. But if the population mean is, say, 5, the sum of the deviations from the population mean is -2 (-3 and +1). -2 is less than 0, but that can’t be right if the variance about the population mean can only be greater than the variance about the sample mean. And if we square the deviations to ensure there are no negatives, that would also prevent the deviations from the sample mean from canceling out to zero (they would go from -2 and +2 to +2 and +2, which add to 4; or, if squared and added, 4 + 4 = 8).

I understand the case where the population mean is above the highest observation of the sample, as that would increase the distance from all the observations and make a larger sum. But what if the population mean is above the sample mean, but lower than the highest observation of the sample? The distance on one side has increased, but it has decreased on the other side, so wouldn’t these cancel or average out to be the same? Using the same numbers as above (2 and 6, with a sample mean of 4 and a population mean of 5): for the sample mean, the (positive) deviations (2 and 2) add to 4. For the population mean, which is one unit to the right, the deviations change to 3 and 1 (|2-5| and |6-5|), which also add to 4?

I would appreciate it if someone could explain in layman’s terms and not use all the Greek symbols and hieroglyphics, because that’s what got me into this mess. Thank you in advance.

There are 1 best solutions below

Consider a random sample $X_1, X_2, \dots, X_n$ from a population with $X_i \sim\mathsf{Norm}(\mu,\sigma).$ It is not necessarily true that the sample variance $S^2 > \sigma^2.$

Example, in R: The sample below has $S^2 = 123.35 < \sigma^2 = 15^2 = 225.$

set.seed(710)
x = rnorm(10, 100, 15); sd(x); var(x)
[1] 11.10642   # sample SD
[1] 123.3525   # sample variance

In general, $\frac{(n-1)S^2}{\sigma^2} \sim\mathsf{Chisq}(\nu=n-1),$ so $E(S^2) = \sigma^2;$ for the example above, $E(S^2) = 225.$ Here is a simulation of $100\,000$ such sample variances. The mean of the distribution of $S^2 = V$ is $225,$ with a large variance. So a considerable proportion of these sample variances takes values above 225: $P(S^2 > 225)\approx 0.44.$ (The distribution of $S^2$ is a multiple of a chi-squared distribution.)

set.seed(2021)
v = replicate(10^5, var(rnorm(10,100,15)))
mean(v); var(v)
[1] 225.4242
[1] 11359.52
mean(v > 225)
[1] 0.43901
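
The simulated values can be compared with what the chi-squared result predicts. The following is only a sketch, assuming normal data as in the simulation; the names `p_exact` and `var_exact` are introduced here for illustration and are not part of the original code.

```r
# Sketch: exact counterparts of the simulated quantities, assuming normal data.
n <- 10       # sample size used in the simulation
sigma <- 15   # population SD used in the simulation

# Since (n-1) S^2 / sigma^2 ~ Chisq(n-1),
# P(S^2 > sigma^2) = P(Chisq(n-1) > n-1).
p_exact <- pchisq(n - 1, df = n - 1, lower.tail = FALSE)

# Var(Chisq(n-1)) = 2(n-1), so Var(S^2) = 2 sigma^4 / (n-1).
var_exact <- 2 * sigma^4 / (n - 1)

p_exact; var_exact
```

Both numbers should be close to the simulated `mean(v > 225)` and `var(v)`, with small differences due to simulation error.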

hist(v, prob=T, col="skyblue2")
 abline(v=225, lwd=2, col="red")
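
To connect this with the numbers in the question, here is a small sketch of the fact that the sum of squared deviations is smallest when taken about the sample mean. The helper `ss` and the grid `centers` are hypothetical names chosen for this illustration. Plain deviations cancel about the sample mean and absolute deviations can tie, but squared deviations do not.

```r
# Sketch: squared deviations are minimized at the sample mean.
x <- c(2, 6)                                 # observations from the question
ss <- function(center, obs = x) sum((obs - center)^2)

sum(x - mean(x))       # plain deviations about the sample mean cancel
sum(abs(x - mean(x)))  # absolute deviations about 4: 2 + 2
sum(abs(x - 5))        # absolute deviations about 5: 3 + 1 (a tie)
ss(mean(x))            # squared deviations about 4: 4 + 4
ss(5)                  # squared deviations about 5: 9 + 1 (larger)

# Any other center gives at least as large a sum of squares.
centers <- seq(2, 6, by = 0.5)
all(sapply(centers, ss) >= ss(mean(x)))

# Same comparison for the normal sample used earlier (population mean 100).
set.seed(710)
y <- rnorm(10, 100, 15)
ss(mean(y), y) <= ss(100, y)
```

Because the sum of squares about the sample mean is never larger than the sum about the population mean, dividing it by $n$ tends to underestimate $\sigma^2$; dividing by $n-1$ compensates on average, which is consistent with $E(S^2) = \sigma^2$ above.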
