The relationship between sample variance and proportion variance?


I'm trying to see the relationship between the sample variance equation

$\sum(X_i- \bar X)^2/(n-1)$ and the variance estimate $\bar X(1-\bar X)$ in the case of binary (0-1) samples.

I wonder whether the two give the same output and, if not, what the relationship between them is.

I'm trying to prove the relationship, but I'm finding it quite challenging.

Please help!



There are 2 best solutions below

0
On

I suppose your question is whether the two formulas give the same answer for binary data. Here is an example to illustrate that they are almost the same, but not exactly.

Suppose I have a sample of a thousand zeros and ones in which there are 283 ones. Then $\bar X = 283/1000 = 0.283.$ Thus, $\bar X(1-\bar X) = 0.283(1 - 0.283) = 0.202911.$

An alternate general formula for the sample variance of values $X_i$ is

$$S^2 = \frac{\sum_{i=1}^n X_i^2 - n \bar X^2}{n-1}.$$

In a binary sample $\sum_{i=1}^n X_i^2 = \sum_{i=1}^n X_i$, because $0^2 = 0$ and $1^2 = 1.$
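
Spelling that substitution out for a general binary sample may help. This is only a sketch, and it uses nothing beyond $\sum_{i=1}^n X_i = n\bar X$:

$$S^2 = \frac{\sum_{i=1}^n X_i - n \bar X^2}{n-1} = \frac{n\bar X - n \bar X^2}{n-1} = \frac{n}{n-1}\,\bar X(1-\bar X).$$

So for 0-1 data the two quantities differ only by the factor $n/(n-1)$, which is close to 1 when $n$ is large.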

Thus, the general formula gives $S^2 = \frac{283 - 1000(0.283)^2}{999} = 0.2031141.$ If (as in the Comment by @A.S) the denominator were $n = 1000$ instead of $n-1=999,$ this would simplify to $$S^2 = 0.283 - 0.283^2 = 0.283(1 - 0.283) = \bar X(1- \bar X).$$

The formula for the population variance is often written with the population size $n$ in the denominator.
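
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the example above. The variable names (`sample`, `plug_in`, `s2`) are just my own labels, and the sample is built by hand rather than drawn at random.

```python
# Binary sample from the example: 1000 values, 283 of them equal to 1.
n, ones = 1000, 283
sample = [1] * ones + [0] * (n - ones)

xbar = sum(sample) / n

# Usual sample variance, with n - 1 in the denominator.
s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)

# The binary-data formula Xbar(1 - Xbar).
plug_in = xbar * (1 - xbar)

print(f"Xbar(1 - Xbar)    = {plug_in:.6f}")             # 0.202911
print(f"S^2 (n - 1)       = {s2:.7f}")                  # 0.2031141
print(f"S^2 * (n - 1) / n = {s2 * (n - 1) / n:.6f}")    # 0.202911
```

The last line rescales $S^2$ to the denominator $n$, which recovers $\bar X(1-\bar X)$ up to floating-point rounding.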

0
On

The first quantity is the standard variance estimator, which is unbiased for i.i.d. samples from any distribution with finite variance.

The second quantity is a simplified formula (the simplification being valid only for 0-1 binary data) for calculating exactly, not estimating, the variance of the sample.

Using the second instead of the first to estimate the distribution variance will, on average, underestimate it slightly: its expected value is $(n-1)/n$ times the true variance. This is equivalent to using $n$ instead of $n-1$ in the denominator of the estimator.
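
The size of that underestimate can be checked exactly for a small case. The sketch below is an illustration only: it assumes i.i.d. Bernoulli($p$) data, sums over every possible count of ones with binomial weights, and relies on the fact that for 0-1 data $S^2 = \frac{n}{n-1}\bar X(1-\bar X)$. The function name `expected_estimates` and the values $n = 10$, $p = 0.3$ are my own choices.

```python
from math import comb


def expected_estimates(n, p):
    """Return (E[Xbar(1 - Xbar)], E[S^2]) for n i.i.d. Bernoulli(p) draws."""
    e_plug = 0.0
    e_s2 = 0.0
    for k in range(n + 1):  # k = number of ones in the sample
        prob = comb(n, k) * p**k * (1 - p) ** (n - k)  # binomial probability of k ones
        xbar = k / n
        plug = xbar * (1 - xbar)
        e_plug += prob * plug
        # For 0-1 data, S^2 equals n / (n - 1) times Xbar(1 - Xbar).
        e_s2 += prob * plug * n / (n - 1)
    return e_plug, e_s2


n, p = 10, 0.3
e_plug, e_s2 = expected_estimates(n, p)
print(f"true variance p(1-p): {p * (1 - p):.4f}")  # 0.2100
print(f"E[Xbar(1-Xbar)]:      {e_plug:.4f}")       # 0.1890
print(f"E[S^2]:               {e_s2:.4f}")         # 0.2100
```

Here the expected value of $\bar X(1-\bar X)$ is $0.189 = \frac{9}{10}\cdot 0.21$, while $S^2$ hits the true variance $0.21$ on average.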