Unbiased estimator of the variance with known population size


The variance is defined as

$$\sigma^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n}$$

where $\bar x = \frac{\sum_{i=1}^n x_i}{n}$.

To estimate this parameter from a sample, one must instead use

$$s^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1}$$

because the formula for $\sigma^2$, applied directly to a sample, underestimates the population variance on average.

$s^2$ is an unbiased estimator of $\sigma^2$ only if sampling is with replacement (which is not the case in the model of interest here) or if the population is infinite. Let $N$ be the size of the population ($n$ being the size of the sample). In the extreme case $n=N$, where every individual is sampled, $s^2$ is definitely a biased estimator of the population variance.

What is an unbiased estimator of the variance of the population from a sample knowing the population size $N$?
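As a sanity check on the bias claim, here is a minimal Python sketch; the four-element population `[1, 2, 3, 4]` and sample size $n=2$ are arbitrary choices. Enumerating every equally likely sample drawn without replacement shows that the average of $s^2$ over all samples is $5/3$, not the population variance $1.25$.

```python
from itertools import combinations
from statistics import mean

pop = [1, 2, 3, 4]                    # hypothetical small population
N = len(pop)
mu = mean(pop)
sigma2 = sum((x - mu) ** 2 for x in pop) / N   # population variance = 1.25

n = 2
s2_values = []
for sample in combinations(pop, n):   # all equally likely samples w/o replacement
    ybar = mean(sample)
    s2_values.append(sum((y - ybar) ** 2 for y in sample) / (n - 1))

print(mean(s2_values))                # 5/3, not sigma2: s^2 is biased here
```

Note the direction of the bias: without replacement, $s^2$ overestimates $\sigma^2$ by the factor $N/(N-1)$ derived in the accepted answer below.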

There are 3 answers below.

BEST ANSWER

Let's assume you have a population of size $N$ with values $x_1,\ldots,x_N$, mean $\bar x=\frac{1}{N}\sum_{i=1}^N x_i$ and variance $\sigma^2=\frac{1}{N}\sum_{i=1}^N(x_i-\bar x)^2$. (Note that I use lower case $x_i$ to indicate these are not random, but fixed values.)

Now, let's take a random sample $Y_1,\ldots,Y_n$ of $n$ elements (without replacement), with all such subsets equally likely. (Now I use capital $Y$ to indicate these are random.)

Now, $\bar Y=\frac{1}{n}\sum_{i=1}^n Y_i$ and let $V=\sum_{i=1}^n (Y_i-\bar Y)^2$ so that the sample variance would be $V/n$ (like the expression for $\sigma^2$). If we write $V$ out in terms of $(Y_i-\bar x)^2$ and $(Y_i-\bar x)(Y_j-\bar x)$, we get $$ \begin{split} V =& \sum_{i=1}^n (Y_i-\bar Y)^2 = \sum_{i=1}^n \left[(Y_i-\bar x)-(\bar Y-\bar x)\right]^2 \\ =& \sum_{i=1}^n \left[(Y_i-\bar x)^2-2(Y_i-\bar x)(\bar Y-\bar x)+(\bar Y-\bar x)^2 \right] \\ =& \sum_{i=1}^n (Y_i-\bar x)^2 - n(\bar Y-\bar x)^2 \\ =& \left(1-\frac{1}{n} \right) \sum_{i=1}^n (Y_i-\bar x)^2 -\frac{2}{n}\sum_{1\le i<j\le n} (Y_i-\bar x)(Y_j-\bar x) \end{split} $$ where in the last step we use that $$ \left(\sum_{i=1}^n (Y_i-\bar x)\right)^2 = \sum_{i=1}^n (Y_i-\bar x)^2 + 2\sum_{1\le i<j\le n} (Y_i-\bar x)(Y_j-\bar x). $$

We know that $\text{E}[(Y_i-\bar x)^2]=\sigma^2$: since each $Y_i$ is marginally uniform on the population, this is just the average of $(x_i-\bar x)^2$ over $x_1,\ldots,x_N$.

For $i<j$, we can compute $\text{E}[(Y_i-\bar x)(Y_j-\bar x)]$ by using that this is the same as the average of $(x_i-\bar x)(x_j-\bar x)$ over all $1\le i<j\le N$. Since $\sum_{i=1}^N (x_i-\bar x)=0$, we get $$ 0 = \sum_{1\le i,j\le N} (x_i-\bar x)(x_j-\bar x) = \sum_{i=1}^N (x_i-\bar x)^2 + 2\sum_{1\le i<j\le N} (x_i-\bar x)(x_j-\bar x), $$ so the sum of the cross terms is $-\frac{N\sigma^2}{2}$, and averaging over the $\binom N2$ equally likely pairs gives $$ \text{E}\left[(Y_i-\bar x)(Y_j-\bar x)\right] = -\frac{\sigma^2}{N-1}. $$ Combining these results, we get $$ \text{E}[V] = \left(1-\frac1n\right)n\sigma^2 + \frac{2}{n}\binom n2 \frac{\sigma^2}{N-1} = (n-1)\sigma^2 + \frac{n-1}{N-1}\sigma^2 = \frac{(n-1)N}{N-1}\sigma^2 $$ giving an unbiased estimator $$ \hat\sigma^2 = \frac{N-1}{N(n-1)}V = \frac{N-1}{N(n-1)} \sum_{i=1}^n (Y_i-\bar Y)^2. $$

As $N\rightarrow\infty$, you get the familiar $s^2$ estimator which corresponds to independent sampling from a distribution, while $n=N$ gives just $\sigma^2$ as it should when the $x_i$ are known for the whole population.
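The correction can be checked exactly by enumeration. The sketch below uses a made-up five-element population and $n=3$; averaging $\hat\sigma^2 = \frac{N-1}{N(n-1)}V$ over every equally likely size-$n$ subset recovers $\sigma^2$ (up to float rounding).

```python
from itertools import combinations
from statistics import mean

pop = [2, 4, 4, 7, 9]                    # hypothetical population
N = len(pop)
xbar = mean(pop)
sigma2 = sum((x - xbar) ** 2 for x in pop) / N

n = 3
estimates = []
for sample in combinations(pop, n):      # every size-n subset is equally likely
    ybar = mean(sample)
    V = sum((y - ybar) ** 2 for y in sample)
    estimates.append((N - 1) / (N * (n - 1)) * V)

# The average over all samples equals the population variance exactly
print(mean(estimates), sigma2)
```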

ANSWER 2

The statement that $\displaystyle\sigma^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n}$ is true only if each value $x_i$ has the same probability $1/n$ and $x_1,\ldots,x_n$ includes every member of the population exactly once.

The usual argument for the proposition that $\displaystyle \frac{\sum_{i=1}^n (x_i-\bar x)^2 } {n-1}$ is an unbiased estimator of the population variance assumes $x_1,\ldots,x_n$ is an i.i.d. sample from the population, and does not require equal probability assigned to all elements. In effect you are using the letter $n$ to refer to two different things, and that confuses matters.

"Infinite population" should not be taken too literally. People say that in order to indicate that it's an i.i.d. sample. The point is that as the population size grows, the distribution of an i.i.d. sample and that of a sample without replacement approach each other.

Suppose the population has three equally probable members, and the values of the random variable for those three are $1,2,3.$

Then the population variance is $$ \sigma^2 = \frac{(1-2)^2 + (2-2)^2 + (3-2)^2} 3 = \frac 2 3. $$ Suppose a sample of size $n=2$ is taken. Then the following are the possible samples and the unbiased sample variances $\sum_{i=1}^2 (x_i-\bar x)^2/(2-1)$: $$ \begin{array} {c|l|l} \text{sample} & \bar x = \text{sample mean} & s^2 = \text{sample variance} \\ \hline 1, 1 & 1 & 0 \\ 1, 2 & 1.5 & 0.5 \\ 1, 3 & 2 & 2 \\ 2, 1 & 1.5 & 0.5 \\ 2, 2 & 2 & 0 \\ 2, 3 & 2.5 & 0.5 \\ 3, 1 & 2 & 2 \\ 3, 2 & 2.5 & 0.5 \\ 3, 3 & 3 & 0 \\ \hline \end{array} $$ Observe that the average of the nine possible sample variances is $2/3,$ thus the sample variance is an unbiased estimator of the population variance.
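The nine-row table above can be reproduced by brute force. This short Python sketch enumerates all nine equally likely i.i.d. samples and confirms that the average of $s^2$ is $2/3$, the population variance.

```python
from itertools import product
from statistics import mean

values = [1, 2, 3]                        # the three equally probable values
sigma2 = sum((v - 2) ** 2 for v in values) / 3   # population variance = 2/3

n = 2
s2_values = []
for sample in product(values, repeat=n):  # nine equally likely ordered samples
    ybar = mean(sample)
    s2_values.append(sum((y - ybar) ** 2 for y in sample) / (n - 1))

print(mean(s2_values))                    # 2/3: s^2 is unbiased under i.i.d. sampling
```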

This population may be an infinite one in which $1/3$ of the members have the value $1$ for this random variable, and $1/3$ of them have $2$ and $1/3$ of them have $3$, or it may be a population with only three members. Either way, all of the above holds.

One question I've answered a few times here is: Why divide by $n-1$? I won't go through that again here, but this concrete example gives us an opportunity to look at one aspect of that.

First, suppose in computing the sample variances, we had looked at deviations from the population mean rather than from the sample mean. Then dividing by $n=2$ rather than $n-1=1$ would make the average of the estimates of variance equal to the population variance.

Second, observe that some simple arithmetic applied to the nine cases will show you why use of the sample mean rather than the population mean makes the average estimate of variance smaller, unless you compensate by reducing the denominator from $n$ to $n-1$.
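Both observations can be verified on the same nine samples. In the sketch below, dividing squared deviations from the population mean $\mu=2$ by $n$ averages to the population variance $2/3$, while dividing squared deviations from the sample mean by $n$ averages to only $1/3$: the shortfall that dividing by $n-1$ repairs.

```python
from itertools import product
from statistics import mean

values = [1, 2, 3]
mu = 2                                    # population mean
n = 2

pop_mean_est, biased_est = [], []
for sample in product(values, repeat=n):  # nine equally likely ordered samples
    ybar = mean(sample)
    pop_mean_est.append(sum((y - mu) ** 2 for y in sample) / n)     # deviations from mu
    biased_est.append(sum((y - ybar) ** 2 for y in sample) / n)     # deviations from ybar

print(mean(pop_mean_est))  # 2/3: dividing by n works with the population mean
print(mean(biased_est))    # 1/3: sample-mean deviations are smaller on average
```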

ANSWER 3

$\newcommand{\e}{\operatorname{E}}$ The linear nature of expectation tells us that $$ \e\left( \sum_{i=1}^n (X_i - \bar X)^2 \right) = \sum_{i=1}^n \e \left( (X_i-\bar X)^2 \right) = n\e\left( (X_1- \bar X)^2 \right) \tag 0 $$ where the second equality follows from the fact that by symmetry, all of the expectations are equal.

So let us examine the expectation on the right side of the equality above. \begin{align} & \e\left( (X_i - \bar X)^2\right) = \e(X_i^2 - 2X_i \bar X + \bar X^2) = \e(X_i^2) -2\e(X_i\bar X) + \e(\bar X^2) \\[10pt] = {} & \e(X_i^2) - \frac 2 n \e(X_i(X_1 + \cdots + X_n)) + \frac 1 {n^2} \e\left( (X_1+\cdots+X_n) (X_1+\cdots+X_n) \right). \qquad \tag 1 \end{align}

So we need to know $\e(X_i^2)$ and $\e(X_i X_j)$ for $i\ne j.$

We have $\e(X_i^2) = \mu^2+\sigma^2.$ Next: $$ \e(X_i X_j) = \e(\e(X_i X_j \mid X_j)) = \e(X_j \e(X_i\mid X_j)), $$ so let us find $\e(X_i\mid X_j).$ \begin{align} \e(X_i\mid X_j=x) & = \frac{\text{sum of possible values except } x}{N-1} \\[10pt] & = \frac{\text{sum of all possible values}}{N-1} - \frac x {N-1} \\[10pt] & = \frac N {N-1} \mu - \frac x {N-1}, \end{align} so $$ \e(X_i\mid X_j) = \frac {N\mu - X_j} {N-1}. $$ Therefore $$ \e(X_i X_j) = \e(X_j \e(X_i \mid X_j)) = \e\left( X_j \frac{N\mu-X_j}{N-1} \right) = \frac{N\mu^2}{N-1} - \frac{\mu^2+\sigma^2}{N-1} = \mu^2 - \frac{\sigma^2}{N-1}. $$ Now we have \begin{align} \e(X_i^2) & = \mu^2+\sigma^2 \\[10pt] \e(X_i(X_1+\cdots+X_n)) & = (n-1)\left( \mu^2 - \frac{\sigma^2}{N-1} \right) + (\mu^2+\sigma^2) \\[10pt] & = n\mu^2 + \frac{N-n}{N-1} \sigma^2 \\[10pt] \e((X_1+\cdots+X_n)(X_1+\cdots+X_n)) & = n^2\mu^2 + n \frac{N-n}{N-1} \sigma^2 \end{align} Hence line $(1)$ above becomes \begin{align} & \Big(\mu^2 + \sigma^2 \Big) - \frac 2 n \left( n\mu^2 + \frac{N-n}{N-1} \sigma^2 \right) + \frac 1 {n^2} \left( n^2\mu^2 + n \frac{N-n}{N-1} \sigma^2 \right) \\[10pt] = {} & \frac{N(n-1)}{n(N-1)} \sigma^2. \end{align} Then line $(0)$ becomes $$ \frac{N(n-1)}{N-1} \sigma^2. $$ The factor by which line $(0)$ must be multiplied to get $\sigma^2$ is therefore $\displaystyle \frac{N-1}{N(n-1)}.$
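The key identity $\operatorname{E}(X_i X_j) = \mu^2 - \frac{\sigma^2}{N-1}$ can also be checked by exact enumeration: without replacement, every ordered pair of distinct draws is equally likely. The population `[1, 2, 6, 7]` below is a hypothetical choice.

```python
from itertools import permutations
from statistics import mean

pop = [1, 2, 6, 7]                       # hypothetical population
N = len(pop)
mu = mean(pop)
sigma2 = sum((x - mu) ** 2 for x in pop) / N

# All ordered pairs of distinct draws are equally likely without replacement
cross = mean(a * b for a, b in permutations(pop, 2))

print(cross, mu ** 2 - sigma2 / (N - 1))  # the two agree
```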