Estimator of $\mu - \mu^2$ when sampling without replacement

360 Views Asked by At

Question

Consider a population of known size $N$, from which we sample $n$ individuals without replacement and measure their trait value $X_i$. All traits values are bounded between $[0,1]$. Let $\mu = \frac{\sum_i^N x_i}{N}$ be the average trait value in the population. If it makes things easier, I am happy to consider that $X$ is boolean ($0$ or $1$).

What is the unbiased estimator for $\mu - \mu^2$?

Attempt

$$ \begin{align} \mu - \mu^2 &= \operatorname{E}[\bar X] - \operatorname{E}[\bar X]^2\\ &= \operatorname{E}[\bar X] - \operatorname{E}[\bar X]^2 - \operatorname{var}[\bar X] + \operatorname{var}[\bar X] \\ &= \operatorname{E}[\bar X] - \operatorname{E}[\bar X]^2 - \operatorname{var}[\bar X] + \frac{1}{n} \sigma^2 \\ &= \operatorname{E}[\bar X] - \operatorname{E}[\bar X^2] + \frac{1}{n} \operatorname{E}[\hat \sigma^2], \end{align}$$

where $\hat \sigma^2$ is the estimator for the variance. Looking at this post,

$$\hat\sigma^2 = \frac{N-1}{N(n-1)} \sum_{i=1}^n (X_i-\bar X)^2$$

Hence,

$$\mu - \mu^2 = \operatorname{E}\left[ \bar X - \bar X^2 + \frac{1}{n} \frac{N-1}{N(n-1)} \sum_{i=1}^n (X_i-\bar X)^2\right] $$

Test

Looked good to me so I tried it out (in R)

nbtrials = 5000      # number of independent samples
N = 20               # Number of individuals in the population
pop = rep(0:1, N/2)  # Create a boolean population with $\mu = 0.5$
n = 17               # Sample size

out = numeric(nbtrials)  # out will collect the estimates of $\mu - \mu^2$
for (trial in 1:nbtrials)
{
    s = sample(pop,size=n, replace=FALSE)   # Take a sample without replacement
    xbar = sum(s) / n                       # Compute the average frequency in the sample
    out[trial] = xbar - xbar^2 + 1/n * (N - 1) / (N*(n-1)) * sum((s - mean(s))^2)   # Compute the estimator
}

# Now compare the population with the average of estimations taken from the samples
trueMean=sum(pop) / length(pop)
print(paste("True value  = ",trueMean *(1-trueMean)))
print(paste("Average estimated value = ",mean(out)))

which outputs something like

"True value  =  0.25"
"Average estimated value =  0.262343771626298"

showing a clear systematic overestimation. Where did I go wrong?

Where does the mistake lie?

From @EinarRødland's comment, it seems likely that my mistake is in the equation $var[\bar X] = \frac{\sigma^2}{n}$, which is wrong due to the fact that sampling occurs without replacement from a finite population.

1

There are 1 best solutions below

0
On BEST ANSWER

The variance of the sampling distribution is not equal to the sample variance over the sample size when sampling is with replacement. From ma.utexas, the variance of the sampling distribution when sampling without replacement is

$$\operatorname{var}[\bar X] = \frac{s^2}{n} \left(1-\frac {n-1} {N-1} \right) = \frac{\sum (X_i - \bar X)^2}{n(n-1)} \left(1-\frac {n-1} {N-1} \right)$$

Therefore,

$$\mu - \mu^2 = \operatorname E\left[\bar X - \bar X^2 + \frac{\sum (X_i - \bar X)^2}{n(n-1)} \left(1-\frac {n-1} {N-1} \right)\right]$$

Thanks to @EinarRødland for pointing out where the mistake was.