Question
Consider a population of known size $N$, from which we sample $n$ individuals without replacement and measure their trait value $X_i$. All traits values are bounded between $[0,1]$. Let $\mu = \frac{\sum_i^N x_i}{N}$ be the average trait value in the population. If it makes things easier, I am happy to consider that $X$ is boolean ($0$ or $1$).
What is the unbiased estimator for $\mu - \mu^2$?
Attempt
$$ \begin{align} \mu - \mu^2 &= \operatorname{E}[\bar X] - \operatorname{E}[\bar X]^2\\ &= \operatorname{E}[\bar X] - \operatorname{E}[\bar X]^2 - \operatorname{var}[\bar X] + \operatorname{var}[\bar X] \\ &= \operatorname{E}[\bar X] - \operatorname{E}[\bar X]^2 - \operatorname{var}[\bar X] + \frac{1}{n} \sigma^2 \\ &= \operatorname{E}[\bar X] - \operatorname{E}[\bar X^2] + \frac{1}{n} \operatorname{E}[\hat \sigma^2], \end{align}$$
where $\hat \sigma^2$ is the estimator for the variance. Looking at this post,
$$\hat\sigma^2 = \frac{N-1}{N(n-1)} \sum_{i=1}^n (X_i-\bar X)^2$$
Hence,
$$\mu - \mu^2 = \operatorname{E}\left[ \bar X - \bar X^2 + \frac{1}{n} \frac{N-1}{N(n-1)} \sum_{i=1}^n (X_i-\bar X)^2\right] $$
Test
Looked good to me so I tried it out (in R)
nbtrials = 5000 # number of independent samples
N = 20 # Number of individuals in the population
pop = rep(0:1, N/2) # Create a boolean population with $\mu = 0.5$
n = 17 # Sample size
out = numeric(nbtrials) # out will collect the estimates of $\mu - \mu^2$
for (trial in 1:nbtrials)
{
s = sample(pop,size=n, replace=FALSE) # Take a sample without replacement
xbar = sum(s) / n # Compute the average frequency in the sample
out[trial] = xbar - xbar^2 + 1/n * (N - 1) / (N*(n-1)) * sum((s - mean(s))^2) # Compute the estimator
}
# Now compare the population with the average of estimations taken from the samples
trueMean=sum(pop) / length(pop)
print(paste("True value = ",trueMean *(1-trueMean)))
print(paste("Average estimated value = ",mean(out)))
which outputs something like
"True value = 0.25"
"Average estimated value = 0.262343771626298"
showing a clear systematic overestimation. Where did I go wrong?
Where does the mistake lie?
From @EinarRødland's comment, it seems likely that my mistake is in the equation $var[\bar X] = \frac{\sigma^2}{n}$, which is wrong due to the fact that sampling occurs without replacement from a finite population.
The variance of the sampling distribution is not equal to the sample variance over the sample size when sampling is with replacement. From ma.utexas, the variance of the sampling distribution when sampling without replacement is
$$\operatorname{var}[\bar X] = \frac{s^2}{n} \left(1-\frac {n-1} {N-1} \right) = \frac{\sum (X_i - \bar X)^2}{n(n-1)} \left(1-\frac {n-1} {N-1} \right)$$
Therefore,
$$\mu - \mu^2 = \operatorname E\left[\bar X - \bar X^2 + \frac{\sum (X_i - \bar X)^2}{n(n-1)} \left(1-\frac {n-1} {N-1} \right)\right]$$
Thanks to @EinarRødland for pointing out where the mistake was.