Standard error on subsample


Suppose that I know that $N$ samples $(x_1, x_2, \ldots, x_N)$ are iid draws from a distribution with known variance $\sigma^2$. I also observe only the first $k \ll N$ samples, and estimate the mean on these $k$ samples. How 'good' is this estimate, i.e. what variance / standard error does it have?

My approach is as follows. Denote the mean estimated on the first $k$ samples by $\mu_k$. Since I know that there were $N$ samples, my mean on the full set of observations would be $$\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i = \frac{1}{N} \left(k \mu_k + \sum_{j=k+1}^N x_j \right).$$

I can then write the change in my mean estimate as $$ \mu_k - \bar{x} = \frac{N \mu_k - k \mu_k - \sum_{j=k+1}^N x_j}{N} = \frac{N-k}{N} \mu_k - \frac{1}{N} \sum_{j=k+1}^N x_j$$ and the variance is $$ \operatorname{Var}(\mu_k - \bar{x}) = \frac{N-k}{N^2} \, \sigma^2,$$ where I used that the estimated mean $\mu_k$ is already known.
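The formula can be checked by simulation. The sketch below is a Monte Carlo check under illustrative assumptions (a normal distribution with $N=100$, $k=20$, $\sigma=2$, none of which come from the question). Since the derivation conditions on $\mu_k$, the first $k$ samples are drawn once and held fixed, while only the unseen $N-k$ samples vary across trials:

```python
import numpy as np

# Monte Carlo check of Var(mu_k - xbar | mu_k) = (N - k) * sigma^2 / N^2.
# Distribution (normal), N, k and sigma are illustrative choices.
rng = np.random.default_rng(0)
N, k, sigma = 100, 20, 2.0
trials = 200_000

first_k = rng.normal(0.0, sigma, size=k)       # observed once, then held fixed
mu_k = first_k.mean()

# Only the unseen N - k samples are resampled across trials.
rest = rng.normal(0.0, sigma, size=(trials, N - k))
xbar = (first_k.sum() + rest.sum(axis=1)) / N  # full-sample mean per trial

empirical = np.var(mu_k - xbar)
theoretical = (N - k) * sigma**2 / N**2        # = 0.032 here

print(empirical, theoretical)
```

Holding the first $k$ samples fixed matters here: resampling them as well would estimate the unconditional variance of $\mu_k - \bar{x}$, which is a different quantity.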

What I find puzzling about this result (assuming that it is correct) is the following: if I fix $k$, the number of samples I use out of $N$, then the variance is decreasing in $N$. But shouldn't my uncertainty about the estimated mean be increasing in the number of samples that I have not seen yet ($N-k$)?

Best answer

I would do this with slightly different notation, saving $\mu$ for the mean of the population distribution and capital letters for random variables. Let $\bar x_k$ be the observed mean of the $k$ samples, $\bar X_n$ the unknown mean of all $n$ samples, and $\bar X_{n-k}$ the mean of the unseen values, so that $\bar X_n = \frac kn \bar x_k +\frac{n-k}{n} \bar X_{n-k}$ and hence $\bar x_k - \bar X_n =\frac{n-k}{n} \bar x_{k}-\frac{n-k}{n} \bar X_{n-k}$. The variance of $\bar X_{n-k}$ is $\frac{\sigma^2}{n-k}$, so the variances of $\frac{n-k}{n} \bar X_{n-k}$ and of $\bar x_k - \bar X_n$, given the value of $\bar x_k$, are both $\frac{n-k}{n^2}\sigma^2$. So I agree with your result.

This is decreasing as $n$ increases for $n>2k$ as you say, though increasing when $k < n < 2k$.
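This turning point at $n=2k$ is easy to confirm numerically. A minimal check, with $k=10$ as an arbitrary illustrative value:

```python
# f(n) = (n - k) / n**2 is Var(xbar_k - Xbar_n | xbar_k) in units of sigma^2.
# It should rise for k < n < 2k and fall for n > 2k, peaking at n = 2k.
k = 10  # illustrative value, not from the answer
f = {n: (n - k) / n**2 for n in range(k + 1, 5 * k)}
n_peak = max(f, key=f.get)
print(n_peak)  # -> 20, i.e. n = 2k
```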

This is what you might reasonably expect:

  • $\mathbb E\left[\bar x_k - \bar X_n\right] = \frac{n-k}{n} (\bar x_{k}- \mu)$. It is generally not zero, and (given $\bar x_k$) its absolute value is increasing in $n$.

  • When $n$ is close to $k$, $\bar X_n$ is close to $\bar x_{k}$ because most of $\bar X_n$ is determined by $\bar x_{k}$, and the variance of $\bar x_k - \bar X_n $ is low.

  • When $n$ is large, and much larger than $k$, then $\bar X_n$ is likely to be closer to $\mu$ than to $\bar x_{k}$. The variance of $\bar x_k - \bar X_n$ is low (a little less than $\frac{\sigma^2}{n}$) and falling as $n$ increases, as is usual with ever larger sample sizes. This does not mean that $\bar x_k$ is an increasingly accurate estimate of $\bar X_n$ as $n$ increases: it usually gets worse, because $\mathbb E\left[\bar x_k - \bar X_n\right]$ is increasing in magnitude as $n$ increases, heading towards $\bar x_k -\mu$.

In a sense this is a variant of the bias-variance tradeoff. You can combine these results and look at the expected square of the error, $$\mathbb E\left[(\bar x_k - \bar X_n)^2\right] = \left(\frac{n-k}{n}\right)^2(\bar x_{k}-\mu)^2 +\frac{n-k}{n^2}\sigma^2,$$ which is not necessarily monotonic in $n$; but if $\bar x_{k}\neq \mu$ then the first (bias) term will eventually dominate and the sum heads towards $(\bar x_{k}-\mu)^2$.
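The non-monotonicity can be seen directly by evaluating the MSE formula. The values of $k$, $\sigma$, and the bias $\bar x_k - \mu$ below are assumptions chosen to make the effect visible (a small bias lets the variance term dominate first):

```python
# Conditional MSE from the answer:
#   E[(xbar_k - Xbar_n)^2 | xbar_k]
#     = ((n - k)/n)^2 * (xbar_k - mu)^2 + (n - k)/n^2 * sigma^2.
# k, sigma and bias (= xbar_k - mu) are illustrative assumptions.
k, sigma, bias = 10, 1.0, 0.05

def mse(n):
    return ((n - k) / n) ** 2 * bias**2 + (n - k) / n**2 * sigma**2

ns = range(k + 1, 100_001)
n_peak = max(ns, key=mse)          # MSE rises, peaks near n = 2k, then falls
print(n_peak, mse(n_peak), mse(100_000), bias**2)
```

With these values the MSE first increases, peaks shortly after $n=2k$, then decreases towards the limit $(\bar x_k - \mu)^2$; with a larger bias it can instead increase monotonically towards that limit.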