Nonparametric Bootstrap Confidence Interval for $\text{Var}(\overline X)$


Let $\overline X$ denote the sample mean. If we want to find its variance, we have $\text{var}(\overline X) = \sigma^2/n$. If we do not know $\sigma^2$, we can instead use:

$$\hat \sigma^2 \ \ = \ \ \tfrac{1}{n-1} \sum_{i=1}^{\, n} (X_i - \overline X)^2$$

My question is why are we going through bootstrapping to estimate $\text{var } \overline X$ when there is already a nice and neat expression? In bootstrapping, we re-sample from the original data set to obtain:

$$\tfrac{1}{m} \sum_{i=1}^{\, m} \left( \, \overline X_i - \tfrac{1}{m} \sum_{k=1}^{\, m} \overline X_k \, \right)^2$$

This expression involves $m$ copies of $(X_1, X_2, \ldots, X_n)$ whereas in the first formula, I have only one copy of $(X_1, X_2, \ldots, X_n)$.
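As a concrete illustration of the two estimates in the question, here is a minimal sketch using synthetic (hypothetical) exponential data and an arbitrary seed; the sample size, scale, and number of resamples are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=50)   # hypothetical sample, n = 50
n = len(x)

# Closed-form estimate: var(X-bar) estimated by sigma-hat^2 / n
closed_form = x.var(ddof=1) / n

# Bootstrap estimate: empirical variance of the means of m resamples,
# i.e. the second displayed expression (with the 1/m convention)
m = 10_000
boot_means = np.array([rng.choice(x, size=n, replace=True).mean()
                       for _ in range(m)])
bootstrap_est = boot_means.var()

print(closed_form, bootstrap_est)
```

For a single point estimate the two numbers typically come out close, which is exactly the questioner's point; the answer below explains what the extra computation buys.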

Best answer:

The reason for bootstrapping would ordinarily be to get a confidence interval for $Var(\bar X) = \sigma^2/n,$ where $\bar X$ is based on a random sample of size $n$ from a population with variance $\sigma^2.$

You do not know the value of $\sigma^2$, so you can't compute the exact value of $Var(\bar X) = \sigma^2/n.$ As you say, you can estimate $\sigma^2$ by $S^2 = \hat \sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2.$ [Often $\hat \sigma^2$ is written as $S^2.$] But you don't know how close the estimate $\hat \sigma^2$ is to $\sigma^2$ itself.

If you knew that the data were normal, then you could use the fact that $$\frac{(n-1)S^2}{\sigma^2} =\frac{(n-1)\hat\sigma^2}{\sigma ^2} \sim \mathsf{Chisq}(n-1)$$ to get a 95% confidence interval for $\sigma ^2$ of the form $$\left(\frac{(n-1)S^2}{U},\frac{(n-1)S^2}{L}\right),$$ where $L$ cuts 2.5% of the probability from the lower tail of $\mathsf{Chisq}(n-1)$ and $U$ cuts 2.5% of the probability from its upper tail, so that $$P\left(L < \frac{(n-1)S^2}{\sigma^2} < U\right) = P\left(\frac 1 U < \frac{\sigma^2}{(n-1)S^2} < \frac 1 L\right)\\ = P\left(\frac{(n-1)S^2}{U} < \sigma^2 < \frac{(n-1)S^2}{L}\right) = 0.95$$
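Under the normality assumption, the chi-squared interval above can be computed directly. A sketch, again with synthetic data (the mean, scale, and sample size are illustrative assumptions):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=3.0, size=40)  # hypothetical normal sample
n = len(x)
s2 = x.var(ddof=1)                            # S^2

# L and U cut 2.5% of the probability from the lower and upper
# tails of Chisq(n-1), respectively
L = chi2.ppf(0.025, df=n - 1)
U = chi2.ppf(0.975, df=n - 1)

# 95% CI for sigma^2: ((n-1)S^2/U, (n-1)S^2/L)
ci_sigma2 = ((n - 1) * s2 / U, (n - 1) * s2 / L)
print(ci_sigma2)
```

Since $L < n-1 < U$, the interval always straddles the point estimate $S^2$; dividing both endpoints by $n$ gives the corresponding interval for $\sigma^2/n$.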

For data that are not known to be normal, these relationships are not necessarily true. The bootstrap 're-sampling' procedure is a computationally intensive way to get values $L$ and $U$ that apply to the distribution from which your data $\mathbf{x} =(X_1, X_2, \dots, X_n)$ were randomly sampled.

Specifically, by taking $m$ 're-samples', each of size $n$ drawn with replacement from $\mathbf{x}$, and computing the sample variance of each, you get a collection of $m$ bootstrap estimates of $\sigma^2.$ Then by taking quantiles 0.025 and 0.975 of that collection of $m$ values, you can get good estimates of the limits corresponding to $L$ and $U$ for your data, and thence approximate CIs for $\sigma^2$ and $\sigma^2/n.$
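The percentile version of this procedure can be sketched as follows; the non-normal (exponential) population, seed, and resample count are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=60)   # non-normal data, n = 60
n, m = len(x), 10_000

# m resample variances S*^2, one per bootstrap sample of size n
boot_vars = np.array([rng.choice(x, size=n, replace=True).var(ddof=1)
                      for _ in range(m)])

# Quantiles 0.025 and 0.975 of the bootstrap distribution
lo, hi = np.quantile(boot_vars, [0.025, 0.975])
print((lo, hi))          # approximate 95% CI for sigma^2
print((lo / n, hi / n))  # divide by n for a CI for Var(X-bar) = sigma^2/n
```

This is the simple percentile bootstrap; refinements such as pivotal or BCa intervals exist but are beyond the scope of this answer.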

Note: If you bootstrap normal data, you will not get exactly the same CI as from using the chi-squared distribution. There are two reasons. First, the bootstrap procedure is only an approximation by simulation. Second, the bootstrap procedure does not use the information that the data are normal; that information is valuable and helps get a more accurate CI. Your bootstrap is sometimes called 'nonparametric' to stress that no assumption about the population being normal (or of any other particular type) is used.