Bootstrap method & Confidence Interval


I'm trying to figure out how this method works. My data:

  • 1000 samples from unknown distribution.
  • I need to create 40 vectors from those 1000 samples (each vector with 20 samples)
  • For every one of the 40 vectors, I need to do the bootstrap method for:
    • Finding the confidence interval ($\alpha$ = 0.05) in three methods: t, quantiles & normal.
    • We need the confidence interval for the standard deviation.

(R language)

My way until now:

  • I've created these 40 vectors (each one with 20 samples).
  • Let's say that the bootstrap constant is 1000.
  • What is actually the process of "doing bootstrap" for each vector with 20 samples? How can we create a confidence interval for this vector in each one of these methods I've mentioned?

I will be glad for any help.


There are 2 best solutions below

2
On

@BruceET

I cannot comment because I had to reset my account. Anyway, I'm interested only in a CI for the population standard mean. By 3 methods, I meant that there are 3 options for calculating the CI (one with quantiles, one with the t distribution, and one with the normal distribution).

I would like to provide R code, but my challenge is precisely to understand what the code should be :)

0
On

Here is one example of finding a 95% nonparametric bootstrap confidence interval for the population standard deviation (SD) $\sigma,$ based on a sample x of size $n = 20$ from an unknown population distribution.

 x
 [1] 240 314 354 183 321 325 271 273 272 255
[11] 276 250 261 303 348 294 274 254 258 421
s.obs = sd(x);  s.obs
50.67365

If we knew the population distribution, we could find the distribution of the ratio $R = S/\sigma$ based on the population distribution. Then we could find values $L$ and $U$ that cut 2.5% from the lower and upper tails, respectively, of the distribution of $R$ so that

$$0.95 = P(L \le R \le U) = P\left(\frac{S}{U} \le \sigma \le \frac{S}{L}\right),$$

where $S$ is the sample SD of the sample of $n = 20.$ Then the desired 95% CI would be $(S/U,\,S/L).$
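
To make this idea concrete, here is a minimal sketch for a hypothetical case in which the population is known. The exponential population with rate 1 (so that $\sigma = 1$), the number m of simulated samples, the seed, and the names r.sim, L.sim and U.sim are assumptions made only for illustration; they are not part of the problem at hand.

```r
# Hypothetical known population: exponential with rate 1, so sigma = 1.
set.seed(2024)
n = 20;  m = 10000;  sigma = 1
# Simulate m samples of size n and record the ratio R = S/sigma for each.
r.sim = replicate(m, sd(rexp(n, rate = 1)) / sigma)
# L and U cut 2.5% from the lower and upper tails of the distribution of R.
L.sim = quantile(r.sim, .025);  U.sim = quantile(r.sim, .975)
c(L.sim, U.sim)
```

For an observed SD $S$ from such a population, the interval would then be $(S/U,\,S/L)$ with these simulated values of $L$ and $U$.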

However, we do not know the distribution of $R$ and we seek to estimate $L$ and $U$ by using a bootstrap method.

Entering the so-called bootstrap world, we take $B = 1000$ re-samples from x, each of them a re-sample of size $n = 20$ taken with replacement from x. Temporarily, we take the observed SD ($S_{obs} = 50.67365$) as a proxy for the unknown population SD $\sigma;$ that is, $\sigma^* = 50.67365.$ Then, for each of the $B$ re-samples, we find $R^* = S^*/\sigma^*.$ We find quantiles .025 and .975 of the $B$ values $R^*$ as estimates $L^*$ of $L$ and $U^*$ of $U,$ respectively. [Notice that quantities referring to re-sampling are denoted by $*$'s.]

Back in the real world, we find the 95% nonparametric bootstrap CI of $\sigma$ as $(S_{obs}/U^*, S_{obs}/L^*).$ [Here $S_{obs}$ returns to its original role as the observed SD of our sample x.]

The R code for this procedure follows. In the code we use the suffix .re instead of $*$.

B = 1000;  n = length(x);  sg.re = s.obs;  r.re = numeric(B)
for (i in 1:B) {
   x.re = sample(x, n, replace=TRUE);  s.re = sd(x.re)   # re-sample with replacement
   r.re[i] = s.re/sg.re  }                                # bootstrap ratio R*
L.re = quantile(r.re, .025);  U.re = quantile(r.re, .975)
LCL = s.obs/U.re;  UCL = s.obs/L.re
c(LCL, UCL)
   97.5%     2.5% 
37.33546 88.26224 

So the 95% nonparametric bootstrap CI for $\sigma$ is $(37.3,\,88.3).$ Because this is a simulation procedure, subsequent runs may give slightly different results. My second run of the program above gave slightly different results that still round to $(37.3,\,88.3).$
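
The question asks for three kinds of interval (quantiles, normal, and t). As a rough sketch, and assuming these refer to the percentile interval, a normal-approximation interval, and a simple t-based interval that both use the bootstrap standard error, they could be computed from the same kind of re-samples as shown here. That reading of the three methods is an assumption, as are the seed and the names s.boot, se.boot, ci.quant, ci.norm and ci.t; a course may define the t method differently (for example, as a studentized bootstrap).

```r
# Data from the example above.
x = c(240, 314, 354, 183, 321, 325, 271, 273, 272, 255,
      276, 250, 261, 303, 348, 294, 274, 254, 258, 421)
set.seed(2024)
B = 1000;  n = length(x);  s.obs = sd(x)
# SDs of B re-samples of size n, each drawn with replacement from x.
s.boot = replicate(B, sd(sample(x, n, replace = TRUE)))
se.boot = sd(s.boot)    # bootstrap estimate of the standard error of S
# Percentile (quantile) interval: quantiles of the re-sampled SDs.
ci.quant = quantile(s.boot, c(.025, .975))
# Normal-approximation interval (assumed form): s.obs +/- z * se.boot.
ci.norm = s.obs + c(-1, 1) * qnorm(.975) * se.boot
# Simple t interval (assumed form): same, with a t quantile on n - 1 df.
ci.t = s.obs + c(-1, 1) * qt(.975, n - 1) * se.boot
rbind(quantile = ci.quant, normal = ci.norm, t = ci.t)
```

Because the sampling distribution of $S$ is skewed for small $n,$ these three intervals need not agree closely with one another or with the ratio-based interval above.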


Now it is time for a confession: I generated x from a normal distribution as follows:

set.seed(1234); x = round(rnorm(20, 300, 50))

So I know that the data are normal. The standard 95% CI for $\sigma$ of a normal population is $\left(\sqrt{\frac{(n-1)S^2}{U_q}}, \sqrt{\frac{(n-1)S^2}{L_q}}\right),$ where $L_q$ and $U_q$ are quantiles .025 and .975, respectively, of $\mathsf{Chisq}(\nu = n-1).$ So the traditional parametric 95% CI for $\sigma$ is $(38.5, 74.0)$. Because knowing the population distribution introduces new and useful information into the process of estimation, we cannot expect the normal-based CI to be the same as the bootstrap CI, but they are not much different for practical purposes.

sqrt((n-1)*var(x) / qchisq(c(.975,.025), n-1))
[1] 38.53682 74.01249