Upscaling sample size for bootstrap during resampling

I started experimenting with bootstrapping and noticed that using a bigger sample size gives a tighter confidence interval, especially at very low sample size.

I ran a test with a bootstrap function that upscales the sample size during the resampling step. The estimate it gives is very close to the one obtained without upscaling, but the bootstrap distribution is less spread out.

Is there a downside to upscaling the sample size in this way? Is the CI still valid in that case?

Best answer

I commented that using bootstrap samples larger than the original sample size will tend to give you narrower intervals around the original sample mean, and that these intervals will not cover the population mean as often as you would hope.
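
A rough sketch of why this happens, using notation that is my own assumption rather than part of the original comment: write $n$ for the original sample size, $m$ for the resample size, $\sigma^2$ for the population variance and $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar x)^2$ for the variance of the original sample. Conditional on the original sample, the mean $\bar x^*_m$ of a resample of size $m$ satisfies

$$\operatorname{Var}\left(\bar x^*_m \mid x_1,\dots,x_n\right) = \frac{\hat\sigma^2}{m}, \qquad \text{whereas} \qquad \operatorname{Var}(\bar x) = \frac{\sigma^2}{n}.$$

So the spread of the bootstrap distribution shrinks like $1/\sqrt{m}$, while the actual error of $\bar x$ as an estimate of the population mean depends only on $n$. With $m = 10n$, the interval is roughly $\sqrt{10} \approx 3.16$ times narrower than it would be with $m = n$, even though the original sample is no more informative.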

There is a separate issue that the bootstrap methodology can be overoptimistic (i.e. confidence intervals tend to be too narrow), especially for small original sample sizes.

As an illustration of both issues, let's find bootstrap $95\%$ confidence intervals for the mean when sampling from a normal distribution, using the following R code. It is not time efficient for large resample sizes, but it serves to illustrate the point. In each example, I draw an original sample from a normal distribution, take $1000$ resamples with replacement from it to suggest a $95\%$ confidence interval for the mean, and then check whether that interval covers the population mean. I do this for $1000$ different original samples, so I hope that about $950$ of the confidence intervals will cover the population mean and about $50$ will not.

# Mean of one resample of size resamplesize, drawn with replacement
avresample <- function(originalsample, resamplesize) {
  mean(sample(originalsample, resamplesize, replace = TRUE))
}

# Draw one original sample, build a percentile bootstrap CI for the mean,
# and report whether that CI covers the true population mean
cicoverspopmean <- function(samplesize, resamplesize, bootcases,
                            popmean = 17, popsd = 2, alpha = 0.05) {
  originalsample <- rnorm(samplesize, popmean, popsd)
  bootsims <- replicate(bootcases, avresample(originalsample, resamplesize))
  ci <- quantile(bootsims, c(alpha / 2, 1 - alpha / 2))
  ci[1] <= popmean && ci[2] >= popmean
}

set.seed(2024)

For the first example, the original sample sizes are $1000$ and the resample sizes are the same. The confidence intervals achieve close to the intended coverage of the population mean (allowing for simulation noise):

covermeanA <- replicate(1000, cicoverspopmean(
              samplesize=10^3, resamplesize=10^3, bootcases=1000))
table(covermeanA)
# covermeanA
# FALSE  TRUE 
#    47   953 

For the second example, the original sample sizes are $1000$ and the resample sizes are ten times this (this makes the code particularly slow, as overall it involves $1000 \times 1000 \times 10^4 = 10^{10}$ individual resampled values). The confidence intervals are then too narrow and their coverage of the population mean is a lot less than intended:

covermeanB <- replicate(1000, cicoverspopmean(
              samplesize=10^3, resamplesize=10^4, bootcases=1000))
table(covermeanB)
# covermeanB
# FALSE  TRUE 
#   518   482 

For the third example, the original sample sizes are $10$ and the resample sizes are the same. The overoptimism of the bootstrap method with small sample sizes causes the confidence intervals' coverage of the population mean to be less than intended:

covermeanC <- replicate(1000, cicoverspopmean(
             samplesize=10, resamplesize=10, bootcases=1000))
table(covermeanC)
# covermeanC
# FALSE  TRUE 
#    97   903 

For the fourth example, the original sample sizes are $10$ and the resample sizes are ten times this. This combination causes coverage of the population mean to be even smaller:

covermeanD <- replicate(1000, cicoverspopmean(
             samplesize=10, resamplesize=10^2, bootcases=1000))
table(covermeanD)
# covermeanD
# FALSE  TRUE 
#   584   416
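
As a rough cross-check on the four simulated coverage rates above, the coverage can also be approximated analytically. This is only a sketch under stated assumptions and not part of the original simulation: the data are normal, the bootstrap distribution of the resample mean is treated as normal with standard deviation $\hat\sigma/\sqrt{m}$ (where $\hat\sigma^2$ is the sample variance with divisor $n$), and Monte Carlo error in the bootstrap quantiles is ignored. The helper name `approxcoverage` is hypothetical. Under those assumptions, the interval covers the population mean when $|T| \le z_{1-\alpha/2}\sqrt{(n-1)/m}$, where $T$ follows a $t$ distribution with $n-1$ degrees of freedom, $n$ is the original sample size and $m$ is the resample size.

```r
# Approximate coverage of a percentile bootstrap CI for a normal mean.
# Hypothetical helper: assumes normal data and a normal-shaped bootstrap
# distribution, and ignores Monte Carlo error in the bootstrap quantiles.
approxcoverage <- function(samplesize, resamplesize, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  # CI half-width expressed in units of the usual standard error s / sqrt(n)
  tcut <- z * sqrt((samplesize - 1) / resamplesize)
  2 * pt(tcut, df = samplesize - 1) - 1
}

# Same four combinations of sample size and resample size as the examples
coverage <- c(
  A = approxcoverage(10^3, 10^3),
  B = approxcoverage(10^3, 10^4),
  C = approxcoverage(10, 10),
  D = approxcoverage(10, 10^2)
)
round(coverage, 2)
```

By my calculation these approximations come out at roughly $0.95$, $0.46$, $0.90$ and $0.43$, against simulated proportions of $0.953$, $0.482$, $0.903$ and $0.416$. That is consistent with the two effects described above: the small-sample overoptimism accounts for the drop from $0.95$ to about $0.90$, and upscaling the resample size by a factor of ten accounts for the much larger drop to under one half.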