How to estimate population mean by bootstrap sampling?


Let's say I have a population of size 1M, and I took a sample of 10k. For each individual in the 10k sample, I recorded an observation x. Afterwards, I subsampled the 10k sample 1000 times with replacement and calculated each subsample's mean(x). Now I have a distribution of 1000 means from the subsamples, and I can calculate the mean, standard deviation, etc. of this distribution of means. How should I go about estimating the mean of the entire population with a 95% confidence interval?

Is it just the mean of the 1000 subsample means ± 1.96 × sd(1000 subsample means)?

Additional info: I subsampled the 10k sample 1000 times, each time taking 10k elements from the original sample with replacement.


BEST ANSWER

Here is some background on nonparametric bootstrapping.

Suppose you have a sample $X_1, \dots, X_n$ from a population with unknown (but assumed to exist) mean $\mu.$ If you knew the distribution of $V = \bar X - \mu,$ then you could find lower and upper values $L$ and $U$, respectively, such that $P(L \le \bar X - \mu \le U) = 0.95.$ After obvious algebraic manipulation, $P(\bar X - U \le \mu \le \bar X - L) = 0.95,$ so a 95% confidence interval for $\mu$ would be $(\bar X - U, \bar X - L).$

However, because you do not know the distribution of $V,$ you enter the 'bootstrap world' to seek suitable estimates of $L$ and $U.$ Here we (temporarily) use $\mu^* = \bar X$ as a proxy for the actual population mean $\mu.$ We take a large number $B$ of re-samples of size $n$ with replacement from the sample and find $\bar X_i^*$ for each. Then we cut 2.5% from each tail of the re-sampled distribution of the $V_i^* = \bar X_i^* - \mu^*$ to get estimates $L^*$ and $U^*$ of $L$ and $U,$ respectively.

Returning to the 'real world' we use $(\bar X - U^*, \bar X - L^*)$ as a 95% bootstrap CI for $\mu.$ Notice that here $\bar X$ has returned to its original role as the sample mean of the original data.


Example (with code): For simplicity, using smaller samples than in your example, I generate a sample of size $n = 200$ from $\mathsf{Norm}(\mu = 50, \sigma = 7)$ to use as (fake) data. Then I take $B = 10,000$ bootstrap re-samples to get an approximate 95% CI for $\mu.$

In the R code below, I use the suffix .re instead of * to denote quantities based on re-sampling. I have used a for-loop instead of more elegant structures available in R, in case you are not familiar with R. If you are familiar with R, I have included the seeds I used for the pseudorandom number generator so you can replicate what I have done.

set.seed(1234); n = 200; x = rnorm(n, 50, 7)
a.obs = mean(x);  s.obs = sd(x); pm = c(-1,1)
a.obs + pm*qt(.975, n-1)*s.obs/sqrt(n)
## 48.59944 50.59193    # traditional t conf int, assuming normal data
summary(x)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  30.01   44.58   48.80   49.60   53.87   71.31 

set.seed(1235)
B = 10^4;  v.re = numeric(B)
for(i in 1:B) {
   a.re = mean(sample(x, n, replace=TRUE))
   v.re[i] = a.re - a.obs }
L = quantile(v.re, .025);  U = quantile(v.re, .975)
a.obs - U; a.obs - L
##    97.5% 
## 48.59839 
##     2.5% 
## 50.59273 

This procedure would protect against bias if the data were from a skewed distribution. It assumes that the empirical CDF of the data approaches the population CDF for a sufficiently large $n.$

In some cases, the bootstrap CI can be a little shorter than the traditional t confidence interval. The t interval assumes normality and so 'contemplates' the existence of possible values in both directions not occurring in our sample of size $n.$ By contrast, the bootstrap CI uses only the data which lie inside $(30.01, 71.31).$


Notes: (a) The idea behind your suggested procedure assumes normal data, and it offers no protection against bias from skewed data. Also, the standard deviations $S^*$ of the re-samples estimate the population SD $\sigma,$ so you would need to use $\bar S^*/\sqrt{n}$ (not $\bar S^*$) as the standard error.
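A minimal sketch of that normal-style variant, regenerating the same fake data as in the example above (the variable names and the second seed here are my own choices, not part of the original example):

```r
set.seed(1234); n = 200; x = rnorm(n, 50, 7)   # same fake data as above
a.obs = mean(x)
B = 10^4; s.re = numeric(B)
set.seed(1236)                                 # arbitrary seed for re-sampling
for (i in 1:B) s.re[i] = sd(sample(x, n, replace = TRUE))
se.boot = mean(s.re)/sqrt(n)    # bar-S*/sqrt(n): SE of the mean, not sigma itself
ci = a.obs + c(-1, 1)*1.96*se.boot
ci                              # for normal data, close to the t interval above
```

Dividing by $\sqrt n$ is the essential step: without it the interval would be about $\sqrt n$ times too wide.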

(b) Your procedure is more like a parametric bootstrap. If you have normal data, I do not see the point of bootstrapping because the traditional t CI would give about the same results--with greater accuracy and less fuss. In my view, the only reason to use a parametric bootstrap would be for data known to be from a distribution other than normal (perhaps Laplace, gamma, or Weibull) where the procedures for exact CIs are computationally messy or may be subject to debate.
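For concreteness, a parametric bootstrap along those lines might look as follows. This is only a sketch: the gamma data, the parameter values, and the method-of-moments fit are all made up for illustration, not taken from the discussion above.

```r
set.seed(2021)
x = rgamma(200, shape = 4, rate = 0.5)     # fake gamma data (made-up parameters)
n = length(x); a.obs = mean(x)
shp = a.obs^2/var(x); rte = a.obs/var(x)   # method-of-moments estimates
B = 10^4; v.re = numeric(B)
for (i in 1:B)                             # re-sample from the *fitted* gamma,
  v.re[i] = mean(rgamma(n, shp, rte)) - a.obs  # not from the data themselves
L = quantile(v.re, .025); U = quantile(v.re, .975)
ci = unname(c(a.obs - U, a.obs - L))
ci                                         # 95% parametric bootstrap CI for mu
```

The only change from the nonparametric version is that `sample(x, n, replace=TRUE)` is replaced by draws from the fitted distribution, which is exactly where the distributional assumption enters.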

If you want to describe in a Comment any doubts you have about the nature of your data, or your specific reason for using bootstrap methods, I would try to respond accordingly.