I have two sets of observations, the brain size $(x)$ and body size $(y)$ of various animals. I am constructing a bootstrap estimate of the correlation.
In order to get the bootstrap estimates, I took a random sample (with replacement) from both $x$ and $y$, then computed the correlation, and repeated $10000$ times. However apparently this is wrong, and what I should have done is randomly selected one observation from each of $x$ and $y$, then repeated this $10000$ times.
Can someone explain why my approach gives the wrong answer? Why do I only need one sample from $x$ and $y$ each times?
I would do a nonparametric bootstrap for the correlation $r$ as follows:
Data for demonstration. First, some data. In R I generate $n = 100$ normally distributed pairs $(X,Y)$ with theoretical correlation $\rho=0.70.$ In R, you can reclaim the exact data I am using by starting the sampling with the same seed.
Here is a plot of the data.
Nonparametric bootstrap procedure. Now for the bootstrap. [I am ignoring that the data are jointly normally distributed. Using that assumption, one could use one of the established mathods to get a 95% CI for $\rho.$]
If I knew the distribution of the fraction $r/\rho,$ I could use it to find values $L$ and $R$ that cut 2.5% of the probability from the left and right tails of the distribution, respectively, to get $$P(L < r/\rho < U) = P\left(\frac r U < \rho < \frac r L\right) = .95,$$ so that a 95% CI for $\rho$ would be $\left(\frac r U,\, \frac r L\right).$
"Bootstrap world" for approximation. Absent specific distributional information, I use a bootstrap procedure to get approximations $L^*$ of $L$ and $U^*$ of $U.$ Entering the so-called 'bootstrap world', I temporarily use $r_{obs} = 0.6997$ a proxy for the unknown $\rho.$ Then, at the $i$th bootstrap step, I choose $n = 100$ pairs $(X_j,Y_j)$ at random with replacement, find the 're-sampled' correlation $r_i^*$, and find the re-sampled fraction $f_i^* = \frac{r_i^*}{r_{obs}}.$ After a large number $B$ of steps, I find quantiles ,025 $L^*$ and .975 $U^*$ of the $B$ values $f_i^*.$
"Real world" and CI. Returning to the 'real world' (with $r_{obs}$ once again in its role as an estimate), I find the 95% nonparametric bootstrap $\left(\frac{r_{obs}}{U^*}, \frac{r_{obs}}{L^*}\right).$
R code for nonparametric bootstrap CI for $\rho.$ Code for this procedure in R statistical software is shown below. [I have used suffixes
.reinstead of superscript $*$'s in the code. Unless you keep myset.seedstatement, you will get a slightly different result on each run of the program.]The resulting 95% nonparametric bootstrap CI is $(.620, .826),$ which contains the point estimate $r_{obs} = .6997,$ as it must. But the point estimate is typically not at the center of the CI.
Comments. This method of bootstrapping keeps ratios intact throughout and reduces the chances for a biased result. In the presence of bias or lack of symmetry (as here), I believe it would be incorrect to use quantiles of the re-sampled correlations $r_*$ themselves to make a confidence interval.
If you use one of the standard parametric CIs for $\rho,$ assuming that the $(X,Y)$ pairs are bivariate normal, then you may get a shorter CI. One reason you might get a different result is that assuming bivariate normality provides information that the nonparametric bootstrap does not use.
A histogram of the bootstrap distribution of the fraction $f_i^*$ is shown below along with the locations of $L^*$ and $U^*.$