Bootstrap estimate of correlation


I have two sets of observations, the brain size $(x)$ and body size $(y)$ of various animals. I am constructing a bootstrap estimate of the correlation.

In order to get the bootstrap estimates, I took a random sample (with replacement) from both $x$ and $y$, then computed the correlation, and repeated $10000$ times. However, apparently this is wrong: what I should have done is randomly select one observation from each of $x$ and $y$ together (i.e., as a pair), then repeat this $10000$ times.

Can someone explain why my approach gives the wrong answer? Why do I only take one sample from $x$ and $y$ each time?


Best answer:

I would do a nonparametric bootstrap for the correlation $r$ as follows:

Data for demonstration. First, some data. In R I generate $n = 100$ normally distributed pairs $(X,Y)$ with theoretical correlation $\rho = 1/\sqrt{2} \approx 0.707.$ You can reproduce the exact data I am using by starting the sampling with the same seed.

set.seed(224)
e = rnorm(100, 0, 15);  x = rnorm(100, 100, 15);  y = x + e
r.obs = cor(x, y);  r.obs
## 0.6996899
plot(x,y, pch=20)

Here is a plot of the data.

[Scatterplot of $y$ versus $x$ for the simulated data]

Nonparametric bootstrap procedure. Now for the bootstrap. [I am ignoring that the data are jointly normally distributed. Using that assumption, one could use one of the established methods to get a 95% CI for $\rho.$]

If I knew the distribution of the fraction $r/\rho,$ I could use it to find values $L$ and $U$ that cut 2.5% of the probability from the left and right tails of the distribution, respectively, to get $$P(L < r/\rho < U) = P\left(\frac r U < \rho < \frac r L\right) = .95,$$ so that a 95% CI for $\rho$ would be $\left(\frac r U,\, \frac r L\right).$

"Bootstrap world" for approximation. Absent specific distributional information, I use a bootstrap procedure to get approximations $L^*$ of $L$ and $U^*$ of $U.$ Entering the so-called 'bootstrap world', I temporarily use $r_{obs} = 0.6997$ as a proxy for the unknown $\rho.$ Then, at the $i$th bootstrap step, I choose $n = 100$ pairs $(X_j,Y_j)$ at random with replacement, find the 're-sampled' correlation $r_i^*,$ and compute the re-sampled fraction $f_i^* = \frac{r_i^*}{r_{obs}}.$ After a large number $B$ of steps, I take the .025 and .975 quantiles of the $B$ values $f_i^*$ as $L^*$ and $U^*,$ respectively.

"Real world" and CI. Returning to the 'real world' (with $r_{obs}$ once again in its role as an estimate), I find the 95% nonparametric bootstrap CI $\left(\frac{r_{obs}}{U^*}, \frac{r_{obs}}{L^*}\right).$

R code for nonparametric bootstrap CI for $\rho.$ Code for this procedure in R statistical software is shown below. [I have used suffixes .re instead of superscript $*$'s in the code. Unless you keep my set.seed statement, you will get a slightly different result on each run of the program.]

The resulting 95% nonparametric bootstrap CI is $(.620, .826),$ which contains the point estimate $r_{obs} = .6997,$ as it must. But the point estimate is typically not at the center of the CI.

set.seed(218)
B = 10000;  f.re = numeric(B)
for(i in 1:B) {
  ix = sample(1:100, 100, replace=TRUE)  # indices of 100 re-sampled pairs
  r.re = cor(x[ix], y[ix])               # re-sampled correlation
  f.re[i] = r.re/r.obs }                 # re-sampled fraction
L.re = quantile(f.re, .025);  U.re = quantile(f.re, .975)
r.obs/U.re;  r.obs/L.re
##     97.5% 
## 0.6200286 
##      2.5% 
## 0.8262627 

Comments. This method of bootstrapping keeps ratios intact throughout and reduces the chance of a biased result. In the presence of bias or lack of symmetry (as here), I believe it would be incorrect to use quantiles of the re-sampled correlations $r_i^*$ themselves to make a confidence interval.
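For comparison, here is a sketch of that simpler 'percentile' interval, built directly from the re-sampled correlations (my addition, not part of the method above; it assumes the `x`, `y`, and `r.obs` objects already defined):

```r
# Naive percentile bootstrap CI for rho, using quantiles of the
# re-sampled correlations themselves rather than of the fractions.
set.seed(218)
B = 10000;  r.re = numeric(B)
for(i in 1:B) {
  ix = sample(1:100, 100, replace=TRUE)  # re-sample pairs, as before
  r.re[i] = cor(x[ix], y[ix]) }
quantile(r.re, c(.025, .975))            # percentile interval for rho
```

Comparing its endpoints with the ratio-based CI above shows how much the choice of bootstrap pivot matters when the sampling distribution is skewed.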

If you use one of the standard parametric CIs for $\rho,$ assuming that the $(X,Y)$ pairs are bivariate normal, then you may get a shorter CI. One reason you might get a different result is that assuming bivariate normality provides information that the nonparametric bootstrap does not use.
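The standard parametric interval mentioned here uses Fisher's $z$ transformation: $z = \operatorname{artanh}(r)$ is approximately normal with standard error $1/\sqrt{n-3}$ under bivariate normality. A sketch, assuming the `x`, `y`, and `r.obs` objects defined above (`cor.test` in R returns this interval directly):

```r
# Parametric 95% CI for rho via Fisher's z (assumes bivariate normality).
n = 100
z = atanh(r.obs)                          # Fisher transformation of r
ci.z = z + c(-1, 1) * 1.96 / sqrt(n - 3)  # normal CI on the z scale
tanh(ci.z)                                # back-transform to the rho scale
cor.test(x, y)$conf.int                   # same interval from cor.test
```

If the normality assumption holds, this interval is typically a bit shorter than the nonparametric bootstrap one.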

A histogram of the bootstrap distribution of the fraction $f_i^*$ is shown below along with the locations of $L^*$ and $U^*.$

[Histogram of the bootstrap fractions $f_i^*$ with $L^*$ and $U^*$ marked]

Another answer:

You need each observation to have an $x$ and a $y$ value. Consider a scatterplot. If there is a strong, linear association, you'll see a linear pattern in the data. But in order to create a scatterplot, you need an $x$ and a $y$ coordinate for each observation. Resampling $x$ and $y$ separately breaks that pairing, so the re-sampled data carry no information about the association.
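This can be seen numerically with a small sketch (assuming the `x` and `y` vectors from the accepted answer): resampling the two variables independently destroys the pairing, so the re-sampled correlations center near $0$ rather than near $r_{obs}.$

```r
# Independent resampling of x and y (the questioner's first approach):
# each re-sampled x value is matched with an unrelated y value.
set.seed(1)
r.wrong = replicate(1000, cor(sample(x, 100, replace=TRUE),
                              sample(y, 100, replace=TRUE)))
mean(r.wrong)    # near 0: the x-y association is lost
```

Resampling whole pairs, as in the accepted answer, preserves the association being estimated.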