Why can we simply pool the realized observations in a permutation test?


Let a vector of i.i.d. random variables $(X_1,X_2,X_3,\cdots, X_m)$ and another vector of i.i.d. random variables $(Y_1,Y_2,Y_3,\cdots, Y_n)$ be given. Suppose $X_i$ is the recovery time under a new treatment and $Y_i$ the recovery time under the old treatment. Suppose we are interested in testing the null hypothesis $\mathbb{E}[X_i]=\mathbb{E}[Y_i]$ against the alternative hypothesis $\mathbb{E}[X_i]\neq\mathbb{E}[Y_i]$, and we wish to estimate the null distribution using the permutation distribution. This is where I fail to understand: we then say that under the null hypothesis $X_i$ and $Y_i$ are identically distributed, pool the r.v.s $X_i$ and $Y_i$, and find the permutation distribution.

Why do we know $X_i$ and $Y_i$ are identically distributed under the null? Their means being equal does not at all imply that they share the same distribution. Help is appreciated. (Maybe I misunderstood what my textbooks are saying; if so, please point out my mistake.)


You are correct to wonder what it means to pool two samples. Suppose we have two different treatments and apply each to a random sample of 10 subjects from a certain population. Do the treatments have different effects?

If the null hypothesis is that the two population means are equal, and the 'metric' we use judges whether the means are the same, then pooling is OK. But if the metric judges something else, then maybe not. (Strictly speaking, the permutation distribution is exact when the null hypothesis says the two samples come from the same distribution, which makes the pooled observations exchangeable; a null hypothesis about the means alone justifies pooling only approximately.)

Here are two samples generated from populations with the same mean.

 x:  103 128  49 126 130 113 100  88 116 110
 y:   94  96 101 104  98  90  86 101  92  93

A permutation test to check whether population means $\mu_x$ and $\mu_y$ are equal might use the difference in sample means $d = \bar X - \bar Y$ or the usual two-sample t test. Here is a simulated permutation test based on the difference in means. The observed difference $\bar X - \bar Y$ is 10.8. The question is whether this difference is significantly different from 0.

 obs.dif = mean(x) - mean(y);  obs.dif
 ## 10.8
 B = 10^4;  perm.dif = numeric(B)
 for(i in 1:B) {
   perm = sample(c(x,y), 20)
   perm.dif[i] = mean(perm[1:10]) - mean(perm[11:20]) }
 mean(abs(perm.dif) > abs(obs.dif))
 ## 0.1783

The P-value of the simulated permutation test is about 18%, which is hardly evidence that the observed difference 10.8 is significant. The histogram below shows the simulated absolute permuted differences in means; the vertical line marks the observed value for the original two samples.

[Histogram of the simulated absolute permuted differences in means, with a vertical line at the observed value 10.8.]

A Welch (separate-variances) t test gives a comparable P-value.

 t.test(x, y, alternative="two.sided", mu=0)

         Welch Two Sample t-test

 data:  x and y 
 t = 1.3787, df = 9.96, p-value = 0.1982
 alternative hypothesis: true difference in means is not equal to 0 
 sample estimates:
 mean of x mean of y 
     106.3      95.5 

Now let's suppose that by saying there is no effect we mean that the standard deviations are the same. Suppose we use the difference in the sample standard deviations as the 'metric'. The observed difference in sample SDs is 18.55. Is that significantly different from 0?

 obs.dif = sd(x) - sd(y);  obs.dif
 ## 18.55140
 B = 10^4;  perm.dif = numeric(B)
 for(i in 1:B) {
     perm = sample(c(x,y), 20)
     perm.dif[i] = sd(perm[1:10]) - sd(perm[11:20]) }
 mean(abs(perm.dif) > abs(obs.dif))
 ## 0.0021

In this case the P-value is quite small, and we conclude the treatment does make a difference. (A standard F test also finds a difference in variances, based on a very small P-value.) The figure below is similar to the previous one, but for absolute differences in SDs.

[Histogram of the simulated absolute permuted differences in SDs, with a vertical line at the observed value 18.55.]

So it matters how we measure whether the treatments produce different results. If the metric is the difference in means, the original split into two samples behaves much like a random permuted split. But if the metric is the difference in standard deviations, the original split stands out clearly from the permuted ones.


One can argue that both permutation tests are legitimate. But we need to make sure that the way we frame the null and alternative hypotheses, and the way we measure whether there is an 'effect', both make practical sense.
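To see why pooling is justified when the two samples really do come from one common distribution, here is a small simulation sketch. Both samples are drawn from the same normal population, and the SD-metric permutation test from above is applied repeatedly; its rejection rate at the 5% level should then be close to 5%. (The seed, the number of replications, and the number of permutations per test are illustrative choices, not values used elsewhere in this answer.)

```r
# Sketch: when x and y come from the SAME distribution, the pooled
# observations are exchangeable, so the permutation test has roughly
# its advertised level, whatever metric we use.
set.seed(1)
reps <- 200          # number of simulated data sets (arbitrary choice)
B    <- 200          # permutations per test (arbitrary choice)
pvals <- numeric(reps)
for (r in 1:reps) {
  x <- rnorm(10, 100, 25)          # both samples from the same population
  y <- rnorm(10, 100, 25)
  obs.dif  <- sd(x) - sd(y)
  perm.dif <- numeric(B)
  for (i in 1:B) {
    perm <- sample(c(x, y), 20)    # random relabeling of the pooled data
    perm.dif[i] <- sd(perm[1:10]) - sd(perm[11:20])
  }
  pvals[r] <- mean(abs(perm.dif) >= abs(obs.dif))
}
mean(pvals <= 0.05)   # rejection rate; should be near 0.05
```

By contrast, when the two populations differ in spread (as in the example above), the pooled relabelings mix the two spreads together, and the SD metric detects the mismatch.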

Note: (1) The ten $X_i$ were simulated from $Norm(\mu=100, \sigma=25)$ (by chance there is an outlier) and the ten $Y_i$ were simulated from $Norm(100, 5).$ All 20 observations were rounded to integers. Of course, in a practical situation we would not have such information about the two populations. (2) For more on permutation tests, see the paper by Eudey et al. in the Journal of Statistics Education.
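Data like those in the note could be generated with a few lines of R. The seed here is an arbitrary choice, so these values will not reproduce the x and y shown earlier:

```r
# Simulate two samples with equal means but unequal SDs, rounded to
# integers as described in the note.  The seed is arbitrary, so the
# resulting values differ from the x and y used in the answer.
set.seed(2024)
x <- round(rnorm(10, mean = 100, sd = 25))
y <- round(rnorm(10, mean = 100, sd = 5))
x; y
```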