Is it possible to tell if pairs of values are sampled from the same distribution?


Let's say I construct two lists, $A$ and $B$, each containing $N$ pairs of values.

For $A$, the $i$th pair of values, $(A_{i,1}, A_{i,2})$, consists of two samples from some arbitrary probability distribution. This distribution is not necessarily the same for each pair. (In particular, $A_{i,1}$ and $A_{j,1}$ for $i \neq j$ are not necessarily sampled from the same distribution.)

For $B$, the $i$th pair of values, $(B_{i,1}, B_{i,2})$, consists of one sample each from two arbitrary probability distributions.

If I gave you two lists constructed in this way, could you tell which is which?


Is the fact that the values $A_{i,1}$ and $A_{i,2}$ come from the same distribution (and are thus "correlated" in a sense), while $B_{i,1}$ and $B_{i,2}$ do not, sufficient to distinguish the two lists, even for extremely large values of $N$?

What information at minimum is required to distinguish two lists constructed in this way as $N \to \infty$?

2 Answers


If you have enough observations, and if the two distributions are sufficiently different, then it should not be difficult to distinguish the $A$'s from the $B$'s.

For both $A$ and $B$, take differences of the pairs. $A_{i,1}$ and $A_{i,2}$ come from the same distribution, so the differences $D_{A,i} = A_{i,1} - A_{i,2}$ should be consistently small.

By contrast, $B_{i,1}$ and $B_{i,2}$ may, at random, come from different distributions, so the differences $D_{B,i} = B_{i,1} - B_{i,2}$ will be a mixture of small and large values and hence have a larger variance.

A variance test of the $D_A$ against the $D_B$ should detect the difference in variances.

set.seed(2020)
n = 20

# A: both members of each pair share the same randomly chosen mean
mu.a = sample(c(10,50), n, replace=TRUE)
a1 = rnorm(n, mu.a, 2)
a2 = rnorm(n, mu.a, 2)
da = a1 - a2

# B: each member of a pair gets an independently chosen mean
mu.b1 = sample(c(10,50), n, replace=TRUE)
mu.b2 = sample(c(10,50), n, replace=TRUE)
b1 = rnorm(n, mu.b1, 2)
b2 = rnorm(n, mu.b2, 2)
db = b1 - b2

var(da); var(db)
[1] 8.959197
[1] 595.5409

var.test(da,db)

    F test to compare two variances

data:  da and db
F = 0.015044, num df = 19, denom df = 19, 
  p-value = 3.547e-13
alternative hypothesis: 
  true ratio of variances is not equal to 1
95 percent confidence interval:
  0.005954518 0.038007415
sample estimates:
ratio of variances 
         0.0150438 

Your idea of looking at correlations also seems feasible.

cor(a1, a2)
[1] 0.9903569
cor(b1, b2)
[1] 0.2256975

par(mfrow=c(1,2))  # show the two scatterplots side by side
 plot(a1, a2, pch=20)
 plot(b1, b2, pch=20)
par(mfrow=c(1,1))

[Scatterplots of (a1, a2) and (b1, b2).]

However, I don't understand the question about sample size; I don't see why the variances should become more alike as the sample size increases. I ran my code with $n = 2000$ instead of $n = 20.$ The P-value of var.test changed from nearly $0$ to an output of exactly $0,$ which probably means a P-value small enough to cause underflow.

And your idea of correlation also works fine with larger samples:

cor(a1,a2)
[1] 0.9902555
cor(b1,b2)
[1] 0.01700688

Notes: (1) My only (and admittedly lame) reason for not comparing the correlations with a formal test is that I didn't want to figure out how to do it in R. (2) A Welch t test cannot tell the difference between da and db at either sample size.
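Regarding note (1): a standard way to formally compare two independent correlations is Fisher's $z$-transformation, under which a sample correlation is approximately normal with standard error $1/\sqrt{n-3}$. The helper function below is a sketch I am adding, not part of the original answer; it regenerates data with the same construction as above so it runs on its own:

```r
set.seed(2020)
n <- 20
mu.a <- sample(c(10, 50), n, replace = TRUE)          # shared mean per A-pair
a1 <- rnorm(n, mu.a, 2); a2 <- rnorm(n, mu.a, 2)
b1 <- rnorm(n, sample(c(10, 50), n, replace = TRUE), 2)  # independent means per B-value
b2 <- rnorm(n, sample(c(10, 50), n, replace = TRUE), 2)

# Two-sided test of H0: rho1 = rho2 via Fisher's z-transform (atanh).
# The difference of transformed correlations is approximately normal
# with standard error sqrt(1/(n1-3) + 1/(n2-3)).
fisher.z.test <- function(r1, r2, n1, n2) {
  z <- (atanh(r1) - atanh(r2)) / sqrt(1/(n1 - 3) + 1/(n2 - 3))
  2 * pnorm(-abs(z))  # two-sided p-value
}

fisher.z.test(cor(a1, a2), cor(b1, b2), n, n)
```

With correlations around $0.99$ versus $0.2,$ the resulting p-value is tiny, in line with the informal comparison above.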


No, this is not possible in general. Consider any two probability density functions $f_1, f_2$. Draw every value of $A$ from the mixture density $\frac12(f_1+f_2)$. For each $i$ and $j$, independently and uniformly at random choose $r_{i,j}\in\{1,2\}$, and sample $B_{i,j}$ from $f_{r_{i,j}}$. Then no test can distinguish the lists (even if $f_1$ and $f_2$ were known): a single draw from $f_{r_{i,j}}$ with $r_{i,j}$ uniform is exactly a draw from the mixture, so without knowledge of the $r_{i,j}$ the pairs in the two lists have identical joint distributions.
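A quick simulation (my addition, not part of the original answer) illustrates the point with $f_1 = N(10, 2^2)$ and $f_2 = N(50, 2^2)$: building $A$ from the mixture and $B$ by coin-flipping between $f_1$ and $f_2$ per value is literally the same sampling process, so every statistic agrees up to sampling noise:

```r
set.seed(1)
n <- 1e5

# "A": each value is an independent draw from the mixture (f1 + f2)/2,
# realized by picking a mean uniformly from {10, 50} and adding N(0, 4) noise.
a1 <- rnorm(n, sample(c(10, 50), n, replace = TRUE), 2)
a2 <- rnorm(n, sample(c(10, 50), n, replace = TRUE), 2)

# "B": each value independently gets f1 or f2 with probability 1/2 --
# which is the same process as drawing from the mixture.
b1 <- rnorm(n, sample(c(10, 50), n, replace = TRUE), 2)
b2 <- rnorm(n, sample(c(10, 50), n, replace = TRUE), 2)

# Both correlations are near 0, and the variances of the differences match:
cor(a1, a2); cor(b1, b2)
var(a1 - a2); var(b1 - b2)
```

This is why the first answer's tests work only for its particular construction of $A$ (a shared mean within each pair); if the per-pair distribution of $A$ is itself the mixture, the within-pair correlation vanishes and the two lists become indistinguishable.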