Generate data that can pass two-sample tests

Is there a subfield of statistics that focuses on generating synthetic data that are hard to distinguish from a given sample?

More formally, suppose $\{x_1, x_2, \cdots, x_n\}$ are sampled i.i.d. from $p(x)$, where $p(x)$ is unknown to us. Can we devise a procedure that generates samples $\{x_1', x_2',\cdots, x_m'\}$ for arbitrary $m$, such that two-sample tests have difficulty distinguishing the two sets? Is there a nontrivial solution that doesn't copy any $x_i$ to $x_j'$?

For example, one solution might be:

  • Use a point estimate $\hat\theta$ to fit a parameterized distribution $p_\theta(x)$ to $\{x_1, x_2, \cdots, x_n\}$.
  • Generate i.i.d. samples from the fitted $p_{\hat\theta}(x)$.

However, is there any theoretical guarantee for this approach?
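The two steps above can be sketched in R as follows. This is only an illustration under stated assumptions: the gamma family is assumed to be the right model, the parameters are fitted by the method of moments rather than maximum likelihood, and `fit_gamma_mom` is a hypothetical helper, not a library function. Nothing here answers the question about guarantees; if the family is misspecified, a test with enough data can be expected to detect the mismatch.

```r
# Hypothetical helper: method-of-moments fit of a gamma distribution.
# Uses mean = shape / rate and variance = shape / rate^2.
fit_gamma_mom <- function(x) {
  m <- mean(x)
  v <- var(x)
  list(shape = m^2 / v, rate = m / v)
}

set.seed(2024)
x <- rgamma(200, shape = 3, rate = 1/2)  # stands in for the observed sample (assumption)

theta <- fit_gamma_mom(x)                # step 1: point estimate of the parameters
x_new <- rgamma(500, shape = theta$shape, rate = theta$rate)  # step 2: m = 500 fresh draws

# Two-sample Kolmogorov-Smirnov test between the original and synthetic samples
ks.test(x, x_new)$p.value
```

Because the synthetic values are fresh draws from a continuous distribution, none of them is a copy of an original $x_i$.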

There is 1 best solution below

This is very easy to do for small samples from distributions that are somewhat similar in shape and have the same means and standard deviations. For example, here are ten observations (rounded to three places) from $\mathsf{Gamma}(\text{shape} = 3, \text{rate} = 1/2),$ which has mean $\mu = 3/(1/2) = 6$ and standard deviation $\sigma = \sqrt{3}/(1/2) = \sqrt{12}.$

set.seed(1234);  x = round(rgamma(10, 3, 1/2), 3);  x
## 1.911 6.044 6.201 2.957 6.728 3.348 8.593 3.225 3.605 0.996
summary(x); sd(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.996   3.024   3.476   4.361   6.162   8.593 
## 2.401753

And here are ten observations (rounded to three places) from $\mathsf{Norm}(6, \sqrt{12}).$

set.seed(4321);  y = round(rnorm(10, 6, sqrt(12)), 3); y
##  4.522  5.225  8.486  8.915  5.555 11.575  4.971  6.679 10.298  3.510
summary(y); sd(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 3.510   5.034   6.117   6.974   8.808  11.580 
## 2.696809

A two-sample Kolmogorov-Smirnov test fails to reject the hypothesis that these two samples come from the same distribution at the 5% level, so the samples alone do not distinguish the two distributions.

ks.test(x, y)

        Two-sample Kolmogorov-Smirnov test

data:  x and y 
D = 0.5, p-value = 0.1678
alternative hypothesis: two-sided 

Although the sample medians differ noticeably (3.476 vs. 6.117), a nonparametric Wilcoxon rank sum test does not find a significant difference in location at the 5% level of significance, though only barely.

wilcox.test(x, y)

        Wilcoxon rank sum test

data:  x and y 
W = 24, p-value = 0.05243
alternative hypothesis: true location shift is not equal to 0 

This is harder to do with larger samples, but not impossible. Of course, there is no guarantee that any two samples (even from the same distribution) would not be found different by some test.
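Both remarks can be checked by simulation. The sketch below estimates how often a two-sample Kolmogorov-Smirnov test rejects at the 5% level, using the same Gamma and Normal distributions as above. The helper `reject_rate`, the sample sizes, and the number of replications are illustrative choices, not part of the original analysis.

```r
# Hypothetical helper: fraction of two-sample KS tests that reject at level alpha.
# same = TRUE : both samples come from Gamma(shape = 3, rate = 1/2)
# same = FALSE: Gamma(shape = 3, rate = 1/2) versus Normal(6, sqrt(12))
reject_rate <- function(n, same, reps = 500, alpha = 0.05) {
  p <- replicate(reps, {
    x <- rgamma(n, shape = 3, rate = 1/2)
    y <- if (same) rgamma(n, shape = 3, rate = 1/2) else rnorm(n, 6, sqrt(12))
    ks.test(x, y)$p.value
  })
  mean(p < alpha)
}

set.seed(1)
reject_rate(10,   same = FALSE)  # small samples: Gamma vs. Normal
reject_rate(2000, same = FALSE)  # large samples: Gamma vs. Normal
reject_rate(2000, same = TRUE)   # large samples from the same distribution
```

With samples of size 10 the test should rarely reject, while with samples of size 2000 it should reject most of the time. The last call should give a rejection rate near the nominal 5%, which illustrates that even two samples from the same distribution are occasionally declared different.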

If I knew the purpose of generating such samples, I might be able to give a better answer.