Is there a subfield of statistics that focuses on generating synthetic data that are hard to distinguish from a given sample?
More formally, given $\{x_1, x_2, \cdots, x_n\}$ sampled i.i.d. from an unknown distribution $p(x)$, can we devise a procedure to generate samples $\{x_1', x_2', \cdots, x_m'\}$ for arbitrary $m$, such that they are hard to distinguish from the original sample by two-sample tests? Is there any nontrivial solution that doesn't simply copy some $x_i$ as $x_j'$?
For example, one solution might be:
- Use a point estimate to fit a parametric distribution $p_\theta(x)$ to $\{x_1, x_2, \cdots, x_n\}$.
- Generate i.i.d. samples from $p_\theta(x)$.
However, is there any theoretical guarantee for this approach?
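A minimal sketch of this two-step idea, assuming (purely for illustration) that $p_\theta(x)$ is a Gamma family fitted by maximum likelihood with SciPy; the data, family choice, and seed below are hypothetical and not part of the question:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Pretend these are the observed data x_1, ..., x_n from an unknown p(x)
# (hypothetical: drawn here from a Gamma distribution for illustration).
x = rng.gamma(shape=3.0, scale=2.0, size=50)

# Step 1: point estimate -- fit a parametric family p_theta(x) by maximum likelihood.
shape_hat, loc_hat, scale_hat = stats.gamma.fit(x, floc=0)

# Step 2: generate m i.i.d. samples x_1', ..., x_m' from the fitted p_theta(x).
m = 200
x_prime = stats.gamma.rvs(shape_hat, loc=loc_hat, scale=scale_hat,
                          size=m, random_state=rng)

# Sanity check: a two-sample Kolmogorov-Smirnov test comparing x and x'.
print(stats.ks_2samp(x, x_prime))
```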
This is very easy to do for small samples from distributions that are somewhat similar in shape and have the same means and standard deviations. For example, consider ten observations (rounded to three places) from $\mathsf{Gamma}(\text{shape} = 3, \text{rate} = 1/2)$, hence with mean $\mu = 6$ and standard deviation $\sigma = \sqrt{12}$,
along with ten observations (rounded to three places) from $\mathsf{Norm}(6, \sqrt{12})$.
According to a two-sample Kolmogorov-Smirnov test, the distributions from which these samples were chosen cannot be distinguished by looking at the samples.
Although the sample medians differ noticeably (3.476 vs. 6.117), a nonparametric Wilcoxon rank sum test does not find a significant difference between the medians at the 5% level.
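A rough Python analogue of this comparison is sketched below. It uses SciPy rather than whatever software produced the samples above, and draws fresh data with an arbitrary seed, so the specific medians and p-values quoted above will not be reproduced:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)  # arbitrary seed, not the answer's original data

# Ten observations from Gamma(shape = 3, rate = 1/2), i.e. scale = 2.
g = rng.gamma(shape=3.0, scale=2.0, size=10)
# Ten observations from Norm(6, sqrt(12)), matching the Gamma mean and SD.
z = rng.normal(loc=6.0, scale=np.sqrt(12), size=10)

print("medians:", round(float(np.median(g)), 3), round(float(np.median(z)), 3))

# Two-sample Kolmogorov-Smirnov test on the empirical CDFs.
print(stats.ks_2samp(g, z))

# Wilcoxon rank sum test (equivalently Mann-Whitney U) for a location shift.
print(stats.ranksums(g, z))
```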
This is harder to do with larger samples, but not impossible. Of course, there is no guarantee that any two samples (even from the same distribution) would not be found different by some test.
If I knew the purpose of generating such samples, I might be able to give a better answer.