From test statistics to datasets

Suppose we are given an initial dataset $x$. From this dataset we can compute a test statistic using an appropriate hypothesis test (e.g. a one-sample t-test yields the $t$ statistic).

What I am looking for is a way to change the test statistic $t$ to a different value $t_{new}$ and then generate as many new datasets $x^*$ as desired that yield $t_{new}$ while remaining constrained by a descriptive statistic of the original dataset, such as the mean $\overline{x}$ or the standard deviation $\sigma_{x}$.

I am currently studying a method based on simulated annealing, which is flexible enough to accommodate many constraints but is slow. Two other classes of algorithms I am looking into are evolutionary programming and, the one I am most curious about, an adversarial approach using constrained GANs.

My assumption is that one could train a network for one specific test by feeding it datasets together with their test statistic $t$, and teach the generator to produce data whose computed test statistic $t_{comput}$ is close to the input test statistic $t_{input}$:

$$ t_{comput} \simeq t_{input} $$

I am curious whether any of you have had a similar problem, have any tips for me, or would suggest a different approach altogether.

Thank you for your help.

Best answer

I'm not sure I understand what you want or why. So my first guess may be far too simple.

Suppose we test $H_0:\mu = 30$ against $H_1: \mu \ne 30$ with $n = 100$ observations from $\mathsf{Norm}(\mu = 35, \sigma = 5).$

As an example, I generate the data x in R, rounded to four decimal places. I show the seed so you can get exactly the same dataset if you wish:

set.seed(4318);  x = round(rnorm(100, 35, 5), 4)
a = mean(x);  s = sd(x);  t = (a - 30)*sqrt(100)/s
a;  s;  t
[1] 35.53788
[1] 5.054078
[1] 10.95725

Now I want to constrain the situation to keep $a = 35.53788$, $t = 10.95725$, $n = 100$, and the hypothesized mean 30. Any new dataset meeting these constraints must also have sample standard deviation $s = 5.054078.$
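The reason is that the one-sample t statistic can be solved for $s$ once the other quantities are fixed. Writing $\mu_0 = 30$ for the hypothesized mean (a symbol introduced here only for this illustration):

$$ t = \frac{(a - \mu_0)\sqrt{n}}{s} \quad\Longrightarrow\quad s = \frac{(a - \mu_0)\sqrt{n}}{t} = \frac{(35.53788 - 30)\sqrt{100}}{10.95725} \approx 5.0541 $$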

To do this, I sample z of length 100 from the standard normal distribution:

set.seed(1234); z = rnorm(100);  a.z = mean(z);  s.z = sd(z)
y = z*s/s.z;  x1 = y - mean(y) + a
a1 = mean(x1);  s1 = sd(x1);  t1 = (a1 - 30)*sqrt(100)/s1
a1;  s1;  t1  
[1] 35.53788
[1] 5.054078
[1] 10.95725

Thus, simply by rescaling and shifting z, I have gotten a new data vector x1 with exactly the same t statistic as x. If I want it rounded to three places, that does not change the t statistic by much.

x1.r = round(x1, 3);  t.r = (mean(x1.r) - 30)*sqrt(100)/sd(x1.r)
sd(x1.r);  t.r
[1] 5.054076
[1] 10.95721
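If the goal is a different target value $t_{new}$ rather than the original $t$, the same rescaling idea appears to carry over: keep the mean fixed and solve for the standard deviation that the new statistic requires. The following is only a minimal sketch under that assumption; the function name `make_dataset` and its arguments are hypothetical, not part of any package, and the target value 8 is chosen arbitrarily.

```r
# Hypothetical helper: rescale a random draw so that it has a chosen sample
# mean and a chosen one-sample t statistic against the null value mu0.
make_dataset <- function(n, target_mean, target_t, mu0, rng = rnorm) {
  # t and (mean - mu0) always share a sign, and t = 0 is impossible
  # unless the mean equals mu0, so reject inconsistent targets.
  stopifnot(n >= 2, target_t != 0,
            sign(target_t) == sign(target_mean - mu0))
  # Solve t = (mean - mu0) * sqrt(n) / s for the required sample sd.
  target_sd <- (target_mean - mu0) * sqrt(n) / target_t
  z <- rng(n)
  # Standardize z to mean 0 and sd 1, then rescale and shift.
  (z - mean(z)) / sd(z) * target_sd + target_mean
}

set.seed(2024)
x_new <- make_dataset(n = 100, target_mean = 35.53788, target_t = 8, mu0 = 30)
t_new <- (mean(x_new) - 30) * sqrt(100) / sd(x_new)
c(mean = mean(x_new), sd = sd(x_new), t = t_new)
```

With the mean held at 35.53788, a target of $t_{new} = 8$ forces a sample standard deviation of about 6.92235. Holding the standard deviation fixed instead would mean solving the same formula for the mean. Each call with a different seed, or a different `rng`, gives another dataset meeting the same constraints.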

This seems too simple, so I await your comment.