Vetting a random number generator -- chi-square test gives varying p-value results


Suppose I have a random number generator and I want to check with a chi-square test whether its output is uniformly distributed or not. I can write a script that does this and run it several times. To my surprise, I get completely different results each time. Sometimes I get a p-value of 0.3, sometimes 0.987, and other times 0.003. Which number should I take? Should I average the p-values I get? How do I decide whether this generator passes the test or not?

So what I try next is using a random number generator that I already "know" to be uniform and running the test on that. I still get wildly varying results, even when I increase the number of samples. From what I understand, the probability of seeing a very low p-value when drawing samples from a uniform distribution should be low, and the probability of seeing a high p-value should be high. But this doesn't seem to happen.

How should I interpret the results I get from this test?

This is the Python script I am using:

import numpy as np
from scipy.stats import chisquare

bins = 256
x = np.random.randint(bins, size=bins * 100)

# count how many samples fall into each bin
h = np.bincount(x, minlength=bins)

print(chisquare(h))

and these are the results I get:

Power_divergenceResult(statistic=303.19999999999999, pvalue=0.020572599306529871)
Power_divergenceResult(statistic=211.06, pvalue=0.97933788750272888)
Power_divergenceResult(statistic=289.66000000000003, pvalue=0.066874498546635575)
Power_divergenceResult(statistic=275.63999999999999, pvalue=0.17885688588645363)
Power_divergenceResult(statistic=257.86000000000001, pvalue=0.43814613213884313)
Power_divergenceResult(statistic=217.07999999999998, pvalue=0.95911527563656596)

What's more, I made a histogram of the p-values I got, and it looks pretty uniform. If I know the samples are drawn from a uniform distribution, shouldn't I get a high p-value most of the time?
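For reference, this is the sketch I used to tabulate the p-values (seeded here so the run is reproducible):

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(1)  # seeded for reproducibility
bins, n_runs = 256, 500
pvals = np.array([
    chisquare(np.bincount(rng.integers(bins, size=bins * 100),
                          minlength=bins)).pvalue
    for _ in range(n_runs)
])
# Under the null hypothesis the p-values should spread roughly
# uniformly over (0, 1), so each decile should hold about 50 of them.
counts, _ = np.histogram(pvals, bins=10, range=(0, 1))
print(counts)
```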

There are 2 answers below.

BEST ANSWER

If you are vetting a random number generator (RNG), this chi-squared test is one useful criterion. However, there are some additional considerations:

(a) The chi-squared statistic is based on integer counts, and thus is discrete. This means the P-values from the test will not be exactly uniformly distributed on $(0,1).$ [The approximating chi-squared distribution is continuous, but it is the discrete statistic that matters.] Even though the P-values are not exactly $\mathsf{Unif}(0,1),$ they will take many values throughout $(0,1)$ if the RNG is working as advertised.
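The discreteness is easiest to see with small samples. A sketch of my own (not part of the original answer): with only 30 rolls of a 6-sided die there are finitely many possible count vectors, so thousands of repeated tests yield far fewer distinct p-values than runs:

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
# 2000 chi-squared tests on small samples (30 rolls into 6 bins)
pvals = [chisquare(np.bincount(rng.integers(6, size=30), minlength=6)).pvalue
         for _ in range(2000)]
# Count distinct p-values (rounded to absorb floating-point noise)
n_distinct = len(set(np.round(pvals, 12)))
print(n_distinct)  # far fewer than 2000: the p-value is a discrete variable
```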

(b) Even if the RNG is 'good', the P-value will fall below 0.05 in about 5% of your runs. Thus it is necessary to do many runs (as you have done) before drawing a conclusion about the RNG. You should be suspicious of an RNG that consistently gives P-values below 0.05.
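That 5% false-alarm rate is easy to estimate empirically. This sketch (my construction, mirroring the script in the question) counts how often a sound generator is "rejected" at the 0.05 level:

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(42)
bins, n_runs = 256, 500
# Count runs where a good generator still produces p < 0.05
rejections = sum(
    chisquare(np.bincount(rng.integers(bins, size=bins * 100),
                          minlength=bins)).pvalue < 0.05
    for _ in range(n_runs)
)
rate = rejections / n_runs
print(rate)  # should land near 0.05 for a good generator
```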

(c) Perhaps surprisingly, you should also be suspicious of an RNG that consistently gives P-values above 0.95 (corresponding to very low chi-squared values). A frequent flaw in 'bad' RNGs is that their behavior is 'too regular'. A chi-squared statistic of 0 arises from perfect fit of observed counts to expected counts. Analogously, you should be suspicious if someone claims to have rolled a fair die 600 times, obtaining exactly 100 instances each of faces 1 through 6. The result seems "too good to be true."
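The die example can be checked directly: a perfectly even outcome gives a chi-squared statistic of 0 and a p-value of 1, the "too good to be true" extreme:

```python
from scipy.stats import chisquare

# 600 rolls of a fair die with exactly 100 of each face -- a perfect fit
res = chisquare([100, 100, 100, 100, 100, 100])
print(res.statistic, res.pvalue)  # 0.0 1.0
```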

More generally, thoroughly vetting an RNG requires subjecting it to a large number of simulation tasks to see that it gets the answers predicted by theory. Some of the tests need to be multivariate, because it is possible for an RNG to produce a suitably uniform-looking histogram in one dimension and yet put all its points on only a few hyperplanes of an $n$-dimensional unit hypercube. [The Mersenne Twister, the default RNG in R statistical software, has been vetted up to $n = 623$ dimensions. If you have access to R, type ?.Random.seed and ?runif at the console for explanatory pages.]
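A classic illustration of the hyperplane problem (my example, not mentioned in the answer) is the infamous RANDU generator: its one-dimensional histogram looks fine, yet every consecutive triple satisfies a fixed linear relation, which is why its triples fall on only a handful of planes in the unit cube:

```python
# RANDU: x_{n+1} = 65539 * x_n mod 2**31, a famously flawed LCG
def randu(seed, n):
    xs, x = [], seed
    for _ in range(n):
        x = (65539 * x) % 2**31
        xs.append(x)
    return xs

xs = randu(1, 1000)
# Because 65539**2 = 6*65539 - 9 (mod 2**31), every triple obeys
# x[k+2] = 6*x[k+1] - 9*x[k] (mod 2**31) -- hence the planes.
on_planes = all((xs[k + 2] - 6 * xs[k + 1] + 9 * xs[k]) % 2**31 == 0
                for k in range(len(xs) - 2))
print(on_planes)  # True
```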

There are curated 'batteries' of test problems for RNGs. Problems are chosen because they are notoriously difficult for RNGs that are only 'pretty good' to simulate correctly. One useful package of such problems is Marsaglia's "Diehard" battery, which you can read about on several Internet pages.

SECOND ANSWER

Take a look at this Cross Validated post. It will answer half of your questions, but I will point out the main ideas.

The p-value is a function of random data, and so is a random variable in its own right. Thus you should expect it to vary with some distribution (as you discovered experimentally). When the null hypothesis is true (in this case, that the data really does come from the uniform distribution), every p-value is equally likely, i.e. the p-value is itself uniformly distributed. So you are getting exactly the result you should expect in your second experiment! Moreover, if the p-value is uniformly distributed, you will get a "high" p-value most of the time (for any reasonable definition of high).

As for what you should do with the p-values and how you should interpret them and decide whether the generator passes or fails, check out this other Cross Validated post. In short, you can just average them and interpret the result as one p-value, but there are other, more natural approaches that are typically better.
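One commonly suggested option of that kind is Fisher's method, which combines independent p-values into a single overall p-value; scipy exposes it as `combine_pvalues`. A sketch using the six p-values reported in the question:

```python
from scipy.stats import combine_pvalues

# The six p-values reported in the question
pvals = [0.0206, 0.9793, 0.0669, 0.1789, 0.4381, 0.9591]
# Fisher's method: -2 * sum(log(p_i)) is chi-squared with 2k degrees
# of freedom under the null, giving one combined p-value.
stat, p_combined = combine_pvalues(pvals, method='fisher')
print(p_combined)  # a single overall p-value for the whole batch of tests
```

A very small combined p-value would suggest the generator fails overall, even if no single run was individually damning.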