Suppose I have a random number generator and I want to check with a Chi Square Test whether its pdf is uniform or not. I can write a script that does that and run it several times. To my surprise, I get completely different results each time. Sometimes I will get a p-value of 0.3, sometimes 0.987 and other times 0.003. Which is the number I should take? Should I try to get an average of the p-values I get? How do I decide if this generator passes the test or not?
So what I try next is using a random number generator that I already "know" to be uniform and I run the test on that. And I keep getting absolutely varying results. Even if I increase the number of samples, it keeps varying a lot! From what I understand, the probability of seeing a very low p-value when drawing random samples from a uniform distribution should be low, and the probability of seeing a high p-value should be high. But this doesn't seem to happen.
How should I interpret the results I get from this test?
This is the Python script I am using:
import numpy as np
from scipy.stats import chisquare

bins = 256
x = np.random.randint(bins, size=bins * 100)
h = np.bincount(x, minlength=bins)  # count occurrences of each value
print(chisquare(h))
and these are the results I get:
Power_divergenceResult(statistic=303.19999999999999, pvalue=0.020572599306529871)
Power_divergenceResult(statistic=211.06, pvalue=0.97933788750272888)
Power_divergenceResult(statistic=289.66000000000003, pvalue=0.066874498546635575)
Power_divergenceResult(statistic=275.63999999999999, pvalue=0.17885688588645363)
Power_divergenceResult(statistic=257.86000000000001, pvalue=0.43814613213884313)
Power_divergenceResult(statistic=217.07999999999998, pvalue=0.95911527563656596)
Even more, I made a histogram of the p-values I got and it looks pretty uniform. If I know the samples are drawn from a uniform distribution, shouldn't I get a high p-value most of the time?
If you are vetting a random number generator (RNG), this chi-squared test is one useful criterion. However, there are some additional considerations:
(a) The chi-squared statistic is based on integer counts, and thus is discrete. This means the P-values from the test will not be exactly uniformly distributed in $(0,1).$ [The approximating chi-squared distribution is continuous, but it is the discrete statistic that matters.] Even though the P-values are not exactly $\mathsf{Unif}(0,1),$ they will take many values throughout $(0,1)$ if the RNG is working as advertised.
(b) Even if the RNG is 'good', the P-value will be below 0.05 in about 5% of your runs. Thus it is necessary to do many runs (as you have done) before drawing a conclusion about the RNG. You should be suspicious of an RNG that consistently gives P-values below 0.05.
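The two points above can be checked directly: repeat the experiment many times with a generator believed to be good, and look at the whole collection of P-values rather than any single one. A quick sketch (the seed and the number of repetitions are arbitrary choices):

```python
import numpy as np
from scipy.stats import chisquare

# Run the chi-squared test many times on a presumably-good generator
# and summarize the resulting p-values.
rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
bins, n = 256, 256 * 100
pvals = np.array([
    chisquare(np.bincount(rng.integers(bins, size=n), minlength=bins)).pvalue
    for _ in range(1000)
])

frac = (pvals < 0.05).mean()
print(f"fraction of runs with p < 0.05: {frac:.3f}")  # should be close to 0.05
print(f"p-values span ({pvals.min():.3f}, {pvals.max():.3f})")
```

Roughly 5% of runs fall below 0.05, and the p-values spread across essentially all of $(0,1)$ — exactly the behavior you observed, and exactly what a good RNG should produce.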
(c) Perhaps surprisingly, you should also be suspicious of an RNG that consistently gives P-values above 0.95 (corresponding to very low chi-squared values). A frequent flaw in 'bad' RNGs is that their behavior is 'too regular'. A chi-squared statistic of 0 arises from a perfect fit of observed counts to expected counts. Analogously, you should be suspicious if someone claims to have rolled a fair die 600 times, obtaining exactly 100 instances each of faces 1 through 6. The result seems "too good to be true."
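The "too good to be true" case is easy to demonstrate: feed the test a histogram in which every bin hits its expected count exactly.

```python
import numpy as np
from scipy.stats import chisquare

# A "too perfect" histogram: exactly 100 counts in each of 256 bins,
# as if the generator deliberately balanced its output.
counts = np.full(256, 100)
res = chisquare(counts)
print(res.statistic, res.pvalue)  # -> 0.0 1.0
```

A statistic of 0 and a P-value of 1 from a single run proves nothing, but a generator that lands near this extreme run after run is almost certainly not behaving randomly.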
More generally, thoroughly vetting an RNG requires subjecting it to a large number of simulation tasks to see that it gets the same answers predicted by theory. Some of the tests need to be multivariate, because it is possible for an RNG to produce a suitably uniform-looking histogram in one dimension and yet put all its points on only a few hyperplanes of an $n$-dimensional unit hypercube. [The "Mersenne Twister," the default RNG in R statistical software, has been vetted up to $n = 623$ dimensions. If you have access to R, type help(".Random.seed") and help(runif) in the Session window for explanatory pages.] There are curated 'batteries' of test problems for RNGs. Problems are chosen because they are notoriously difficult for RNGs that are only 'pretty good' to simulate correctly. One useful package of such problems is Marsaglia's "Diehard" battery, which you can read about on several Internet pages.
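To give the flavor of a multivariate check, here is a minimal sketch of a two-dimensional "serial" test: group the draws into non-overlapping pairs and chi-square the counts of the joint cells. (This is only an illustration, not part of any standard battery; the bin count and sample size are arbitrary choices.)

```python
import numpy as np
from scipy.stats import chisquare

# Two-dimensional serial test sketch: do consecutive pairs of draws
# fill the 16 x 16 grid of joint cells uniformly?
rng = np.random.default_rng(1)  # arbitrary seed
bins = 16                        # 16 * 16 = 256 joint cells
x = rng.integers(bins, size=2 * bins * bins * 100)
cells = bins * x[0::2] + x[1::2]  # encode each (x[2i], x[2i+1]) pair as one cell index
counts = np.bincount(cells, minlength=bins * bins)
print(chisquare(counts))
```

A generator can pass the one-dimensional test yet fail this one if successive values are correlated, which is why batteries like Diehard go well beyond single-histogram checks.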