I am attempting to characterize some random number generator programs in a very simple way. Specifically, I'm rolling a simulated 6-sided die $3 \times 10^8$ times and keeping a count of how many times each of the six possible outcomes happens. I have used a $\chi^2$ test for each of the different random number generators. That is, for all of the observation counts $O_i$ and the expected value $E=3\times10^8/6$ I calculate: $$ \chi^2 = \sum_{i=1}^{6}\frac{(O_i-E)^2}{E}$$ This value is then compared to 20.52, which, if I've understood correctly, is the critical value for $p=0.001$ and 5 degrees of freedom, as in this chart.
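For concreteness, here is a sketch of the test I'm running. The roll count is scaled down from $3\times10^8$ to $6\times10^5$ so it runs quickly, and `random.randrange` stands in for the RNG under test:

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

N = 600_000                    # number of simulated die rolls (reduced from 3e8)
E = N / 6                      # expected count per face
counts = [0] * 6
for _ in range(N):
    counts[random.randrange(6)] += 1   # stand-in for the RNG under test

chi2 = sum((o - E) ** 2 / E for o in counts)
# Compare against the upper critical value for p = 0.001, 5 degrees of freedom
print(chi2, chi2 < 20.52)
```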
All of the random number generators pass this test, but I notice that some of them look a little too good. That is, the value of $\displaystyle \frac{(O_i-E)^2}{E}$ is less than 0.1 for each observation count.
I have looked at the fifteen NIST random number generator tests but I'd rather not run all of those tests and don't know which (if any) of those tests captures the notion I'm trying to convey.
So my question is: is there a statistically valid test that expresses this notion of "too good" in this context?
Let $D_j$ be the result of die roll $j$. Each die face has probability $p_i=\frac{1}{6}$ of occurring, so let's focus on the statistical properties of $\frac{(O_i-E_i)^2}{E_i}$. For a given number of die rolls $N$, let $Y^i_N=\sum\limits_{j=1}^N \mathbb{I}_{i}(D_j)$, where $\mathbb{I}_{i}(D_j)=1$ if $D_j=i$ and $0$ otherwise.
Therefore, $Y_N^i\sim \mathrm{Bin}(N,\frac{1}{6})$. By the central limit theorem, for large $N$ the distribution of $Y_N^i$ is approximately
$\mathcal{N}(Np_i,\,Np_i(1-p_i))$, with $p_i=\frac{1}{6}$.
From the above we can see that for a given $N$ we have $O_i=Y_N^i$, so approximately $(O_i-E_i)\sim \mathcal{N}(0,Np_i(1-p_i))$, and since $E_i=Np_i$: $$\frac{O_i-E_i}{\sqrt{E_i}}\sim \mathcal{N}(0,1-p_i)$$
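A quick empirical sanity check of that claim (a sketch, not part of the derivation): for face $i=1$, the standardized count $(O_i-E_i)/\sqrt{E_i}$ over many independent runs should have variance close to $1-p_i = 5/6 \approx 0.833$ under a fair die. The run and roll counts below are arbitrary small values chosen for speed:

```python
import random

random.seed(0)
runs, N, p = 1000, 2000, 1 / 6
E = N * p
zs = []
for _ in range(runs):
    # O_1 for this run: how many of the N rolls landed on face 0
    o = sum(1 for _ in range(N) if random.randrange(6) == 0)
    zs.append((o - E) / E ** 0.5)      # standardized count

mean = sum(zs) / runs
var = sum((z - mean) ** 2 for z in zs) / (runs - 1)
print(round(var, 3))  # should come out near 5/6 ~ 0.833
```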
Squaring this result, $\frac{(O_i-E_i)^2}{E_i}\sim (1-p_i)\,\chi^2_1$, so $Var\left(\frac{(O_i-E_i)^2}{E_i}\right)=2(1-p_i)^2\leq Var(\chi^2_1) = 2$. Moreover, $P(\chi^2_1\leq 0.1)\approx 25\%$, so getting any single term to be that low is not unusual.
However, if, as you say, every term is $\leq 0.1$, then $P(\chi^2_5\leq 0.5)\approx 0.008$, i.e. well under $1\%$, which is unusual enough to warrant a closer look.
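These tail probabilities are easy to check with a small chi-square CDF built from the series expansion of the regularized lower incomplete gamma function (a stdlib-only sketch; `chi2_cdf` is my own helper, not a library call):

```python
import math

def chi2_cdf(x, k):
    """P(chi^2_k <= x) via the series for the lower incomplete gamma."""
    a, s = k / 2, x / 2
    term = 1 / a
    total = term
    n = 0
    while term > 1e-14 * total:
        n += 1
        term *= s / (a + n)
        total += term
    return total * s ** a * math.exp(-s) / math.gamma(a)

print(round(chi2_cdf(0.1, 1), 3))  # ~0.248: one small term is unremarkable
print(round(chi2_cdf(0.5, 5), 3))  # ~0.008: the whole statistic that small is rare
```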
Here's what you can try: generate $N>50$ runs from each RNG you are testing, each with a different starting seed. For each run, perform your chi-square test and look at the distribution of the resulting p-values; under the null they should be uniformly distributed on $[0,1]$. If you want to be a bit more rigorous, you can perform a Kolmogorov-Smirnov goodness-of-fit test of each RNG's p-values against $H_0=$ standard uniform distribution to see if they are sufficiently uniform. If so, then what you saw was a statistical anomaly. If the p-values are skewed, then something weird is going on.
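A sketch of that procedure, with run and roll counts scaled down for speed (`one_run` and `chi2_sf` are hypothetical helper names, and the KS critical value $1.36/\sqrt{n}$ is the usual asymptotic 5% approximation):

```python
import math
import random

def chi2_sf(x, k):
    """Upper tail P(chi^2_k > x) via the lower-incomplete-gamma series."""
    a, s = k / 2, x / 2
    term = 1 / a
    total = term
    n = 0
    while term > 1e-14 * total:
        n += 1
        term *= s / (a + n)
        total += term
    return 1 - total * s ** a * math.exp(-s) / math.gamma(a)

def one_run(seed, rolls=6000):
    rng = random.Random(seed)          # stand-in for the RNG under test
    counts = [0] * 6
    for _ in range(rolls):
        counts[rng.randrange(6)] += 1
    E = rolls / 6
    stat = sum((o - E) ** 2 / E for o in counts)
    return chi2_sf(stat, 5)            # p-value for this run

pvals = sorted(one_run(seed) for seed in range(60))

# One-sample Kolmogorov-Smirnov statistic against Uniform(0, 1):
# the largest gap between the empirical CDF and the uniform CDF
n = len(pvals)
D = max(max((i + 1) / n - p, p - i / n) for i, p in enumerate(pvals))
print(D, D < 1.36 / math.sqrt(n))  # 1.36/sqrt(n): approx. 5% critical value
```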