Expected standard deviation of random binary data


I have a sequence of bytes, i.e., discrete integer values in the range 0..255. The sequence can be fairly short (thousands of bytes) but also relatively long (millions or billions of bytes). I want to perform a simple statistical test of how plausible the claim is that these numbers were generated by a fair random number generator (i.e., one producing uniformly distributed values).

Firstly, I want to check whether the sample mean is consistent with randomness. The mean I expect is 127.5. But at what "cutoff" values, and with what probability, is an observed mean still consistent with that expectation? In other words, a mean of 127.72 is still quite plausibly from a good generator, whereas if the mean of a million numbers were 20, the generator is most likely broken. How do I start to quantify this?

Secondly, I want to do the same with the standard deviation. I do believe the expected value would be

$\sqrt{\frac{21845}{4}} \approx 73.900$

But I am not entirely sure and much less know what the "plausible" range would be.
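For reference, my derivation, which may be where I went wrong: a discrete uniform variable on $\{0, 1, \ldots, N-1\}$ has variance $(N^2-1)/12$, so with $N = 256$,

$$\sigma^2 = \frac{256^2 - 1}{12} = \frac{65535}{12} = \frac{21845}{4} = 5461.25, \qquad \sigma = \sqrt{5461.25} \approx 73.900.$$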

How do I start quantifying this in a more systematic way so that I can say, for example, "we can be 95%, 99%, 99.9% confident that the generator was producing uniformly distributed bytes"?

BEST ANSWER

Let's describe a much simpler example to illustrate the problems with your proposed approach.

Say I have two random number generators, call them Generator $A$ and $B$, both of which only generate numbers in the set $\{0, 1, 2\}$. They each generate a sequence $\{A_1, A_2, A_3, \ldots \}$ and $\{B_1, B_2, B_3, \ldots \}$ of some arbitrary length.

Generator $A$ actually obeys $$\Pr[A_i = 0] = \Pr[A_i = 1] = \Pr[A_i = 2] = \frac{1}{3}$$ for each $i = 1, 2, \ldots$. But Generator $B$ instead obeys $$\Pr[B_i = 0] = \frac{1}{200}, \quad \Pr[B_i = 1] = \frac{99}{100}, \quad \Pr[B_i = 2] = \frac{1}{200}.$$

If you calculate the sample mean of a sequence from each generator, their expected values will both be $1$. In other words, if I generated a sequence of length $n = 1000$ from $A$, calculated the total, and divided by $1000$, I would get a number reasonably close to $1$. But I would also get this result if I did the same with $B$. In fact, you can see that because realizations from $B$ are equal to $1$ $99\%$ of the time, the sample mean of a sequence from $B$ will tend to be very close to $1$, much closer than for $A$.

And even worse, if you calculated the sample mean for Generator $C$, which obeys $$\Pr[C_i = 0] = \frac{1}{2}, \quad \Pr[C_i = 1] = 0, \quad \Pr[C_i = 2] = \frac{1}{2},$$ you would also get a value that is around $1$--but $C$ has zero probability of actually producing a $1$.
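The point is easy to see numerically. Here is a minimal Python sketch, simulating the three generators from this example (the weights are just the probabilities above scaled to integers):

```python
import random

random.seed(0)  # for reproducibility
n = 100_000

# Generator A: fair, each of {0, 1, 2} with probability 1/3.
a = random.choices([0, 1, 2], weights=[1, 1, 1], k=n)

# Generator B: heavily biased toward 1 (probabilities 1/200, 99/100, 1/200).
b = random.choices([0, 1, 2], weights=[1, 198, 1], k=n)

# Generator C: never produces 1 at all (probabilities 1/2, 0, 1/2).
c = random.choices([0, 1, 2], weights=[1, 0, 1], k=n)

# All three sample means come out close to 1, even though only A is uniform.
for name, seq in [("A", a), ("B", b), ("C", c)]:
    print(name, sum(seq) / n)
```

All three printed means land near $1$, which is exactly why the sample mean cannot distinguish these generators.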

This shows why a test of the uniformity of a distribution cannot be performed using the sample mean as the test statistic. In other words, in order to test whether your generator is "fair"--i.e., each outcome is as likely as any other--you cannot simply look at the sample mean, because it tells you nothing about the shape of the distribution, only its long-run average.

In order to formulate a test of uniformity, what you need to do is count the frequency of each type of outcome. If $X$ is a uniformly distributed random variable that can take on values in $\{0, 1, 2, \ldots, 255\}$, then for each such outcome, we must have $$\Pr[X = x] = \frac{1}{256}, \quad x \in \{0, 1, \ldots, 255\}.$$ Then when we generate a sample of size $n$, we would expect that $n/256$ of those observations will equal $0$; $n/256$ will equal $1$; and so forth. Clearly, the larger the number of possible outcomes, the larger your sample will need to be in order to detect deviations from uniformity.
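Tallying the frequencies is straightforward in code; a minimal sketch (the `data` here is just a stand-in for your actual byte sequence):

```python
from collections import Counter
import random

random.seed(1)
# Stand-in for your byte sequence; replace with your real data.
data = bytes(random.getrandbits(8) for _ in range(10_000))

counts = Counter(data)
n = len(data)
expected = n / 256  # under uniformity, each byte value appears about n/256 times

# Include byte values that never occurred as explicit zeros.
freq = [counts.get(v, 0) for v in range(256)]
print(min(freq), max(freq), expected)
```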

So going back to the simple example, suppose we generated from $A$ the sample $$(0, 1, 0, 0, 2, 2, 1, 2, 0, 1).$$ This has size $n = 10$ and we have four $0$'s, three $1$'s, and three $2$'s. We can also write this more compactly as $(4,3,3)$, where the first value represents the number of $0$'s, the second represents the number of $1$'s, and the third represents the number of $2$'s. How do we take this and quantify its deviation from uniformity?

Well, we can perform an exact test by noticing that the vector of counts $T = (4,3,3)$ follows a multinomial distribution. Specifically, we would calculate the probability of obtaining a result at least as extreme as the one observed. Under the null hypothesis that the sample was generated fairly, the probability of obtaining this exact result is $$\Pr[T = (4,3,3)] = \frac{10!}{4! 3! 3!} (1/3)^4 (1/3)^3 (1/3)^3 = \frac{1400}{19683}.$$ Then the $p$-value of the test is the sum of all multinomial probabilities that are less than or equal to this value. But it turns out in this case that all of them are, because $$\frac{10!}{t_1! t_2! t_3!} \le \frac{10!}{4! 3! 3!}$$ for any choice of nonnegative integers $t_1, t_2, t_3$ satisfying $t_1 + t_2 + t_3 = 10$. Thus the $p$-value of this test is $1$ and we have no evidence that the generator is unfair.

But suppose we observed something different, say $$T = (t_1, t_2, t_3) = (31, 22, 47).$$ Here $n = 100$ and this corresponds to observing $31$ $0$'s, $22$ $1$'s, and $47$ $2$'s. Then the probability of observing this $T$, assuming the generator is fair, is $$\Pr[T = (31, 22, 47)] = \frac{100!}{31! 22! 47!} (1/3)^{100} \approx 0.0000757569.$$ The $p$-value is the sum $$p = \sum_{T \in \mathcal S} \frac{100!}{t_1! t_2! t_3!} (1/3)^{100}$$ where $\mathcal S$ is the set of triplets $(t_1, t_2, t_3)$ for which $0 \le t_1, t_2, t_3 \le 100$, $t_1 + t_2 + t_3 = 100$, and $\Pr[T = (t_1, t_2, t_3)] \le \Pr[T = (31, 22, 47)]$. This is difficult to enumerate by hand and requires a computer to calculate, but I get $$p \approx 0.00910101,$$ which means the probability of getting a result at least this extreme from a fair generator by pure chance is about $0.9\%$, furnishing evidence that the generator is in fact unfair.
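This enumeration is easy to automate. A sketch of the exact test using exact rational arithmetic (the helper name `exact_multinomial_p` is mine, not a standard function; for brevity it only handles $m = 3$ categories):

```python
from fractions import Fraction
from math import comb

def exact_multinomial_p(observed):
    """Exact p-value for H0: all categories equally likely, summing the
    probabilities of every outcome at most as probable as the observed one."""
    n, m = sum(observed), len(observed)
    assert m == 3  # enumeration below is written for three categories only

    def pmf(counts):
        # Multinomial coefficient times (1/m)^n, kept as an exact Fraction.
        coef, rest = 1, n
        for t in counts:
            coef *= comb(rest, t)
            rest -= t
        return Fraction(coef, m ** n)

    p_obs = pmf(observed)
    p = Fraction(0)
    # Enumerate all (t1, t2, t3) with t1 + t2 + t3 = n.
    for t1 in range(n + 1):
        for t2 in range(n - t1 + 1):
            pr = pmf((t1, t2, n - t1 - t2))
            if pr <= p_obs:
                p += pr
    return float(p)

print(exact_multinomial_p((4, 3, 3)))     # p-value 1.0: no outcome is less extreme
print(exact_multinomial_p((31, 22, 47)))  # roughly 0.0091
```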

Since this exact multinomial test is impractical for large samples, an approximate test based on the asymptotic distribution of the same tally is preferable: the chi-square goodness-of-fit test. You would compute the statistic $$\sum_{i=1}^m \frac{(t_i - n/m)^2}{n/m},$$ where $m$ is the number of categories (in our earlier example, $m = 3$), $n$ is the sample size, and $t_i$ is the observed frequency of category $i$. So our previous example would have the test statistic $$\frac{(31-100/3)^2}{100/3} + \frac{(22-100/3)^2}{100/3} + \frac{(47-100/3)^2}{100/3} = 9.62.$$ Under the null hypothesis, this statistic is approximately chi-square distributed with $m - 1$ degrees of freedom. Thus the $p$-value is $$\Pr[\chi^2_2 > 9.62] \approx 0.00814786.$$
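In practice you would not compute this by hand. For instance, assuming SciPy is available, `scipy.stats.chisquare` defaults to equal expected frequencies, which is exactly this test:

```python
from scipy.stats import chisquare

# Observed frequencies of 0, 1, 2 from the example above;
# the expected frequencies default to uniform (n/m each).
stat, p = chisquare([31, 22, 47])
print(stat, p)  # statistic 9.62, p-value about 0.0081
```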

Now let's consider an example using your situation. Suppose I generated the sample

$$\left( \begin{array}{cccccccccccccccc} 47 & 51 & 36 & 43 & 45 & 47 & 52 & 38 & 39 & 37 & 32 & 40 & 29 & 41 & 38 & 45 \\ 38 & 46 & 34 & 37 & 32 & 42 & 43 & 25 & 38 & 45 & 35 & 33 & 42 & 31 & 42 & 40 \\ 32 & 38 & 33 & 32 & 27 & 42 & 47 & 37 & 41 & 41 & 34 & 45 & 48 & 45 & 37 & 43 \\ 43 & 42 & 42 & 33 & 36 & 35 & 51 & 45 & 33 & 41 & 40 & 46 & 35 & 30 & 24 & 44 \\ 32 & 33 & 46 & 36 & 40 & 37 & 50 & 38 & 39 & 40 & 39 & 35 & 37 & 43 & 41 & 42 \\ 39 & 37 & 45 & 38 & 42 & 40 & 47 & 33 & 34 & 28 & 21 & 41 & 37 & 42 & 48 & 40 \\ 41 & 36 & 39 & 46 & 45 & 38 & 37 & 39 & 36 & 39 & 45 & 34 & 32 & 38 & 41 & 38 \\ 43 & 30 & 40 & 30 & 42 & 40 & 32 & 38 & 40 & 41 & 30 & 32 & 41 & 37 & 41 & 38 \\ 31 & 41 & 49 & 39 & 43 & 43 & 38 & 24 & 36 & 34 & 54 & 38 & 35 & 46 & 48 & 46 \\ 34 & 32 & 48 & 39 & 26 & 40 & 41 & 45 & 43 & 36 & 45 & 38 & 30 & 43 & 39 & 27 \\ 47 & 32 & 43 & 35 & 42 & 39 & 41 & 34 & 44 & 46 & 31 & 39 & 41 & 46 & 46 & 40 \\ 41 & 42 & 31 & 32 & 44 & 46 & 35 & 38 & 43 & 30 & 29 & 51 & 40 & 47 & 39 & 33 \\ 27 & 45 & 46 & 40 & 53 & 41 & 39 & 39 & 32 & 35 & 33 & 43 & 44 & 47 & 31 & 45 \\ 37 & 44 & 35 & 31 & 41 & 39 & 48 & 41 & 38 & 39 & 40 & 33 & 44 & 42 & 42 & 35 \\ 37 & 40 & 43 & 34 & 49 & 36 & 36 & 33 & 35 & 32 & 37 & 40 & 49 & 39 & 40 & 34 \\ 40 & 42 & 42 & 34 & 38 & 41 & 43 & 33 & 45 & 40 & 38 & 41 & 41 & 34 & 45 & 38 \\ \end{array} \right)$$ which is a $16 \times 16$ array of values representing each outcome in $\{0, 1, 2, \ldots, 255\}$ when read left to right, top to bottom. The sample size is $n = 10^4$. Then the chi-squared statistic is $$\sum_{i=1}^{256} \frac{(t_i - 10^4/256)^2}{10^4/256} = 215.424.$$ We compute the probability that a chi-square distribution with $m - 1 = 255$ degrees of freedom exceeds this value; i.e., $$p = \Pr[\chi^2_{255} > 215.424] \approx 0.965803.$$ This is relatively large, so we have insufficient evidence to suggest the generator is unfair. 
But note that even here, the greatest observed frequency $54$ is over twice the least observed frequency $21$. Yet this was a "fair" sample. You will need a much larger sample to detect small deviations from uniformity with so many categories.
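For completeness, here is the same computation on the $16 \times 16$ tally above, again assuming SciPy is available:

```python
from scipy.stats import chisquare

# Observed frequencies of the 256 byte values, read left to right,
# top to bottom from the array above.
freq = [
    47, 51, 36, 43, 45, 47, 52, 38, 39, 37, 32, 40, 29, 41, 38, 45,
    38, 46, 34, 37, 32, 42, 43, 25, 38, 45, 35, 33, 42, 31, 42, 40,
    32, 38, 33, 32, 27, 42, 47, 37, 41, 41, 34, 45, 48, 45, 37, 43,
    43, 42, 42, 33, 36, 35, 51, 45, 33, 41, 40, 46, 35, 30, 24, 44,
    32, 33, 46, 36, 40, 37, 50, 38, 39, 40, 39, 35, 37, 43, 41, 42,
    39, 37, 45, 38, 42, 40, 47, 33, 34, 28, 21, 41, 37, 42, 48, 40,
    41, 36, 39, 46, 45, 38, 37, 39, 36, 39, 45, 34, 32, 38, 41, 38,
    43, 30, 40, 30, 42, 40, 32, 38, 40, 41, 30, 32, 41, 37, 41, 38,
    31, 41, 49, 39, 43, 43, 38, 24, 36, 34, 54, 38, 35, 46, 48, 46,
    34, 32, 48, 39, 26, 40, 41, 45, 43, 36, 45, 38, 30, 43, 39, 27,
    47, 32, 43, 35, 42, 39, 41, 34, 44, 46, 31, 39, 41, 46, 46, 40,
    41, 42, 31, 32, 44, 46, 35, 38, 43, 30, 29, 51, 40, 47, 39, 33,
    27, 45, 46, 40, 53, 41, 39, 39, 32, 35, 33, 43, 44, 47, 31, 45,
    37, 44, 35, 31, 41, 39, 48, 41, 38, 39, 40, 33, 44, 42, 42, 35,
    37, 40, 43, 34, 49, 36, 36, 33, 35, 32, 37, 40, 49, 39, 40, 34,
    40, 42, 42, 34, 38, 41, 43, 33, 45, 40, 38, 41, 41, 34, 45, 38,
]

stat, p = chisquare(freq)  # expected defaults to sum(freq)/256 per value
print(stat, p)
```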