How much variance does the random Mersenne twister algorithm allow

994 Views Asked by At

I'm sorry if I don't know how to phrase this question properly, but I would like to know when using a random number generator like the Mersenne twister, and I generate numbers which fall into a range [1..5] for example. I understand that I should then get an even distribution of those numbers (20% of each) but generally there is always a "variance" of that count (for example the number "1" would come out 18% of the time while "5" 22%).

How much of this variance is allowed and expected? Is there a proper term for this value? Is this dependent on different random number generator algorithms, and if so which ones are better at giving even distributions?

1

There are 1 best solutions below

2
On BEST ANSWER

The Mersenne twister is one of the best pseudorandom number generators available. When used to simulate values uniformly distributed in $(0,5],$ results are reliably as if one were able to sample at random from the distribution $\mathsf{Unif}(0, 5).$

Of course, that means there must be some randomness in the results. For example, if you sample 500 points at random, then the number $X$ of them falling in $(0,1]$ is $X \sim Binom(500, 1/5),$ which has mean $E(X) = np = 500(1/5) = 100,$ and variance $V(X) = np(1-p) = 80.$ The proportion in $(0,1]$ is $\hat p = X/n$, which has mean $E(\hat p) = 1/5)$ and variance $Var(\hat p) = \frac{p(1-p)}{n} = 0.00032.$

If I understand your Question correctly, there is nothing wrong (or 'biased') in what you are observing. Something would be very wrong with a pseudorandom process that put exactly 1/5 of the observations in each of the intervals $(0,1],\, (1,2], \dots\, (4,5].$ But if you take a very large sample, then (accoarding to the Law of Large Numbers) the proportions in each interval will get appropriately closer to 1/5.

Here is a brief simulation in R statistical software (which uses the Mersenne Twoister as its default pseudorandom generator). You can see that with such a large sample, the proportions of counts in each interval are very nearly the same.

n = 10^6;  x = runif(n, 0, 5)
hist(x, br=0:5, prob=T)
d = ceiling(x);  table(d)
## d
##      1      2      3      4      5 
## 200003 200370 200682 199846 199099 
hist(x, br=0:5, ylim=c(0, 210000), col="skyblue", lab=T)

enter image description here

By contrast, here is a relative frequency histogram for a run with $n = 500.$

enter image description here

Addendum: In this simulation with $500$ pseudorandom numbers, the 'observed' counts in the five categories were: $N = (105, 93, 87, 99, 116)$ and the 'expected' counts were $E =100$ in each category. Then the 'chi-squared goodness-of-fit statistic' is $Q = \sum_i \frac{(N_i - E)^2}{E} = 5.0.$ (Small values of $Q$ indicate good fit.) With expected counts as large as 100, $Q$ is well approximated by the distribution $\mathsf{Chisq}(4),$ for which the 95th percentile is 9.488. So, according to this standard test, our observed counts are consistent with equal probabilities in each category. With largest count 116 and smallest count 87, the observed distribution may seem 'uneven', but this degree of variation among counts is within what is to be expected with random sampling.