Simplest way to produce an even distribution of random values?


I'm a software engineer, working on a small randomizer library as part of a larger project.

We're using a cryptographic random number generator, which provides an array of random bytes.
We have to use these random bytes to produce an array of random integers fitting whatever requirements are specified.

For example, let's say someone requests $5$ random 8-bit unsigned integers between $50$ and $200$.
The value $50$ would be assigned to the variable $min$, and $200$ assigned to $max$.

Our generator then produces an array of $5$ random bytes, with values ranging from $0$ to $255$.

The most obvious method for converting each random byte $n$ into the desired range would be $$min+\operatorname{mod}(n,\,(max-min)+1)$$ where $\operatorname{mod}$ is the modulo operation.

This would convert each random byte $n$ into a random integer between $min$ and $max$.

The problem with this solution is:
It doesn't produce an even distribution, because each $n$ is a random integer between $0$ and $255$.
Therefore, in cases where $n>max-min$, the distribution overlaps unevenly with itself.
In the example above, the result is twice as likely to fall between $50$ and $154$ as between $155$ and $200$.
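Since there are only $256$ possible byte values, the overlap can be checked exhaustively; a small Python sketch (the names are illustrative, not from any particular library):

```python
from collections import Counter

MIN, MAX = 50, 200
span = MAX - MIN + 1  # 151 possible outputs

# Push every possible byte value n through min + mod(n, span).
counts = Counter(MIN + (n % span) for n in range(256))

# Outputs 50..154 have two byte preimages each (n and n + 151);
# outputs 155..200 have only one, hence the 2:1 skew.
assert all(counts[v] == 2 for v in range(50, 155))
assert all(counts[v] == 1 for v in range(155, 201))
```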

We need the random distribution to be even across the requested range ($50$ to $200$ in this example).

What's the simplest way to achieve this?

More complicated operations, such as logarithms, will cause a severe drain on performance.
So we'd like to stay within the realm of simple arithmetic, if at all possible.

For bytes where $n>max-min$, could we subtract $(max-min)$ from $n$, and then add the resulting difference to the next byte in the array?
This is a possible solution I'm considering - but I'm confused about how it would work.
Are there any pitfalls or nuances that apply here?
How would this type of solution work?

Are there any other solutions that would provide a consistent, even random distribution without draining performance?

6 Answers

Best answer (score 6)

There is a way that would not waste so much entropy in the random source, and also has optimal expected time per call. We wish to construct a universal RNG from a given RNG rand that outputs a uniformly random value in the range [0..k-1]. In your question k=256.

# k is the size of the source's output range; here k = 256
p = 0
q = 1

def rnd(c):
    global p, q
    while True:
        if q >= c:
            if p < q - q % c:
                v = p % c
                p = p // c
                q = q // c
                return v
            else:
                p = p % c
                q = q % c
        r = rand()
        p = p * k + r
        q = q * k

How this works: the pair $p,q$ represents an unused random choice $p$ in the integer range $[0..q-1]$. We call rand just enough times to expand that range, and our choice within it. Whenever our random choice is less than $q-q\%c$ (the largest multiple of $c$ that is at most $q$), we can return $p\%c$, because conditioned on that event it is equally likely to be any residue modulo $c$; the quotient $p//c$ then remains an unused uniform choice in $[0..q//c-1]$, which we keep for the next round. Otherwise we remove that largest multiple of $c$ from both $p$ and $q$ (note that in this branch $p\%c = p-(q-q\%c)$, which is why the code can use the same modulo operation). In the implementation above, the extra check q>=c is redundant, but may increase efficiency if $c$ is large compared to $k$.


I have tested it and it achieves about 95% (entropy) efficiency for c=3 and about 90% efficiency for c=150.

After thinking a bit, I realized that I was wrong to claim that it is entropy optimal. The missing entropy goes into the choice between the two if cases. There is actually a way to fix this, but it is not simple to implement, and when I implemented just one level it only improved the efficiency slightly, so it is quite pointless.
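For reference, here is a self-contained, runnable version of the routine above in plain Python, with the byte source stubbed out by the standard `random` module (a real implementation would substitute the cryptographic generator), plus a quick frequency check:

```python
import random

K = 256  # size of the source's output range; rand() is uniform on [0, K-1]

def rand():
    # Stand-in for the cryptographic byte source.
    return random.randrange(K)

# State: p is an unused uniform random choice in the range [0, q-1].
p, q = 0, 1

def rnd(c):
    """Return a uniform random integer in [0, c-1], recycling leftover entropy."""
    global p, q
    while True:
        if q >= c:
            limit = q - q % c          # largest multiple of c that is <= q
            if p < limit:
                v = p % c              # uniform residue modulo c
                p, q = p // c, q // c  # keep the quotient as fresh random state
                return v
            # p fell in the leftover [limit, q-1]; note p % c == p - limit here.
            p, q = p % c, q % c
        r = rand()                     # widen the range with another byte
        p, q = p * K + r, q * K

# Quick frequency check: 30,000 draws from [0, 2].
counts = [0, 0, 0]
for _ in range(30_000):
    counts[rnd(3)] += 1
print(counts)  # each count should be near 10,000
```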

Answer (score 5)

Another simple (and perhaps already considered and discarded) solution would be to reject some values. That is, if you want your random bytes to be between 50 and 200, you could simply discard any random byte outside that range and wait for one that satisfies the desired condition. It can be proven that this gives a uniform distribution over the desired range, and statisticians use it routinely (it is rejection sampling), although to a software engineer I understand it will sound like a waste of resources.
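A minimal sketch of that rejection loop (`next_byte` is a stand-in for the cryptographic byte source):

```python
import random

def next_byte():
    # Stand-in for the cryptographic byte source.
    return random.randrange(256)

def uniform_between(lo, hi):
    """Uniform integer in [lo, hi]: discard bytes outside the range."""
    while True:
        n = next_byte()
        if lo <= n <= hi:
            return n

samples = [uniform_between(50, 200) for _ in range(1000)]
assert all(50 <= s <= 200 for s in samples)
```

For the 50-to-200 example the acceptance rate is $151/256 \approx 59\%$, so on average fewer than two source bytes are consumed per output.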

This sounds especially problematic if the range is short, but in fact once the range is half the maximum possible range (for instance, $100$ to $227$, a span of $128$), you can split the full range $[0,255]$ into two halves and map either onto it without any overlapping. If the range is shorter still, you start discarding some values again, but mostly you will not; and so on when the range is (approximately) $256/3$, $256/4$, etc.

IMPORTANT: What comes in the next paragraphs, against my initial intuition, does not give a uniform distribution. It is well known that the sum of uniform r.v.'s is not uniform, although the r.v. that results from the procedure proposed here is far from just a sum of uniform r.v.'s. For now, I leave it as a solution that will not work, and I'll try to give an idea of why that is as soon as I can.

If you don't want to discard anything, maybe a more complicated pattern adding auxiliary random bytes could do the trick. If we want, again, numbers between 50 and 200, we take for every random byte $U\le 150$ the number $U+50$. If $U>150$, then we look for another random byte $V$ as a complement, and take $$U+V-101,\quad \text{if} \quad 151\le U+V\le301$$ and $$U+V-252 \quad \text{if} \quad 302 \le U+V\le 452.$$

Finally, we can discard the remaining cases or search for yet another complementary random byte; we might let the process finish only in this way, or stop after a preset maximum number of complementary bytes is reached.

Answer (score 9)

Rejecting values should work (as mentioned in Alejandro's answer) and should not take too much time. In particular, choose $k$ such that $2^{k} > max - min \geq 2^{k-1}$. Now generate $k$ random bits and check whether the number they form lies in the interval $[0, max - min]$. All of these operations are extremely basic (finding $k$ reduces to bit shifts), and you need at most $2$ trials on average. (In fact, the number of trials until success follows a geometric distribution with parameter at least $1/2$; the probability that a geometric distribution with parameter $p$ takes the value $N$ is $(1-p)^{N-1}p$, so only about $1$ in $2^{500}$ runs would you expect to take $500$ trials.)
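A sketch of that bit-level rejection (using Python's `getrandbits` as a stand-in for pulling $k$ bits from the byte stream):

```python
import random

def uniform_span(span):
    """Uniform integer in [0, span], drawing only k = span.bit_length() bits."""
    k = span.bit_length()          # so that 2**k > span >= 2**(k-1)
    while True:
        r = random.getrandbits(k)  # k fresh random bits as an integer
        if r <= span:              # accepted with probability > 1/2
            return r

def uniform_between(lo, hi):
    return lo + uniform_span(hi - lo)

samples = [uniform_between(50, 200) for _ in range(1000)]
assert all(50 <= s <= 200 for s in samples)
```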

Of course, if division is allowed, you can always divide $2^n$ by the number of values you are trying to generate ($max - min + 1$). Then, again, you are looking at a geometric distribution.

I don't know any way to do it without discarding values; it seems like the easiest way to do it.

Edit: You should check for off-by-one ($\pm1$) errors if you are implementing this.

Answer (score 1)

After some looking around (see below), I would say: use the C++11 standard library if you can; otherwise, discarding out-of-range source bytes seems the easiest way. The thorough way, like reimplementing the Mersenne-twister-based PRNG approach from the C++11 standard library, does not seem easy.

My stroll:

That question seems to have been pondered here ("How do I scale down numbers from rand()?").

The slide presentation in this answer ("rand() Considered Harmful | GoingNative 2013") seems interesting.

But analyzing the C++ source for uniform_int_distribution() seems complicated according to this ("c++11 random number generation: how to re-implement uniform_int_distribution with mt19937 without templates").

I would dive deeper into this answer to "Generating a uniform distribution of INTEGERS in C".

BTW my naive thinking was:

The real valued linear transformation from $A=[0, 255]$ to $B=[\min, \max]$ is \begin{align} t(x) &= \min \cdot \left(1- \frac{x}{255} \right) + \max \cdot \frac{x}{255} \\ &= \min + (\max - \min) \cdot \frac{x}{255} \end{align} So the question would be how to implement this in a good way with integer arithmetic.

The above "rand() Considered Harmful" video refers to the troubles with this approach. ("DO NOT MESS WITH FLOATING-POINT") :-)
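For what it's worth, the integer-arithmetic version of that transformation is a one-liner, and its bias is easy to verify exhaustively (a sketch, not a recommendation):

```python
from collections import Counter

def scale(n, lo, hi):
    # Integer version of t(x) = lo + (hi - lo) * x / 255
    return lo + (hi - lo) * n // 255

counts = Counter(scale(n, 50, 200) for n in range(256))

# Every output in [50, 200] is reachable...
assert min(counts) == 50 and max(counts) == 200
# ...but 256 inputs cannot spread evenly over 151 outputs:
# 105 outputs have two preimages and 46 have only one.
assert sorted(set(counts.values())) == [1, 2]
```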

Answer (score 9)

This question reminds me of a similar but puzzle-oriented one. We may use the cryptographic random byte generator as a stream of random bits: if we start with some interval $[a,b]$, at each step we may select its right/left half according to the generated random bit. If we want to generate a random integer in the interval $[M,N]$, we may apply the above procedure to the interval $\left[M-\frac{1}{2},N+\frac{1}{2}\right)$ and stop the generation of random bits as soon as it becomes impossible to leave the $\frac{1}{2}$-neighbourhood of some integer point.

The waste of information is close to zero; the tricky part is just to implement this in integer arithmetic: each step can be encoded as a simple manipulation on the binary representation of $\frac{1}{L}$, with $L$ being the length of $[M,N]$. The algorithm almost always stops in $\log_2 L+O(1)$ steps, so the computation of $2\log_2 L$ digits of the binary representation of $\frac{1}{L}$ is almost surely enough to write the above algorithm in integer arithmetic and avoid recomputations/rejections.

Note: up to the extraction of a further bit we may assume that $L$ is odd without loss of generality. This ensures that the binary representation of $\frac{1}{L}$ is purely periodic, with the length of the period equal to the order of $2$ in $(\mathbb{Z}/L\mathbb{Z})^*$.
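A sketch of the bisection idea in Python, using exact rationals for clarity instead of the fixed-point encoding via the binary expansion of $\frac{1}{L}$ described above (`random.getrandbits(1)` stands in for the cryptographic bit stream):

```python
import math
import random
from fractions import Fraction

def uniform_int(M, N, rand_bit=lambda: random.getrandbits(1)):
    """Uniform integer in [M, N], by bisecting [M - 1/2, N + 1/2)."""
    half = Fraction(1, 2)
    lo, hi = M - half, N + half
    while True:
        # Stop once the interval cannot leave the 1/2-neighbourhood
        # of a single integer point i.
        i = math.floor(lo + half)  # the integer whose neighbourhood contains lo
        if hi <= i + half:
            return i
        mid = (lo + hi) / 2        # halve according to the next random bit
        if rand_bit():
            lo = mid
        else:
            hi = mid

print([uniform_int(50, 200) for _ in range(5)])  # five draws from [50, 200]
```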


Answer (score 1)

(OP hasn't responded to my question about floating point, but I'm posting this mostly as an example for user21820.)

We may choose to think of the supply of random bytes as providing a (virtually) infinitely long mixed radix integer (that is conveniently left-aligned, but inconveniently, we only find out the radix of the next digit as we come to it). At each call to the function, we draw more random bytes until we have acquired enough bits to determine what the next output digit is. This is equivalent to arithmetic decoding.

Throughout, we follow the interval $[p,p+w)$, a range of real numbers guaranteed to contain the rest of the infinitely long random binary number. As bits are drawn, the width, $w$, of this interval decreases by a factor of $2$; depending on the bit, we either keep the low half ($0$) or the high half ($1$) of the interval. When the entire interval fits in a single bin (definition coming soon), we output that bin and rescale $p$ and $w$ as if that bin were the interval $[0,1)$, preparing for the next call.

When we call our random number function to generate an integer in $[0,c)$, we divide the interval $[0,1)$ into $c$ bins (by multiplying by $c$ and using the integer parts of the unit intervals in $[0,c)$ as the labels for the bins). The interval $[p,p+w)$ is likewise scaled to $[cp, cp+cw)$. If a prior call to the function required reading ahead several bits to resolve in which bin the interval fell, the interval $[cp, cp+cw)$ may be so small that it already fits in a single bin. If not, we draw bits, halving the width of the interval until it does fit.

A Mathematica implementation, with a bit of monitoring code, demonstrating usage.

foo = Module[{
      p, w, k, kmant, mant, rnd,
      rndCount, resetCount, vec
    },
    p = 0.;
    w = 1.;
    k = 256;
    kmant = 8;  (* = log_2(k) *)
    mant = 16;  (* bits of mantissa in p (and w) *)

    rndCount = 0;
    resetCount = 0;

    rnd[c_] := Module[{retVal, r},
        (*  Return random integer in [0,c).  *)
        If[c < 2,  (* then *)
          retVal = 0
          ,  (* else  *)
          p = p*c;
          w = w*c;
          (*  There are much sneakier ways to write the next two conditions.  *)
          While[Floor[p] != Ceiling[p + w] - 1,
            If[p > 2^(mant - kmant) w,
              (*  If width is so small that p + w/k loses precision, 
                  restart p and w.  Only happens if random bits conspire 
                  to make [p,p+w) persistently straddle an integer.  *)
              p = 0.;
              w = c + 0.;
              resetCount++;
            ];
            r = RandomInteger[k - 1];  (* random integer in [0,k-1] *)
            rndCount++;
            p = p + r w/k;
            w = w/k;
          ];
          retVal = IntegerPart[p];  (* For this and next, in C, see modf(). *)
          p = FractionalPart[p];
        ];
        retVal
    ];

    (*  generate one million random integers in the range [0,3) = [0,2].  *)
    vec = Table[rnd[3], {10^6}];  
    (*  report the stats for this run  *)
    Print[{rndCount, N[100 rndCount/(10^6 Log[k, 3])], resetCount}];
        (* Log[k, 3] = log(3)/log(k) *)

    (*  return list of integers to foo  *)
    vec
];

(*  Output example: 

    {198998,100.443,681}

    * Drew 198998 random bytes, 100.443% of entropy required to uniquely 
      select one outcome from 3^(10^6) possible outcomes (disregarding 
      conspiracies in the random number generator for resolving which 
      bin the last member of the sequence lies in).
    * Had to reset 681 times due to the risk of precision loss.  This 
      resulted in 198998-198121=877 drawn random bytes being discarded.
*)
(*
      One can also 
  Histogram[foo]
      or
  Length/@Split[Sort[foo]]
      to decide whether uniformity was attained.  The run with the above 
      stats had output counts of
  {332942, 333926, 333132}

      We could even run a chi-squared test to see if the counts above are 
      sufficiently extreme to reject that the data is drawn from the 
      uniform distribution.
  PearsonChiSquareTest[foo, DiscreteUniformDistribution[{0, 2}], 
  "TestDataTable"]
      We find that our test statistic is 1.63479... compared to a chi-
      squared distribution with two degrees of freedom.  The resulting p-
      value is 0.44158...  (That is, only 44.158...% of data drawn from 
      the uniform distribution would have output counts closer to the 
      expected value than these.)  This data is not sufficiently 
      extreme to reject that it is drawn from the uniform distribution.
*)

It is possible to implement this entirely in (arbitrary precision) integers. But this cannot be done (exactly) in finite precision since the random bit source may conspire to require arbitrarily large read-ahead to resolve to which side of a bin boundary the interval eventually falls. (Although, such long read ahead is exponentially unlikely -- at each new bit, either the interval falls entirely to one side of the boundary or it does not, with equal probability.)

user21820's answer is an approximate implementation of this idea, representing the interval as $\left[ \frac{p}{q}, \frac{p+1}{q} \right)$ in integers $p$ and $q$, "barber-poling" the integers in $[0,q-1]$ so that incrementing $p$ increments the represented output value ($p=0$ represents the left $c/q$-wide subinterval corresponding to $v=0$, $p=1$ represents the left $c/q$-wide subinterval corresponding to $v=1$, ..., $p=c-1$ represents the left $c/q$-wide subinterval corresponding to $v = c-1$, $p=c$ represents the second $c/q$-wide subinterval corresponding to $v=0$, and so on). Note that this can't represent an interval inside a single bin until $q \geq c$. Also, at the end, when ready to select an output bin, the interval represented by $p$ and $q$ is (very, very likely) less than a unit wide, but $p$ and $q$ are altered to exactly match the bin with integer divisions. (This side-steps the need for arbitrary-precision integers by (very, very likely) discarding a fractional bit on each output.)