Using random hexadecimal characters to generate an even distribution of random numbers within an arbitrary base-10 range


I'm using a random number generator to produce a huge string of random hexadecimal characters which I then cache and pull from to generate base-10 integers within a requested range. The original (flawed) steps looked like:

  1. Request a random base-10 integer within certain bounds
    e.g.: getRandomInt(200, 250), which has a range of 250 - 200 = 50
  2. Determine the minimum number of hex characters required to satisfy that range
    e.g.: for a range of 50, we need 2 hex chars (which cover 0-255)
  3. Pull that many hex characters from the hexadecimal cache
    e.g.: "3A"
  4. Convert those hexadecimal characters to a base-10 integer
    e.g.: $3A_{16} = 58_{10}$
  5. Use a modulus function to ensure the resulting integer is within the desired range
    e.g.: 58 % 50 = 8
  6. Add this to the lower bound for the final result
    e.g.: 200 + 8 = 208
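For reference, the six steps can be sketched in Python roughly as follows. This is only an illustration: the function name `get_random_int_naive` and the `hex_chars` argument (standing in for the pull from the hex cache) are assumptions, not the asker's actual code.

```python
def get_random_int_naive(low, high, hex_chars):
    """Follow the six steps above; `hex_chars` stands in for the cache pull."""
    span = high - low                 # step 1: size of the range (50 for 200..250)
    digits = 1
    while 16 ** digits < span:        # step 2: fewest hex digits covering the span
        digits += 1
    pulled = hex_chars[:digits]       # step 3: take that many hex characters
    value = int(pulled, 16)           # step 4: hex -> base-10 integer
    offset = value % span             # step 5: fold into the range with a modulus
    return low + offset               # step 6: shift by the lower bound


print(get_random_int_naive(200, 250, "3A"))  # 208
```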

I recently realized that this biases the results towards lower numbers whenever the size of the range doesn't evenly divide the number of possible hex values (a power of 16). e.g.: if you request a number in the range [0,11], then $0_{16}$ becomes $0_{10}$, $A_{16}$ becomes $10_{10}$, and $B_{16}$ becomes $11_{10}$, but $C_{16}$ also becomes $0_{10}$ again, giving you a 2/16 chance of generating a 0 but only a 1/16 chance of generating a 10.
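The bias is easy to confirm by enumerating every possible hex digit instead of sampling. The sketch below assumes a single hex digit and a modulus of 12 (the range [0,11]); the variable names are illustrative.

```python
from collections import Counter

HEX_DIGITS = "0123456789ABCDEF"

# Count how many of the 16 equally likely hex digits land on each output.
counts = Counter(int(d, 16) % 12 for d in HEX_DIGITS)

for value in range(12):
    print(value, f"{counts[value]}/16")
```

The outputs 0 through 3 each come up 2 times out of 16 (C, D, E and F wrap around onto them), while 4 through 11 come up only once.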

Potential Solution?

After chatting with ChatGPT and facing how little math I math, I've modified the above Step #3 to pull 1 extra hexadecimal character. So if you request a random base-10 integer between [200,250] (a range of 50, which is satisfied by 2 hex chars), you'll no longer pull 2 hex characters but 3.

This seems to solve the issue and produce evenly distributed results for all ranges I've tested, but I can't say for certain if/why it works. I can kinda make sense of it when relating it to a random floating point number (e.g.: my intuition tells me that using hex characters to generate a random number such as 0.123456780, multiplying it by 10, and then removing the decimal portion to produce a random integer in the range [0,9] would be void of any bias), and I assume the same principle is at play here? With the principle being something along the lines of we can add some amount of excess to the end of our randomly generated number and trim it off to remove biases. But I don't know:

  1. if the implemented solution is actually removing biases
  2. if any of the above conjecture is true
  3. if it is true, whether it applies to the above 6 steps
  4. if it is true and applicable, how to determine the amount of excess which should be added to / trimmed from the end to ensure no biases are produced for a given range
  5. if there is a better solution completely different from my approach

There are 1 best solutions below


How many random digits do we need?

  2. Determine the minimum number of hex characters required to satisfy that range
    e.g.: for a range of 50, we need 2 hex chars (which cover 0-255)

The problem with this approach, as you have noted, is that even if your random hex digit generator gives perfect independent and uniformly-distributed digits, the output will be biased. With 256 hex sequences mapped to 50 different outputs, you'll get

  • 44 of 50 values with a probability of 5/256 (0.01953125).
  • 6 of 50 values with a probability of 6/256 (0.0234375).

A simple workaround for this bias is to request more hex digits than you need. For example, I'll propose the rule that we determine the minimum number of hex digits and then add 2. So for a range of 50, we get 4 hex digits, with 65536 possible combinations. This would give us:

  • 14 of 50 values with a probability of 1310/65536 (0.019989013671875)
  • 36 of 50 values with a probability of 1311/65536 (0.0200042724609375)

Which is still not perfect, but for most purposes the bias is now small enough not to be noticeable.
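These counts can be checked with exact integer arithmetic. The helper below is a sketch (the name `modulo_distribution` is made up for illustration); it reports how the $16^k$ equally likely hex strings split across the $n$ outputs when a modulus is used.

```python
def modulo_distribution(num_digits, n):
    """Return (base, outputs_with_base, outputs_with_base_plus_one).

    With 16**num_digits equally likely inputs reduced modulo n, `extra`
    outputs occur base + 1 times and the remaining n - extra occur base times.
    """
    total = 16 ** num_digits
    base, extra = divmod(total, n)
    return base, n - extra, extra


for digits in (2, 3, 4):
    base, low_count, high_count = modulo_distribution(digits, 50)
    print(f"{digits} digits: {low_count} values x {base}, "
          f"{high_count} values x {base + 1}, out of {16 ** digits}")
```

For the asker's "one extra digit" variant (3 hex digits), this gives 4 values occurring 81 times and 46 values occurring 82 times out of 4096, so the extra digit shrinks the bias but does not remove it.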

Rejecting the range bias

If you really want an exactly uniform distribution (a probability of exactly 0.02 for each of the 50 values), you can make a slight modification to the algorithm:

  1. Let $r$ = the next 4 hex digits from your cache (interpreted as an integer between 0 and 65535).
  2. If $0 \le r < 65500$ (the cutoff being the highest exact multiple of 50 within range), then return $\lfloor r/1310 \rfloor$ as your random number.
  3. Otherwise ($r \ge 65500$), go back to Step 1 and try again.
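A minimal Python sketch of this test-and-reject loop follows. The function names are hypothetical, and `next_hex_digits` uses the standard-library `secrets` module purely as a stand-in for the asker's hex cache.

```python
import secrets


def next_hex_digits(count):
    """Stand-in for pulling `count` hex characters from the cache."""
    # secrets.token_hex(n) returns 2 * n hex characters, so slice to `count`.
    return secrets.token_hex(count)[:count]


def get_random_below(n, num_digits, pull=next_hex_digits):
    """Return a uniform integer in [0, n) using rejection sampling."""
    total = 16 ** num_digits
    if n < 1 or n > total:
        raise ValueError("n must be between 1 and 16**num_digits")
    bucket = total // n        # 1310 for n = 50 with 4 digits
    cutoff = bucket * n        # 65500: highest multiple of n that fits
    while True:
        r = int(pull(num_digits), 16)
        if r < cutoff:         # accept: every output owns exactly `bucket` values
            return r // bucket
        # otherwise reject r and draw fresh digits


print(200 + get_random_below(50, 4))  # some value in 200..249
```

Passing `pull` as a parameter keeps the digit source swappable, so the same loop could read from the real cache instead of `secrets`.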

Clarification (per discussion in comments): Yes, you could try this test-and-reject approach using the minimum 2 hex digits, without having to obtain extra digits. However, doing so would increase the rejection probability.

For example, if you retrieve 2 hex digits (with 256 possible states) and accept values in $[0, 250)$, you reject $\frac{6}{256} = 0.0234375$ of the random numbers. But with 4 hex digits, accepting values in $[0, 65500)$, you reject only $\frac{36}{65536} = 0.00054931640625$.
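The rejection rates quoted above come straight from the remainder of $16^k$ divided by the range size, which a short sketch can confirm exactly (the helper name `rejection_probability` is illustrative):

```python
from fractions import Fraction


def rejection_probability(n, num_digits):
    """Fraction of draws rejected when mapping num_digits hex digits to [0, n)."""
    total = 16 ** num_digits
    return Fraction(total % n, total)   # leftover values / all values


for digits in (2, 3, 4):
    p = rejection_probability(50, digits)
    print(digits, p, float(p))
```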

Determining the output

  5. Use a modulus function to ensure the resulting integer is within the desired range
    e.g.: 58 % 50 = 8

This rand() % n approach is popular among C programmers, but if you want a random number between 0 (inclusive) and $n$ (exclusive), it's better to use $\lfloor \frac{rn}{m} \rfloor$ (where $r \in [0, m)$ is the random variable). That way, if your random number generator is biased, the bias will be spread evenly throughout the output range, instead of systematically favoring small numbers.
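To see the difference between the two mappings, the sketch below enumerates all $m = 256$ two-digit hex values for $n = 50$ and lists which outputs receive the extra sixth hit under each method. With a uniform source both methods over-represent exactly six outputs; only their placement changes.

```python
from collections import Counter

M, N = 256, 50   # m possible inputs, n possible outputs

mod_counts = Counter(r % N for r in range(M))
scaled_counts = Counter(r * N // M for r in range(M))   # floor(r * n / m)

# Outputs that occur 6 times instead of 5.
heavy_mod = [v for v in range(N) if mod_counts[v] == 6]
heavy_scaled = [v for v in range(N) if scaled_counts[v] == 6]

print(heavy_mod)      # [0, 1, 2, 3, 4, 5]
print(heavy_scaled)   # [0, 8, 16, 25, 33, 41]
```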

Stretching the randomness

According to a comment from the asker, the hex digits are produced with a quantum random number generator (QRNG) that produces approximately 6800 hex digits per second. This might be adequate for your purposes, but if you need a lot of random numbers real fast, it might not be enough.

If this is an issue, then you can use a hybrid approach: Instead of using the QRNG directly for all random numbers, use it to seed a faster PRNG (that has been tested to meet your specific requirements for uniformity or cryptographic security) and use that for your output. Then periodically reseed it, to break any unwanted patterns that show up in the PRNG output. Something like:

```
// Pseudocode: PRNG, QRNG, MAX_PRNG_ITERATIONS and PRNG_STATE_DIGITS
// are placeholders for whatever generator and limits you choose.
int get_random(int range)
{
    static int counter = 0;

    // Reset the counter after MAX_PRNG_ITERATIONS calls.
    if (counter == MAX_PRNG_ITERATIONS)
    {
        counter = 0;
    }

    // (Re)seed the PRNG as needed, using the QRNG.
    if (counter == 0)
    {
        PRNG.seed(QRNG.get_digits(PRNG_STATE_DIGITS));
    }

    ++counter;

    return PRNG.get_random(range);
}
```
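A runnable Python version of the same idea might look like the sketch below. Everything in it is an assumption for illustration: `qrng_get_digits` fakes the QRNG with the standard-library `secrets` module, the constants are arbitrary, and `random.Random` (a Mersenne Twister, not cryptographically secure) merely stands in for whichever PRNG passes your own tests.

```python
import random
import secrets

MAX_PRNG_ITERATIONS = 10_000   # assumed reseed interval
SEED_HEX_DIGITS = 64           # assumed number of hex digits per seed


def qrng_get_digits(count):
    """Stand-in for the QRNG: returns `count` random hex characters."""
    return secrets.token_hex((count + 1) // 2)[:count]


class ReseedingGenerator:
    """PRNG that is reseeded from the hex source every `max_iterations` calls."""

    def __init__(self, get_digits=qrng_get_digits,
                 max_iterations=MAX_PRNG_ITERATIONS):
        self._get_digits = get_digits
        self._max_iterations = max_iterations
        self._prng = random.Random()
        self._counter = 0
        self.reseed_count = 0

    def get_random(self, n):
        """Return an integer in [0, n)."""
        if self._counter == self._max_iterations:
            self._counter = 0
        if self._counter == 0:
            # (Re)seed the PRNG from fresh hex digits.
            self._prng.seed(int(self._get_digits(SEED_HEX_DIGITS), 16))
            self.reseed_count += 1
        self._counter += 1
        return self._prng.randrange(n)


gen = ReseedingGenerator()
print(200 + gen.get_random(50))   # some value in 200..249
```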