What is the optimal value of x in a two-stage 1-1000 random number generator algorithm, where x is the "split point"?


I would like a mathematical answer to this. Suppose I have a random number generator (say, uniform over 1 to 1000) and I simply pick a random number over and over until I have seen each of the 1000 possible numbers at least once. As we get closer to having seen all 1000 unique numbers, finding the next one not yet seen becomes more and more difficult and thus takes more time, with the last few "needed" numbers being the most time-consuming. For example, if we have already seen 980 unique numbers, each draw has only a 20 in 1000 chance of hitting one of the 20 numbers still missing.

To possibly save time (by making it easier to find the last few needed numbers), suppose at some point (for example, after seeing 750 of the numbers) we instead copy the 250 remaining unseen numbers to an array, then generate random numbers between 1 and 250 until all of those have been seen, and finally merge the results with the first part (750 + 250 = 1000).
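The two-stage scheme described above can be sketched as follows (a minimal Python sketch; the function name `collect_two_stage` and the seed handling are my own, not part of the question):

```python
import random

def collect_two_stage(n=1000, x=750, seed=0):
    """Stage 1: draw uniformly from 1..n until x distinct values are seen.
    Stage 2: re-index the n-x still-unseen values as 1..(n-x) and draw
    from that smaller range until all of them have appeared too.
    Returns (total RNG calls, collisions)."""
    rng = random.Random(seed)
    calls = 0

    # Stage 1: full 1..n range until x distinct numbers have appeared.
    seen = set()
    while len(seen) < x:
        seen.add(rng.randint(1, n))
        calls += 1

    # Stage 2: copy the unseen numbers to a list, then draw indices
    # from the smaller range 1..(n-x) until every index has appeared.
    missing = [v for v in range(1, n + 1) if v not in seen]
    seen_idx = set()
    while len(seen_idx) < len(missing):
        seen_idx.add(rng.randint(1, len(missing)))
        calls += 1

    # Every value 1..n has now been hit exactly once "successfully",
    # so collisions are simply the extra calls beyond n.
    return calls, calls - n
```

Counting RNG calls and collisions directly like this, rather than wall-clock time, sidesteps the caching concerns mentioned below.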

So my question is: assuming the split itself takes little to no time (versus the major time wasted hunting for the last few numbers in the 1-1000 scheme), and using x as the split position (x = 750 in the example above), what is the mathematically optimal split position, i.e., the single value of x that minimizes the expected number of "collisions" (picking the same random number more than once)? For simplicity, say the 1-1000 scheme can be split only once. My guess is the optimum lies somewhere between 500 and 750, but I wonder what the mathematically correct answer is.

A timing-based computer simulation might not be helpful because of caching and other factors, but a program that counts the number of RNG calls and collisions, and minimizes those, would be useful.

While people are "chewing" away at this mathematically, I may write a program that counts the number of random numbers generated and the number of collisions, and tries every possible split position between perhaps 500 and 900 to find the optimum.

Food for thought but not part of the question: If for some reason 500 is the correct answer, then I wonder if splitting those again (such as 4 groups of 250 numbers) would be even better.

Just to give you a "feel" for what is happening here, I wrote a quick program and found that for a range of size 1000, I see on average about 7500 calls to the RNG (random number generator) function, about 6500 of which (all calls beyond the 1000 unique hits) are collisions. I ran this 1000 times and averaged. So we are at an 86.7% collision rate (6500/7500) to get all 1000 random numbers this way.

Note that if I stop after finding only 750 unique numbers (the last 250 not yet found), the 7500 figure drops to only 1384, and the collision rate drops from 86.7% to only 45.8% (634/1384). Stopping at x=500, I get 692.5 calls to the RNG function, i.e., only 27.8% collisions (192.5/692.5).
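The averages quoted above are easy to reproduce. A small sketch (my own helper, with the hypothetical name `avg_calls_until_unique`, not the poster's code) that counts RNG calls until a target number of distinct values has been seen:

```python
import random

def avg_calls_until_unique(n, target, trials, seed=1):
    """Average number of uniform draws from 1..n needed to see
    `target` distinct values, averaged over `trials` runs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = set()
        while len(seen) < target:
            seen.add(rng.randint(1, n))
            total += 1
    return total / trials

# Roughly matching the figures in the question (exact results vary by seed):
# avg_calls_until_unique(1000, 1000, 200) -> about 7485
# avg_calls_until_unique(1000, 750, 300)  -> about 1385
# avg_calls_until_unique(1000, 500, 300)  -> about 693
```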


There are 2 best solutions below


You might be interested in reading about the coupon collector's problem and the related questions and answers on this site.

The number of random number generations is $1000$ plus the expected number of collisions, so minimizing the number of collisions is equivalent to minimizing the number of random number generations, which is a bit simpler.

When you have $j$ out of $k$ numbers left to find, the expected number of random number generations to find the next one is $\frac kj$. If we split $n=1000$ at $x$, then we expect the first $x$ numbers to require

$$ \sum_{k=0}^{x-1}\frac n{n-k}=n(H_n-H_{n-x}) $$

random number generations, where $H_i$ is the $i$-th harmonic number, and the last $n-x$ numbers to require

$$ \sum_{k=x}^{n-1}\frac{n-x}{n-k}=(n-x)H_{n-x} $$

random number generations. (The second part is the standard coupon collector's problem for $n-x$ types of coupon.)

Thus the expected total number of random number generations is

$$ n(H_n-H_{n-x})+(n-x)H_{n-x}=nH_n-xH_{n-x}\;. $$

The first term doesn't depend on $x$ and is just the result for the standard coupon collector's problem for $n$ types of coupons. The second term represents the savings from the split, and we need to maximize it. Since $H_n\approx\gamma+\log n$, we can get an approximation from

$$ \frac{\mathrm d}{\mathrm dx}\left(x(\gamma+\log(n-x))\right)=\gamma+\log(n-x)-\frac x{n-x}=0\;. $$

The solution is $x\approx848.4$, and a table for $xH_{n-x}$ in the vicinity shows that the maximum occurs at $x=849$, with expected savings of

$$ 849H_{1000-849}\approx4752.53 $$

out of

$$ 1000H_{1000}\approx7485.47 $$

random number generations. Thus, with optimal choice of the split point, you can save almost two thirds of the random number generations.
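The optimum can be confirmed numerically from the exact formula $nH_n-xH_{n-x}$, without the logarithmic approximation (a quick sketch; the variable names are mine):

```python
n = 1000

# Precompute the harmonic numbers H[0..n], H[i] = 1/1 + 1/2 + ... + 1/i.
H = [0.0] * (n + 1)
for i in range(1, n + 1):
    H[i] = H[i - 1] + 1.0 / i

# Expected total draws with split at x is n*H[n] - x*H[n-x], so
# maximizing the savings term x*H[n-x] minimizes the total.
savings = {x: x * H[n - x] for x in range(1, n)}
best_x = max(savings, key=savings.get)
# best_x comes out to 849, with savings of about 4752.53
# out of n*H[n] ~ 7485.47 draws for the unsplit scheme.
```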


I suppose this answer can be found fairly easily on a computer by starting with no split and then searching iteratively for the split position that yields the smallest number of collisions. However, one thing I noticed while implementing the no-split solution is that it is efficient (in both runtime and ease of coding) to collect the first 996 unique random numbers, then draw a single random number from 1 to 24 (or 0 to 23) and use it to pick the order (one of 4! = 24 permutations) of the last 4 "missing" numbers out of the original 1000. That alone speeds it up by about 30%, but even that did not beat the runtime of the Fisher-Yates algorithm.
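For reference, the Fisher-Yates shuffle mentioned above produces a full random permutation of 1..n with exactly n-1 RNG calls and zero collisions, which is why it is hard to beat (a standard textbook sketch, not the poster's code):

```python
import random

def fisher_yates(n, rng=None):
    """Fisher-Yates shuffle of [1..n]: n-1 RNG calls, no collisions."""
    rng = rng or random.Random()
    a = list(range(1, n + 1))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)     # pick j uniformly from 0..i inclusive
        a[i], a[j] = a[j], a[i]   # swap the chosen element into place
    return a
```

Of course, this solves a slightly different problem (producing all n numbers in random order up front, rather than sampling with replacement), which is exactly why it avoids collisions entirely.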

I still have to try implementing the split at x=849 and time it.