I would like a mathematical answer to this. If I have a random number generator algorithm (let's say 1 to 1000), and I simply pick a random number over and over until I see each of the 1000 possible numbers at least once, as we get closer and closer to 1000 unique numbers seen, finding the next one not yet seen becomes more and more difficult and thus takes more time (with the last few "needed" being the most difficult and most time consuming). For example, if we have already seen 980 unique random numbers, then we only have a 20 in 1000 chance of choosing the last "missing" random number to complete all 1000.
To possibly save time (by making it easier for the algorithm to find the last few random numbers needed), suppose at some point (for example, after seeing 750 of the random numbers), we instead just copy those 250 remaining unseen numbers to some array (a list of nonseen numbers), and then just generate a random number between 1 and 250 (for example) until all of those are seen, and then merge the results with the first part (750+250=1000).
So my question is, to save the most time (assuming it takes little to no time to make the split, vs. the major time wasted trying to find the last few numbers in the 1-1000 scheme), using x as the split position (in the above example, x=750), what is the optimal split position mathematically speaking so that the fewest number of "collisions" (picking the same random number more than once) happens. That is, choose a single value of x such that the # of collisions is minimized. For simplicity, we will say you can only split the 1-1000 scheme once. My guess is it might be optimal somewhere between 500 and 750 but I wonder mathematically what the correct answer is.
A computer simulation solution might not be helpful because of caching and other factors, but a computer program to count up the number of random number events and # of collisions would be useful, minimizing those.
While people are "chewing" away at this mathematically, I may write a computer program to count up the # of random numbers generated, and # of collisions, and try to optimize it (by trying every possible split position between perhaps 500 and 900).
Food for thought but not part of the question: If for some reason 500 is the correct answer, then I wonder if splitting those again (such as 4 groups of 250 numbers) would be even better.
Just to give you a "feel" for what is happening here, I wrote a quick computer program and found that for an array of size 1000, I am seeing on average about 7500 calls to the RNG (Random Number Generator) function, of course about 6500 of them (1000 less) being collisions. I ran this 1000 times and got an average. So we are at a 86.7% collision rate (6500/7500) to get all 1000 random numbers this way.
Note that if I stop after finding only 750 unique random numbers (the last 250 not yet found), the 7500 number drops to only 1384 and the collisions drop from 86.7% to only 45.8% (634/1384). For x=500, I get 692.5 calls to the RNG function and 192.5/692.5 = only 27.8% collisions.
You might be interested in reading about the coupon collector's problem and the related questions and answers on this site tagged with coupon-collector.
The number of random number generations is $1000$ plus the expected number of collisions, so minimizing the number of collisions is equivalent to minimizing the number of random number generations, which is a bit simpler.
When you have $j$ out of $k$ numbers left to find, the expected number of random number generations to find the next one is $\frac kj$. If we split $n=1000$ at $x$, then we expect the first $x$ numbers to require
$$ \sum_{k=0}^{x-1}\frac n{n-k}=n(H_n-H_{n-x}) $$
random number generations, where $H_i$ is the $i$-th harmonic number, and the last $n-x$ number to require
$$ \sum_{k=x}^{n-1}\frac{n-x}{n-k}=(n-x)H_{n-x} $$
random number generations. (The second part is the standard coupon collector's problem for $n-x$ types of coupon.)
Thus the expected total number of random number generations is
$$ n(H_n-H_{n-x})+(n-x)H_{n-x}=nH_n-xH_{n-x}\;. $$
The first term doesn't depend on $x$ and is just the result for the standard coupon collector's problem for $n$ types of coupons. The second term represents the savings from the split, and we need to maximize it. Since $H_n\approx\gamma+\log n$, we can get an approximation from
$$ \frac{\mathrm d}{\mathrm dx}\left(x(\gamma+\log(n-x))\right)=\gamma+\log(n-x)-\frac x{n-x}=0\;. $$
The solution is $x\approx848.4$, and a table for $xH_{n-x}$ in the vicinity shows that the maximum occurs at $x=849$, with expected savings of
$$ 849H_{1000-849}\approx4752.53 $$
out of
$$ 1000H_{1000}\approx7485.47 $$
random number generations. Thus, with optimal choice of the split point, you can save almost two thirds of the random number generations.