Other questions like this start with an empty hash, and look for the expected number of hash insertions until the probability of a collision is 0.5. I'm starting off with a hash with existing keys, and I would like to know how many more keys can I expect to insert until I have a 50% chance of collision.
The "classic" birthday problem says there's a 50% probability that in a room of 23 people there's a collision. But let's say I have 23 people in my room already without a collision so far. How many more people need to enter the room until there's a 50% chance of collision? I can brute force these small numbers, the answer here is $8$:
$$0.5 \approx \frac{23}{365} + (\frac{1-23}{365})\frac{24}{365} + \cdots + (\frac{1-23}{365})(1-\frac{24}{365})\cdots(1-\frac{30}{365})\frac{31}{365}$$
But I'm having trouble generalizing this. I have an old dusty CS/Math BS from >10 years ago and I'm not very good with this MathJAX formatting so please bear with me.
Let
$H$ be the size of the hash space, in my case it is 64^8 entries.
$K$ be the current number of existing keys, in my case it is 10M keys.
$N$ be the number of keys at which I expect a 50% chance of collision.
P(collision at next insertion) = $\frac{K}{H}$
P(collision at cumulative subsequent insertions) = $\frac{K}{H} + (1-\frac{K}{H})\frac{K+1}{H}$ and so forth.
So I'm solving for $N$:
$$\frac{1}{2} = \frac{K}{H} + (1-\frac{K}{H})\frac{K+1}{H} + \cdots + \left[(1-\frac{K}{H})(1-\frac{K+1}{H})\cdots(1-\frac{K+N-1}{H})\frac{K+N}{H}\right]$$
This looks similar to the ole sum-of-a-series trick, I'll try to apply that pattern. Remove the first thing from each term...
$$\frac{1}{2(1-\frac{K}{H})} = \frac{K}{H(1-\frac{K}{H})} + \frac{K+1}{H} + \cdots + \left[(1-\frac{K+1}{H})\cdots(1-\frac{K+N-1}{H})\frac{K+N}{H}\right]$$
Now I'm thinking, $K$ is magnitudes smaller than $H$, so could I declare that $\frac{K}{H}$ is close enough to $\frac{K+1}{H}$ for this substitution? I'll subtract one from each $K + i$ where it makes sense, I think. The error I'm introducing ought to be pretty small.
$$\frac{1}{2(1-\frac{K}{H})} = \frac{K}{H(1-\frac{K}{H})} + \frac{K}{H} + \cdots + \left[(1-\frac{K}{H})\cdots(1-\frac{K+N-2}{H})\frac{K+N-1}{H}\right]$$
Great, now this looks like I can substitute the first series. It's not pretty though.
$$\frac{1}{2(1-\frac{K}{H})} = \frac{K}{H(1-\frac{K}{H})} + \frac{1}{2} - \left[(1-\frac{K}{H})(1-\frac{K+1}{H})\cdots(1-\frac{K+N-1}{H})\frac{K+N}{H}\right]$$
Cleaning up a little.
$$\frac{1}{2}\cdot\frac{K}{H-K} = \left[(1-\frac{K}{H})(1-\frac{K+1}{H})\cdots(1-\frac{K+N-1}{H})\frac{K+N}{H}\right]$$
$$\frac{1}{2}\cdot\frac{K}{H-K} = \frac{K+N}{H}\prod_{i=0}^{N-1}(1-\frac{K+i}{H})$$
Now I'd love to get the RHS into some term raised to a power of $N$ so I can apply logs to solve for $N$ but I'm not really sure how to do this. Any advice is appreciated.
I believe I've found a solution by reframing the problem. Suppose we have a room of 40 people. The possible universes can be graphed as such:
Now we want to "reroll" the dice, and find out the expected number of people required to half the probability of no-collisions:
Wikipedia provides an approximation for this:
$n(p; H) \approx \sqrt{2 H ln\frac{1}{1-p}}$
$n(0.891 + 0.0545; 365) \approx \sqrt{2 H ln\frac{1}{1-p}} \approx 46$.
We expect the room to accommodate six more people without a birthday collision.
For the values I'm working with:
$H = 64^8$
$K = 10^7$
$p(no\;collision) \approx e^{-K^{2}/2H} \approx 0.837$
$n(0.5815; H) \approx \sqrt{2 H ln\frac{1}{1-p}} \approx 2.21 \times 10^7$
If this hash had started with zero existing keys, I would expect the first collision to occur at 21M keys. Since 10M keys already exist, now I expect a collision to occur at 22.1M keys.
Jamming together the above approximations, I end up with:
$E(N) \approx \sqrt{2Hln2+K^2}$