There's a company which gives each customer a unique numeric ID (between 10000000-99990000). Every time a new customer record is created, the algorithm generates a random number and checks if the ID exists. If so, the algorithm repeats until it discovers a new unique ID.
So,
...
Set<Integer> takenIds = ...
Integer newId;
go = true;
while(go) {
newId = generateNewId();
go = takenIds.contains(newId);
}
processNewId(newId);
...
As IDs are consumed, the "cost" to calculate increases. e.g., the first ID will be 0% chance of collision (so, 0% cost, 100% efficient). The second ID will have a $\frac{1}{89,990,000}$% chance of collision, then $\frac{2}{89,990,000}$%, etc.
The CEO is worried this algorithm will be more and more inefficient as the customer base grows. He's not entirely sure, but he intuits that when the company reaches 89 million existing customers, for example, this algorithm will take an exponentially long time to calculate. Currently they have 10 million customers, what is the "cost" of the algorithm? Will the algorithm start to bottleneck when they approach 20 million? 30? 40? etc. The customer base is growing around 200k each year. When can they expect this to be an issue they can no longer ignore?
(note: I'm mostly interested in the math, so "you should make your customer ID pool bigger" or "you should optimize the algorithm" etc isn't a great answer -- I just want the math).
Birthday problem
I think this is an example the birthday problem, but most things I read / understand about the birthday problem pertain to the science of probability of matching pairs. So I'm not exactly sure if that's the case here. I feel like it's more related to ...
Simple probability?
I think the most intuitive answer is just simple probability. e.g. if 50% of IDs are used, then there's a 50% chance of a re-roll (two rolls), 25% chance of three rolls, 12.5% chance of four, etc. e.g. #2, When ID saturation is 90%, then there's a straight 90% chance of re-roll, 81% chance of three, 72.9% chance of four, etc.
Anyways, I'm getting a little dizzy trying to give insightful analysis on this. What's the best way to communicate probability of collisions and complexity of the calculation as the customer count grows?
What you are describing resembles the Coupon collector's problem. One very simple observation you can make is that if you have $N$ possible IDs and you have already assigned $k$ of them, then you will on average need to loop through $\frac{N}{N - k}$ draws before you find an unused one, since that particular event can be modelled via a geometric distribution.
If you have an idea of how long each draw takes, then you can multiply it by that value to estimate how long it will take for a new user to get assigned an ID.