This is a question posed by an engineer friend who had this probability problem arise at work. It goes like this:
Say we have 5000 marbles. Of them, we know that 50 are duplicates. They are not necessarily all the same duplicates, and not necessarily perfect distinct pairs of duplicates. So, we know that there are 4950 distinct marbles and 50 marbles that are duplicates.
Consider randomly bagging these marbles into 100 bags of 50 marbles. What is the probability that at least one bag has at least one pair of duplicate marbles? Another way of phrasing it: what is the probability that, after bagging, one or more bags contain two (or more) of the same marble? (The problem is a bit vague, so you can simplify it if you would like, for example by assuming that the duplicates are all copies of the same marble or that they all show up in perfect pairs, but an answer to the general question is preferred.)
I gave this a shot myself, but it is a bit more complicated than I am capable of solving with certainty. The engineers working with this problem seem to believe the probability is very low. However, I disagree and believe that the probability is fairly high. Any solution is welcome, including one that simplifies the problem to give a rough approximation (or what we hope would be an approximation). I'd prefer to see steps and reasoning that justify your answer and teach me how to approach such a problem. Thank you.


Here is a simulation run $10^4$ times using R. It makes no assumption about whether the extra $50$ marbles are all the same or all different or something else; each time it chooses them with replacement from the distinct set.
This seems to suggest that the probability that at least one bag has matches is about $39\%$, and quite often more than one bag does.
I would say this is not very low.
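For readers who want to try this themselves, a simulation of the kind described could be sketched in Python roughly as follows. This is not the R code used above, only a minimal sketch under the same stated assumption (the $50$ extra marbles are drawn with replacement from the $4950$ distinct ones); the function name `simulate_bagging` and its parameters are made up for illustration, and the default number of trials is smaller than $10^4$ to keep the run time short.

```python
import random
from collections import Counter

def simulate_bagging(trials=2000, distinct=4950, extras=50, bags=100, seed=0):
    """Return a Counter mapping 'number of bags with a match' to how often it occurred.

    Hypothetical helper: labels 0..distinct-1 are the distinct marbles, and the
    extras are copies of labels drawn with replacement from that set.
    """
    rng = random.Random(seed)
    bag_size = (distinct + extras) // bags
    outcomes = Counter()
    for _ in range(trials):
        # Build the 5000 marbles: every distinct label once, plus the random copies.
        marbles = list(range(distinct))
        marbles += [rng.randrange(distinct) for _ in range(extras)]
        rng.shuffle(marbles)
        # A bag has a match if it holds fewer distinct labels than marbles.
        matched = 0
        for b in range(bags):
            bag = marbles[b * bag_size:(b + 1) * bag_size]
            if len(set(bag)) < bag_size:
                matched += 1
        outcomes[matched] += 1
    return outcomes

if __name__ == "__main__":
    trials = 2000
    outcomes = simulate_bagging(trials=trials)
    print("P(at least one bag with a match) ~", 1 - outcomes[0] / trials)
    for k in sorted(outcomes):
        print(k, "bags with a match:", outcomes[k] / trials)
```

With only a couple of thousand trials the estimate will wobble by a few percentage points from run to run, so more trials are needed for a steadier figure.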
Alternatively, here is a crude approximation to the same result, taking short-cuts which are not quite correct:
That is suspiciously close to the simulation results.
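One crude calculation that lands on the same figure can be sketched in Python as below. It is my assumption about what such a short-cut might look like, not necessarily the one taken above: it pretends each of the $50$ extra marbles independently ends up in the same bag as the marble it copies with probability $49/4999$, ignoring repeated copies of the same marble and the dependence between marbles. The variable names are illustrative.

```python
# Crude approximation: treat the 50 extra marbles as independent.
total = 5000      # marbles in all
bag_size = 50     # marbles per bag
extras = 50       # marbles that duplicate another marble

# Given where the original sits, 49 of the other 4999 positions share its bag.
p_same_bag = (bag_size - 1) / (total - 1)

# Probability that none of the extras shares a bag with its original.
p_no_match = (1 - p_same_bag) ** extras
p_some_match = 1 - p_no_match

print(round(p_no_match, 3))    # about 0.611
print(round(p_some_match, 3))  # about 0.389
```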
If you made a further wrong assumption that the number of bags with matches had a Poisson distribution with mean about $-\log_e(0.611)\approx 0.493$ then you might find the probabilities of $0$ bags with a match about $0.611$, $1$ bag with a match about $0.301$, $2$ bags about $0.074$, $3$ bags about $0.012$, and $4$ bags about $0.0015$, still close to the simulation values.
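The Poisson figures quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the standard Poisson probability $e^{-\lambda}\lambda^k/k!$ with $\lambda=-\log_e(0.611)$; the helper name `poisson_pmf` is made up for illustration.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of exactly k events for the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

# Mean chosen so that P(0 bags with a match) equals 0.611.
mean = -math.log(0.611)
probs = [poisson_pmf(k, mean) for k in range(5)]

print("mean:", round(mean, 3))
for k, p in enumerate(probs):
    print(k, "bags with a match:", round(p, 4))
```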