I have a question that I was able to answer by running a simulation in R and estimating the probability from it (picture below; the result looks approximately normal), but I was wondering whether there is an exact formula to calculate this as well.
I have two sets A and B. Set A contains 50240 elements and set B contains 16729 elements. All elements of set B are in set A (B is a subset of A). Next I draw 2585 elements from set A without replacement creating set A'. Then I draw 1809 elements from set B again without replacement, creating set B'. Now I want to know how to calculate the probability of getting an overlap between A' and B' of x or more elements.
Expanding the comment a bit. Without loss of generality, suppose $A$ is made of the integers from 1 to 50240, and likewise $B$ contains the integers from 1 to 16729. Now, let's invert the order of the draws (nothing changes, since the two draws are independent) and first extract 1809 balls from $B$.
The hypergeometric distribution gives the probability of extracting $x$ white balls from an urn containing $m$ white balls and $n$ non-white balls when you extract $k$ balls in total without replacement. In your case, after the draw from $B$, we can treat the 1809 extracted balls (the set $B^\prime$) as the white balls and every ball of $A$ not in $B^\prime$ as a non-white ball. Drawing $A^\prime$ is then the draw of $k$ balls from this urn, and the number of white balls it contains is the size of the overlap whose probability you want to calculate.
In R, we have the argument `x`, which ranges from 0 to 1809 (you cannot, of course, have more than 1809 common elements), `m = 1809`, `n = 50240 - 1809` and `k = 2585`. In one line you can calculate the probability for each `x` with `dhyper(x, m, n, k)`.
You will see that outside the range 50 to 130 the probabilities are very close to zero. Plotting this range gives good agreement (at least visually) with your simulation.
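A minimal sketch of the calculation described above, using base R's `dhyper` and `phyper`. The variable names and the threshold `x0 = 100` are illustrative assumptions, not values from the question; replace `x0` with whatever overlap you care about.

```r
# Parameters taken from the question
N_A <- 50240     # size of set A
m   <- 1809      # size of B' (the "white balls")
n   <- N_A - m   # elements of A not in B' (the "non-white balls")
k   <- 2585      # size of A' (number of balls drawn from A)

# P(overlap == x) for every possible overlap x
x <- 0:m
p <- dhyper(x, m, n, k)

# P(overlap >= x0): phyper(q, ..., lower.tail = FALSE) returns P(X > q),
# so we pass x0 - 1 to include x0 itself.
x0 <- 100        # hypothetical threshold, chosen only for illustration
p_tail <- phyper(x0 - 1, m, n, k, lower.tail = FALSE)

# Plot the range where the probabilities are not negligible
# (p is indexed from 1, so overlap 50 is p[51])
plot(50:130, p[51:131], type = "h",
     xlab = "overlap size", ylab = "probability")
```

Here `p_tail` is the exact answer to "an overlap of `x0` or more elements", so it can be compared directly with the fraction of simulated draws that reach at least `x0`.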
If you are interested in the formula behind those probabilities, the Wikipedia article on the hypergeometric distribution contains everything you need.
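For convenience, here is the standard hypergeometric formula written in the notation used above ($m$ white balls, $n$ non-white balls, $k$ draws), with $X$ denoting the size of the overlap and $x_0$ an arbitrary threshold introduced here for illustration:

$$
P(X = x) = \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}},
\qquad
P(X \ge x_0) = \sum_{x = x_0}^{\min(m,\,k)} \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}}
$$

With the numbers from the question, $m = 1809$, $n = 50240 - 1809 = 48431$, $k = 2585$ and $m + n = 50240$, so the sum runs up to $x = 1809$ and the expected overlap is $k\,m/(m+n) = 2585 \cdot 1809 / 50240 \approx 93$.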