Question on statistics & probability: I am trying to find the best way to build a unique key. The data I have: 1) first name 2) last name 3) DOB 4) gender. Assuming those fields are correct (not John Doe and not Jane Doe), what is the probability of a collision within the whole US population? (I definitely don't have the whole US population :-), but I am looking at the possible worst-case scenario.) In other words, how unique is the combination of first name + last name + DOB + gender within the US population (which can be rounded to 300 million)?
2026-03-26 06:03:37.1774505017
Probability of name collision
8.4k Views Asked by Bumbble Comm https://math.techqa.club/user/bumbble-comm/detail At
There are 2 best solutions below
The analysis is incorrect. The name John is far more common than Ferdinand. A weighted probability is necessary. The same is true for last names. Jones is much more common than Sondgeroth. If you are doing a worst case analysis, the values of F and L are much lower than 5,000 and 150,000. The age distribution is not even close to uniformly distributed. It has a peak at around age 40 and a smaller one at around age 10.
What these corrections mean is that the estimated chance of duplicate name and DOB combinations is too low. There could be many duplicates in a population of 300,000,000.
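To illustrate why weighting matters, here is a minimal Python sketch. The counts in it are made up for illustration and are not real name statistics, and the function name `pair_collision_probability` is hypothetical. It relies on the fact that the probability of two independently chosen people sharing a value is the sum of the squared frequencies, which is never smaller than the uniform value of one over the number of distinct values.

```python
def pair_collision_probability(counts):
    """Probability that two independently drawn people share the same value.

    counts: how many people carry each distinct value (e.g. each first name).
    """
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


# Hypothetical counts for 5 distinct names among 1,000 people (illustrative only).
skewed_counts = [500, 300, 100, 50, 50]
uniform_counts = [200] * 5

p_skewed = pair_collision_probability(skewed_counts)
p_uniform = pair_collision_probability(uniform_counts)

# "Effective" number of equally likely names that would give the same collision rate.
effective_names = 1 / p_skewed

print(f"skewed:  {p_skewed:.4f}")
print(f"uniform: {p_uniform:.4f}")
print(f"effective number of names: {effective_names:.2f} (out of 5)")
```

With a skewed distribution the effective number of names is smaller than the actual number of distinct names, which is why a uniform estimate understates the collision probability.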
Let's think about four sets: first names $F$, last names $L$, birth dates $B$ and genders $G$.
A site named howmanyofme.com says:
We are looking for the worst case, so $\#F = 5,000$ and $\#L = 150,000$.
The age of people in the USA is not really uniformly distributed, but let's assume that it is, and that everyone was born within the last 90 years. Then $\#B = 90*365 \approx 33,000$.
The gender is very balanced - uniformly distributed. $\#G = 2$
Now let's calculate the probability that two given people have the same first name, last name, date of birth and gender in the worst case:
$$ p = {1 \over {\#F*\#L*\#B*\#G}} = {1 \over {5,000 * 150,000 * 33,000 * 2}} \approx {1 \over {50,000,000,000,000}} $$
Now the number of possible pairs for $a = 300,000,000$ people:
$$ n = { {a(a-1)} \over 2} \approx 45,000,000,000,000,000 $$
Now, let's iterate through all the possible pairs. For each pair, the probability that you get the "same" two people is $p$ and the probability that you don't is $1-p$. You can get the desired pair on the first try, second try, third try or any of the 45 quadrillion tries. You stop when you find a matching pair. Treating the pairs as independent (an approximation), the probability of getting at least one matching pair is one minus the probability that you get none:
$$ f = 1 - (1-p)^n \approx 1 - e^{-np} \approx 1 - e^{-900} \approx 1 - 10^{-391} \approx 1 = 100\% $$
That is, there is essentially a 100% probability of finding at least one pair of people in the USA with the same name, same date of birth and same gender. In fact, the expected number of such pairs is about $np \approx 900$.
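The arithmetic above can be checked with a short Python sketch. It uses the same assumptions as the text: uniform, independent fields with $\#F = 5,000$, $\#L = 150,000$, $\#B = 90*365$, $\#G = 2$, and pairs treated as independent. The function name `collision_probability` is hypothetical. Because $(1-p)^n$ underflows ordinary floating point, the sketch works with logarithms.

```python
import math


def collision_probability(num_people, num_keys):
    """Approximate P(at least one colliding pair), assuming uniform keys
    and treating all pairs as independent.

    Returns (probability, natural log of P(no collision)).
    """
    pairs = num_people * (num_people - 1) // 2
    p = 1.0 / num_keys
    # log P(no collision) = pairs * log(1 - p); log1p stays accurate for tiny p.
    log_no_collision = pairs * math.log1p(-p)
    # 1 - exp(x), computed without losing precision when x is near 0.
    return -math.expm1(log_no_collision), log_no_collision


# Set sizes taken from the text above (worst-case assumptions).
F, L, B, G = 5_000, 150_000, 90 * 365, 2
num_keys = F * L * B * G
people = 300_000_000

prob, log_none = collision_probability(people, num_keys)
expected_pairs = people * (people - 1) / 2 / num_keys

print(f"distinct keys:            {num_keys:,}")
print(f"expected matching pairs:  {expected_pairs:.0f}")
print(f"log10 P(no collision):    {log_none / math.log(10):.1f}")
print(f"P(at least one collision): {prob}")
```

The code uses the exact value $90*365 = 32,850$ rather than the rounded 33,000, so its figures differ slightly from the hand calculation, but the conclusion is the same: a collision is practically certain under these assumptions.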