Probability of name collision

Question on statistics & probability: I am trying to find the best way to build a unique key. The data I have: 1) first name, 2) last name, 3) DOB, 4) gender. Assuming those fields are correct (not John Doe and not Jane Doe), what is the probability of a collision within the whole US population? (I definitely don't have the whole US population :-), but I am looking at the possible worst-case scenario.) In other words, how unique is the combination of first name + last name + DOB + gender within the US population (which can be rounded to 300 million)?

2
On BEST ANSWER

Let's think about four sets:

  • $F$, a set of first names in use in USA,
  • $L$, a set of last names in use in USA,
  • $B$, dates of birth of people in USA,
  • $G = \{\mathrm{Male, Female}\}$

A site named howmanyofme.com says:

there are at least 151,671 different last names and 5,163 different first names in common use in the United States

We are looking for the worst case, so take $\#F = 5,000$ and $\#L = 150,000$.
The ages of people in the USA are not really uniformly distributed, but let's assume that they are, and that everyone was born within the last 90 years. Then $\#B = 90 \cdot 365 \approx 33,000$.
Gender is very evenly balanced, so assume it is uniformly distributed: $\#G = 2$.
Now let's calculate the probability that two given people have the same first name, last name, date of birth and gender in the worst case:

$$ p = {1 \over {\#F \cdot \#L \cdot \#B \cdot \#G}} = {1 \over {5,000 \cdot 150,000 \cdot 33,000 \cdot 2}} \approx {1 \over {49,500,000,000,000}} $$

The number of possible pairs among $a = 300,000,000$ people is

$$ n = { {a(a-1)} \over 2} \approx 45,000,000,000,000,000 $$

Now let's iterate through all the possible pairs. For each pair, the probability that the two people are the "same" is $p$, and the probability that they are not is $1-p$. A matching pair could turn up on the first try, the second try, or any of the $n \approx 4.5 \cdot 10^{16}$ tries, and you stop when you find one. Treating the pairs as independent (an approximation), the probability of getting at least one matching pair is one minus the probability of getting none:

$$ f = 1 - (1-p)^n \approx 1 - e^{-np} \approx 1 - e^{-909} \approx 1 - 10^{-395} \approx 1 = 100\% $$

That is, there is an essentially 100% probability that at least one pair of people in the USA shares the same name, date of birth and gender. The expected number of such pairs under these assumptions is $np \approx 909$.
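As a rough numerical check of this estimate, the following Python sketch recomputes the figures from the assumed set sizes ($\#F = 5,000$, $\#L = 150,000$, $\#B \approx 33,000$, $\#G = 2$) and a population of 300 million. The variable names are illustrative only, and the pairs are treated as independent, which is an approximation rather than an exact model.

```python
import math

# Set sizes assumed in the answer (uniform, worst-case rounding).
num_first_names = 5_000
num_last_names = 150_000
num_birth_dates = 33_000      # about 90 * 365, rounded as in the answer
num_genders = 2
population = 300_000_000

# Number of distinct (first, last, DOB, gender) keys, all assumed equally likely.
num_keys = num_first_names * num_last_names * num_birth_dates * num_genders
p = 1 / num_keys              # chance that one given pair of people collides

# Number of unordered pairs of people.
num_pairs = population * (population - 1) // 2

# Expected number of colliding pairs.
expected_collisions = num_pairs * p

# log P(no collision) = n * log(1 - p); log1p keeps precision for tiny p.
log_no_collision = num_pairs * math.log1p(-p)
log10_no_collision = log_no_collision / math.log(10)

# P(at least one collision) = 1 - exp(log P(no collision)).
prob_at_least_one = -math.expm1(log_no_collision)

print(f"distinct keys:           {num_keys:,}")
print(f"pairs of people:         {num_pairs:,}")
print(f"expected collisions:     {expected_collisions:.1f}")
print(f"log10 P(no collision):   {log10_no_collision:.1f}")
print(f"P(at least one):         {prob_at_least_one}")
```

Working in logarithms avoids raising a number very close to 1 to an enormous power directly, which would lose all precision in floating point.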

1
On

The analysis is incorrect. The name John is far more common than Ferdinand, so a weighted probability is necessary. The same is true for last names: Jones is much more common than Sondgeroth. If you are doing a worst-case analysis, the effective values of $\#F$ and $\#L$ are much lower than 5,000 and 150,000. The age distribution is also not even close to uniform. It has a peak at around age 40 and a smaller one at around age 10.

These corrections mean that the estimate of the number of duplicated name and DOB combinations is too low. There could be many duplicates in a population of 300,000,000.
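To illustrate why the weighting matters: if name $i$ has relative frequency $q_i$, the probability that two randomly chosen people share a name is $\sum_i q_i^2$, which equals $1/\#F$ only when all names are equally common. The sketch below compares a uniform distribution over 5,000 names with a made-up Zipf-like one (weight proportional to $1/\mathrm{rank}$). The skewed weights are purely hypothetical and are not real US name frequencies.

```python
def collision_probability(weights):
    """Probability that two independent draws pick the same item,
    given non-negative relative weights: sum of squared frequencies."""
    total = sum(weights)
    return sum((w / total) ** 2 for w in weights)


num_names = 5_000

# Every name equally common (the assumption in the first answer).
uniform_weights = [1.0] * num_names

# Hypothetical skewed popularity: the k-th most common name has weight 1/k.
skewed_weights = [1.0 / rank for rank in range(1, num_names + 1)]

p_uniform = collision_probability(uniform_weights)
p_skewed = collision_probability(skewed_weights)

# "Effective" number of names: the size of a uniform set of names
# that would give the same collision probability.
effective_uniform = 1 / p_uniform
effective_skewed = 1 / p_skewed

print(f"uniform: p = {p_uniform:.6f}, effective names = {effective_uniform:.0f}")
print(f"skewed:  p = {p_skewed:.6f}, effective names = {effective_skewed:.0f}")
```

Under the skewed weights the effective number of names comes out far below 5,000, which is the sense in which the uniform estimate understates the number of duplicates.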