How to work out the probability of two random sequences sharing a certain number of matches?

94 Views Asked by At

Pick two sequences of numbers, $S_1$ and $S_2$. $S_1$ is $n_1$ picks from $1$ to $k$, $S_2$ is $n_2$ picks from $1$ to $k$. There could be duplicates within each sequence, for instance $S_1$ might contain the number $471$ twice.

If $S_1$ contains $471$ twice and $S_2$ doesn't contain it, there is no match. If $S_2$ contains $471$ once, there is only one match. If $S_2$ also contains $471$ twice or more, there are two matches. A number from each sequence will match if it's the same as a number in other sequence (order is not important) that is not already matched.

What is the probability $p$ that the two sequences will have exactly $m$ matches between them? Obviously $n_1 \ge m \le n_2$, otherwise $p$ is always $0$.

I can work it out manually for low values of $m$, but once it gets to $3$ I get stuck. It sounds similar to the Birthday Problem.

If $k = 10$, $n_1 = 2$ and $n_2 = 2$, the results are as follows:

$m$ $p$
$0$ $0.657$
$1$ $0.324$
$2$ $0.019$

But it's complicated to work out by hand.

For instance, if $k = 1000$, $n_1 = 99$, $n_2 = 101$, and $m = 10$, what is the probability $p$?

1

There are 1 best solutions below

0
On

The calculations indeed look complicated, especially the minimum-of-two element. You could use simulation. For example using R

matches <- function(k,n1,n2){
  X1 <- sample(k, n1, replace=TRUE)
  X2 <- sample(k, n2, replace=TRUE)
  t1 <- table(factor(X1, levels = 1:k))
  t2 <- table(factor(X2, levels = 1:k))
  m <- ifelse(t1 < t2, t1, t2)
  sum(m)
  }

set.seed(2023)
simmatches <- replicate(10^5, matches(10,2,2))
table(simmatches)

gives

simmatches
    0     1     2 
65512 32547  1941 

close to your calculations with some simulation noise, while

simmatches <- replicate(10^5, matches(1000,99,101))
table(simmatches)

gives

simmatches
    0     1     2     3     4     5     6     7     8     9    10    11    12 
    3    50   282   970  2485  5146  8458 11910 13920 14269 13185 10601  7645 
    
   13    14    15    16    17    18    19    20    21    22 
 5083  3025  1613   732   373   150    66    24     4     6

suggesting the probability for $m=10$ is about $0.13$