What is the probability distribution of Hamming distance on strings with a latent correlation?


Generate $(X, Y)$ by randomly drawing $n$ pairs from a real-valued, bivariate distribution with population correlation $\rho$. Compute $R(X)$ and $R(Y)$, where $R(\cdot)$ takes a sequence of real numbers and replaces each value with its ascending rank (i.e., the smallest $x_i$ becomes $1$, the next-smallest becomes $2$, and so on). Count the number of paired ranks that match, $m = \sum_{i=1}^{n} \mathbf{1}[R(x_i) = R(y_i)]$, which is equivalently $n$ minus the Hamming distance between $R(X)$ and $R(Y)$. If $\rho = 0$, it is known that $m$ is asymptotically Poisson with parameter $\lambda = 1$, i.e., as $n$ goes to infinity. It also follows from Le Cam's theorem that $m$ for non-zero $\rho$ is approximately Poisson with $\lambda = np$, where $p$ is the probability that any one pair will match.
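To make the setup concrete, here is a minimal sketch of how $m$ could be computed. It assumes, purely for illustration, that the pairs come from a bivariate standard normal; the function names `ranks`, `count_matches`, and `draw_pairs` are hypothetical and not taken from any library.

```python
import random

def ranks(values):
    """Return ascending ranks 1..n; ties are not expected for continuous draws."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def count_matches(x, y):
    """m = number of positions i with R(x_i) == R(y_i)."""
    return sum(a == b for a, b in zip(ranks(x), ranks(y)))

def draw_pairs(n, rho, rng):
    """Assumption: bivariate standard normal with correlation rho."""
    s = (1.0 - rho * rho) ** 0.5
    x, y = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x.append(z1)
        y.append(rho * z1 + s * z2)  # corr(x, y) = rho by construction
    return x, y

rng = random.Random(0)
x, y = draw_pairs(100, 0.5, rng)
m = count_matches(x, y)  # the Hamming distance between R(X) and R(Y) is 100 - m
```

Any other continuous bivariate family with correlation $\rho$ could be swapped into `draw_pairs`; the ranking and counting steps do not depend on that choice.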

I would like to know $\lambda$ for non-zero $\rho$, but I am stuck on determining or approximating $p$ from $\rho$ and $n$. I know the formula for the probability of, say, a match when rolling a pair of $n$-sided dice that are otherwise fair but correlated at $\rho$. That formula does not apply here, because the correlation is defined between the continuous $X$ and $Y$, not between the discrete $R(X)$ and $R(Y)$. Converting to ranks will naturally attenuate $\rho$.
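In the absence of a closed form, $\lambda$ and $p = \lambda / n$ can at least be estimated by simulation. The sketch below is a plain Monte Carlo estimate, again assuming a bivariate standard normal for $(X, Y)$; `estimate_lambda` and its parameters `reps` and `seed` are hypothetical names chosen for this example.

```python
import random

def ranks(values):
    """Return ascending ranks 1..n for a list of distinct reals."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def match_count(n, rho, rng):
    """Draw n correlated normal pairs (an assumption) and count matching ranks."""
    s = (1.0 - rho * rho) ** 0.5
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [rho * xi + s * rng.gauss(0.0, 1.0) for xi in x]
    return sum(a == b for a, b in zip(ranks(x), ranks(y)))

def estimate_lambda(n, rho, reps=2000, seed=0):
    """Monte Carlo estimate of lambda = E[m]; the per-pair p is lambda / n."""
    rng = random.Random(seed)
    return sum(match_count(n, rho, rng) for _ in range(reps)) / reps

# Example: how the estimate moves with rho for a fixed n.
for rho in (0.0, 0.5, 0.9):
    lam = estimate_lambda(50, rho)
    print(rho, lam, lam / 50)
```

This only gives a numerical estimate for particular $(n, \rho)$ values under the normality assumption; it does not replace an analytic expression for $p$.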

Thanks!