What is the mean smallest Hamming distance asymptotic to?


Consider a collection $S_n$ of $n$ independent and uniformly distributed binary strings, each of length $n$. Let $f(S_n)$ be the smallest Hamming distance between any pair of strings in $S_n$. I am interested in $\mathbb{E}(f(S_n))$.
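For concreteness, here is a small Monte Carlo sketch of $f(S_n)$ and $\mathbb{E}(f(S_n))$ (the function names are my own; strings are represented as integers so that XOR-and-popcount gives the Hamming distance):

```python
import random
from itertools import combinations

def min_pairwise_distance(strings):
    """f(S_n): smallest Hamming distance over all pairs (strings as ints)."""
    return min(bin(a ^ b).count("1") for a, b in combinations(strings, 2))

def estimate_mean_min_distance(n, trials=2000, seed=0):
    """Monte Carlo estimate of E[f(S_n)]: n uniform n-bit strings per trial."""
    rng = random.Random(seed)
    return sum(
        min_pairwise_distance([rng.getrandbits(n) for _ in range(n)])
        for _ in range(trials)
    ) / trials
```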

Let $X$ be the Hamming distance between a randomly chosen pair of strings in $S_n$. If we let $\mu = \frac{n}{2}$ then we know by the Chernoff bound that:

$$P( X \le (1-\delta)\mu) \le e^{-\frac{\delta^2\mu}{2}}, \qquad 0 \le \delta \le 1.$$

and

$$P( X \ge (1+\delta)\mu)\le e^{-\frac{\delta^2\mu}{2+\delta}}, \qquad \delta \ge 0.$$
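As a quick sanity check of the lower-tail bound (a sketch; the values of $n$ and $\delta$ are chosen arbitrarily by me), one can compare it against the exact Binomial tail:

```python
import math

def binom_cdf(n, k):
    """Exact P(X <= k) for X ~ Binomial(n, 1/2)."""
    return sum(math.comb(n, j) for j in range(k + 1)) / 2**n

n, delta = 100, 0.2
mu = n / 2
exact = binom_cdf(n, int((1 - delta) * mu))  # P(X <= 40) here
chernoff = math.exp(-delta**2 * mu / 2)      # e^{-delta^2 mu / 2} = e^{-1}
# The bound holds (exact ~ 0.03 < 0.37 ~ chernoff) but is quite loose.
```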

Therefore we may suspect that $\mathbb{E}(f(S_n))$ will be close to $\frac{n}{2}$ for large $n$.

For small $n$, we have e.g. $\mathbb{E}(f(S_4)) = \frac{3}{4}$.
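The value $\mathbb{E}(f(S_4)) = \frac34$ can be verified by brute force over all $16^4$ equally likely 4-tuples of 4-bit strings (a sketch; the identifiers are mine):

```python
from fractions import Fraction
from itertools import combinations, product

def min_dist(tup):
    """Smallest pairwise Hamming distance within one tuple of ints."""
    return min(bin(a ^ b).count("1") for a, b in combinations(tup, 2))

# Sum f over every equally likely configuration of 4 strings of length 4.
total = sum(min_dist(t) for t in product(range(16), repeat=4))
mean = Fraction(total, 16 ** 4)  # exact expectation
assert mean == Fraction(3, 4)
```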

My question is this:

What is $\mathbb{E}(f(S_n))$ asymptotic to?

I am particularly interested in the lower order terms.



Between 2 and 100, a fair estimate seems to be $$\mathbb{E}(f(S_n))\approx \frac n2+1-\frac12\sqrt{\pi n\ln(n+1)}$$ Here is the graph after $n/2$ has been subtracted from both sides: [plot of $\mathbb{E}(f(S_n))-\frac n2$ against $n$, together with the fitted curve]

It may be possible to derive that formula or a similar one by assuming all of the $n\choose2$ distances are independent and normally distributed.
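One can compare the fitted formula against a direct simulation (a rough check at a single, arbitrarily chosen $n$; the trial count and tolerance are mine):

```python
import math
import random
from itertools import combinations

def min_pairwise_distance(strings):
    """Smallest pairwise Hamming distance, strings encoded as ints."""
    return min(bin(a ^ b).count("1") for a, b in combinations(strings, 2))

def simulate_mean(n, trials, seed=1):
    """Monte Carlo estimate of E[f(S_n)]."""
    rng = random.Random(seed)
    return sum(
        min_pairwise_distance([rng.getrandbits(n) for _ in range(n)])
        for _ in range(trials)
    ) / trials

n = 30
fitted = n / 2 + 1 - 0.5 * math.sqrt(math.pi * n * math.log(n + 1))
simulated = simulate_mean(n, trials=400)
# For n in this range the two should agree to within roughly 1.
```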


I'm surprised nobody more qualified than me has attempted a solution. Here's my attempt. It is non-rigorous in a few places and I will point them out.

As the OP implies, the linear term of $E[f(S_n)]$ is $\frac n2$. The min distance is obviously going to be less than $\frac n2$, so there is a gap $g(n)$, i.e. $E[f(S_n)] = \frac n2 - g(n)$ where $g(n)$ is positive and sub-linear.

Claim: for any constant $c \in \mathbb{R}$, the gap $g(n) > c\sqrt{n}$ asymptotically.
(Note that this is weaker than Empy2's experimental finding of $g(n) \approx c\sqrt{n \ln n}$.)

My approach is to find $h(n)$ s.t. $P(f(S_n) > \frac n2 - h(n)) \to 0$. This would imply (up to vanishing correction terms) $E[f(S_n)] \le \frac n2 - h(n)$, which provides a lower bound on the gap: $g(n) = \frac n2 - E[f(S_n)] \ge h(n)$.

Let $D_{ij}$ denote the distance between string $i$ and string $j$, for $1 \le i < j \le n$. Note that all $D_{ij}$ are identically distributed as $D\sim Binomial(n, \frac12)$, although they are not independent. We have:

$$ \begin{array}{} P(f(S_n) > \frac n2 - h(n)) &=& P(\bigcap_{i<j} ~(D_{ij} > \frac n2 - h(n)))\\ &\overset{?}{<}& \prod_{i < j}~ P(D_{ij} > \frac n2 - h(n)) \\ &=& P(D > \frac n2 - h(n))^{n(n-1)/2} \\ &:=& a \end{array} $$

The "$\overset{?}{<}$" step is the biggest non-rigorous hole in my approach. The intuition is that having some $D_{ij} > \frac n2 - h(n)$ would "space out" strings $i$ and $j$, making it harder for another pair to have $D_{i'j'} > \frac n2 - h(n)$; i.e. the two events are "anti-correlated", so the probability of the conjunction should be at most what it would be for independent events, namely the product of the individual probabilities.

Anyway, the next non-rigorous step is to use the Gaussian approximation to the Binomial. We have $Var(D) = \frac n4$, so approximately $\frac Dn \sim N(\frac 12, \sigma= {1 \over 2\sqrt{n}})$.

$$ \begin{array}{} b := P(D > \frac n2 - h(n)) &=& P(\frac Dn - \frac 12 > -{h(n) \over n})\\ &=& 1 - P(\frac Dn - \frac 12 \le -{h(n)\over n})\\ &\approx& 1 - \Phi(-{h(n) / n \over \sigma})\\ &=& 1 - \Phi(-{2 h(n) \over \sqrt{n}})\\ &:=& 1 - r(n) \end{array} $$

where $\Phi(x) = P(N(0,1) \le x)$ is the cdf of a standard normal variable.
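The quality of this Gaussian step can be checked numerically, comparing the exact Binomial tail with $1 - \Phi(-2h(n)/\sqrt n)$ (a sketch; I pick $h(n)=\sqrt n$ and use the error-function form of $\Phi$):

```python
import math

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binom_tail_gt(n, x):
    """Exact P(D > x) for D ~ Binomial(n, 1/2)."""
    k = math.floor(x)
    return sum(math.comb(n, j) for j in range(k + 1, n + 1)) / 2**n

n = 400
h = math.sqrt(n)                               # h(n) = sqrt(n), i.e. c = 1
b_exact = binom_tail_gt(n, n / 2 - h)          # P(D > n/2 - h(n))
b_gauss = 1.0 - Phi(-2.0 * h / math.sqrt(n))   # = 1 - Phi(-2) here
```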

Note that $-{2 h(n) \over \sqrt{n}} < 0 \implies r(n) := \Phi(-{2 h(n) \over \sqrt{n}}) \in (0, \frac12) \implies b \in (\frac12, 1)$, and in fact for interesting choices of $h(n)$ we will have $r(n) \to 0, b \to 1$. However, all we need is that $a = b^{n(n-1)/2} \to 0$, and this can be achieved by having $r(n) \to 0, b \to 1$ slowly enough. But first, we can now prove:

Claim: for any constant $c \in \mathbb{R}$, the gap $g(n) > c\sqrt{n}$ asymptotically.

Proof: take $h(n) = c\sqrt{n} \implies b = 1 - \Phi(-2c) = $ some constant $< 1 \implies a = b^{n(n-1)/2} \to 0. ~\square$
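Numerically, with $c=1$ the constant is $b = 1-\Phi(-2) \approx 0.977$, and $a = b^{n(n-1)/2}$ collapses quickly (a sketch of the arithmetic in the proof; the sample values of $n$ are mine):

```python
import math

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

c = 1.0
b = 1.0 - Phi(-2.0 * c)  # a constant strictly below 1
a_values = {n: b ** (n * (n - 1) / 2) for n in (10, 20, 40, 80)}
# b ~ 0.977; a drops from ~0.36 at n=10 toward 0 as n grows.
```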

Further thoughts: When $h(n) = c\sqrt{n}$, we have $r(n) = \Phi(-2c)=$ a constant $> 0$. More interestingly, we can increase $h(n)$ beyond $\sqrt{n}$ and let $r(n) \to 0$. One way to do that and still have $a \to 0$ is to have $t(n) := {1 \over r(n)}$ grow slower than $n^2$, i.e. in the little-o notation:

$$t(n) := {1 \over r(n)} = o(n^2)$$

because if the above holds, then $n^2 / t(n) \to \infty$ and:

$$a = (1 - r(n))^{n(n-1)/2} = \big( (1-{1 \over t(n)})^{t(n)} \big)^{n(n-1)/2t(n)} \to e^{-n(n-1)/2t(n)} \to 0$$
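For instance (my own choice of $t$), taking $t(n) = n^{3/2} = o(n^2)$, one can check numerically that $a$ tracks the limiting form $e^{-n(n-1)/2t(n)}$ and that both tend to 0:

```python
import math

def a_exact(n, t):
    """a = (1 - 1/t)^{n(n-1)/2} computed directly."""
    return (1.0 - 1.0 / t) ** (n * (n - 1) / 2)

def a_limit(n, t):
    """Limiting form e^{-n(n-1)/(2t)}."""
    return math.exp(-n * (n - 1) / (2.0 * t))

# With t(n) = n^{3/2}, the two expressions agree and both tend to 0.
pairs = [(a_exact(n, n**1.5), a_limit(n, n**1.5)) for n in (100, 1000, 10000)]
```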

At this point, the problem becomes bounding the tail of a Gaussian $r(n) = \Phi(-{2h(n) \over \sqrt{n}})$. I'm sure lots of work has been done on this issue, and I am completely out of my depth.

One weird aspect is that in this approach, we need to lower bound the Gaussian tail: we need $r(n) \to 0$ slowly enough that $a \to 0$. So e.g. Chebyshev's inequality, which gives the upper bound

$$r(n) = \Phi(-{2h(n) \over \sqrt{n}}) \le \frac12 {n \over (2 h(n))^2}$$

is in the wrong direction and does not help (at least not in this direct way).
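For completeness, a standard two-sided estimate on the Gaussian tail (the well-known Mills-ratio bounds, which are not from the answer above) is $\frac{x}{1+x^2}\,\varphi(x) \le \Phi(-x) \le \frac{\varphi(x)}{x}$ for $x > 0$, where $\varphi$ is the standard normal density; the left-hand side is exactly the kind of lower bound needed here. A numeric check:

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (0.5, 1.0, 2.0, 4.0):
    lower = x / (1.0 + x * x) * phi(x)  # Mills-ratio lower bound
    upper = phi(x) / x                  # Mills-ratio upper bound
    assert lower <= Phi(-x) <= upper
```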