Average length of longest duplicated substring in a random binary string of length N

46 Views Asked by At

What is the average length of longest substring occurring at more than one position in a uniformly random binary string S of length N ?

For example,

n  answer
1  0/2
2  2/4
3  10/8
4  26/16

For n=4

0000  3
0001  2
0010  1
0011  1
0100  1
0101  2
0110  1
0111  2
1000  2
1001  1
1010  2
1011  1
1100  1
1101  1
1110  2
1111  3
-------
      26

Generalized version: in stead of a binary string, S is a uniformly random string of length N with alphabet of size M.

1

There are 1 best solutions below

0
On

By Wikipedia: Longest repeated substring problem , we need to study the average depth of suffix trees of random binary strings. By Analysis of the average depth in a suffix tree under a Markov model, the average depth is $O(\log(N)/h)$, where $h$ is the entropy of the process producing the random bits (and the logarithm is probably base $2$ although all logarithms of $N$ are equivalent under a big-O). If the process produces both 0 and 1 bits uniformly at random, the entropy is $1$ (measured in bits), so $O(\log(N))$. Of course, if the bits aren't uniformly randomly generated, the entropy is lower, the expected depth of the suffix tree increases, and the expected length of the longest repeated substring increases.

Your generalized version is also covered in the paper. Notice that a larger alphabet produces greater entropy produces shorter repeated substrings.