Counting the expected number of strings with a given contiguous substring.

904 Views Asked by At

Question:

Let a random bit string $x$ of length $n$ be given.

What is the expected number of bit strings $w$ of length $2n$ that contain $x$ as a contiguous substring?

What I know:

For any bit string $x$ of length $n$, the total number of length $2n$ bit strings $w$ that contain $x$ as a contiguous substring is between $2^n$ and $(n+1) \cdot 2^n$. The exact number may depend on the string $x$ that you pick.

To the point:

Is the expected number $\Theta(2^n)$, or $\Theta(n \cdot 2^n)$, or something in between?

1

There are 1 best solutions below

5
On BEST ANSWER

The expected number is $\Theta(n\, 2^n)$. The OP observes that this is an upper bound, so we need to show it is a lower bound.

In one sentence, the proof is: $x$ probably doesn't overlap itself, so you can only fit $O(1)$ copies of $x$ into any one $2n$-long string, so the occurrences of $x$ as substrings must be distributed over lots of $2n$-long strings.

Here are more details; I'm going to ignore some ceilings and floors for simplicity. Fix an $n$-long string $x$; let $s(x)$ denote the smallest "shift amount" $s$ such that $x$ "overlaps" with itself at a shift of $s$; that is, the $(n-s)$-prefix of $x$ agrees with the $(n-s)$-suffix of $x$. (Set $s(x)=n$ if no such shift amount exists.) Then a $2n$-long string can only contain $1+n/s(x)$ copies of $x$.

The total number of $n$-long strings $x$ with $s(x)\le k$ is at most $2^1+2^2+\cdots+2^{k}<2^{k+1}$. So the probability that $s(x)<n/2$ is $O(2^{-n/2})$. Thus, with probability at least $1-O(2^{-n/2})$, the maximum number of occurrences of $x$ in any $2n$-long string is at most $1+n/(n/2)=3$.

Finally, the total number of pairs $(w,i)$ such that $w$ is a $2n$-long string, in which $x$ occurs as a substring at position $i$, is $(n+1)2^n$. The previous paragraph shows that these occurrences of $x$ as substrings are spread out: with probability at least $1-O(2^{-n/2})$, $x$ occurs in at least $(n+1)2^n /3$ of the words $w$. Hence the expectation, taken over $x$, of the number of words $w$ containing $x$, is at least $$(1-O(2^{-n/2}))(n+1)2^n /3= \Omega(n\, 2^n)$$ as required.