Probability of encountering a control string in random data

131 Views Asked by At

I'm writing a program that diffs two binary files with a common ancestor, where subsequent insertions/deletions/alterations have been made. While writing the program, I got to thinking about this statistical question:

$B$ is an arbitrary length stream of random bits .

$B_1$ is an array which partitions $B$ into $N$-bit long segments, where $B_1[0]$ is the first $N$ bits of $B$, $B_1[1]$ is the next $N$ bits of $B$, etc.

Given an $N$-bit long "control sequence" $C$, we have two probabilities:

$P_1$ is the probability that any of the first $M$ entries in $B_1$ matches $C$.

$P_2$ is the probability that $C$ is a substring of the first $N+M-1$ bits of $B$.

Are $P_1$ and $P_2$ equivalent, or is one more probable than the other? My gut feeling is that the two are equivalent, but I can't figure out how to prove it.

1

There are 1 best solutions below

1
On BEST ANSWER

In general it depends on the string. $P_1$ does not depend on your chosen string. It can be expressed as $1-\left(1-2^{-N}\right)^M$. $2^{-N}$ is the probability of matching each element of $B_1$, so $\left(1-2^{N}\right)^M$ is the probability of not matching any of the $M$ elements of the array.

Consider for simplicity the case where $M=N=2$. In this case $P_1= 1-\left(1-\frac 14\right)^2 = \frac 7{16}$. You can check this for both strings (if you like) by counting the 16 possible $B$s.

But we can't have $P_1=P_2$ because $N+M-1 =3$ and there are only $8$ choices for the first three bits, so $P_2$ must be expressible as $\frac n8$.

We can count the sequences for each possible $C$. For $C=00$ the matching subsequences are $000$, $001$, $100$ so the probability $P_2=\frac 38 < \frac7{16}$.

For $C=01$ the matching subsequences are $001$, $010$, $011$, $101$ so the probability $P_2=\frac 48 > \frac7{16}$.

Notice that in each case $C$ appears four times in total out of the possible length three strings. For the case of $01$, $C$ appears once in each of the four matching strings. $00$ appears in only three sequences, but it appears twice in $000$!

In fact the rule is that the average number of matches is the same in both the $P_1$ and $P_2$ cases. (In general $\frac M{2^N}$) But the probability of seeing at least one match depends on the probability of seeing the string $C$ appear in multiple cases. For $P_1$ that doesn't depend on the string, but for $P_2$ strings with patterns are more likely to appear two or more times (because they can overlap) so the probability of matching at least once is smaller than for strings that don't repeat.