Probability of Binary Word Matching

521 Views Asked by At

Thanks for reading. Would be lovely if somebody could help me out on this (but not just post the answer) but also how you got there. I'm a programmer and I've ran across this problem which I can't solve. Nobody taught me mathematical probability so I really only know the self-taught basics, please keep that in mind.

What is the probability of finding a binary word of length N within another binary word of length M where of course M $\geq$ N?

Under the assumption that N = 1, I've came up with this solution $p = 1-(\frac{1}{ 2})^M$ .. yeah very impressive.

How do I incoorporate the length of the first word N into this or do I have a completely incorrect approach?

[Criteria]

Lets assume that X can be either 1 or 0 then XX has $2^2=4$ combinations {(0,0), (0,1), (1,0), (1,1)}. The probability of finding either 1 or 0 in this word would be 0.75 (so it doesn't matter if it's found in the first or the second character). The probability of finding 11 for example would be 0.25

What matters is that the first word has to exist continually within the second. For instance we match 11 with a word of length 3 (XXX) with 8 combinations, matches would be at (011) (110) and (111) so the probability would be 0.375.

However match(11, 101) this is not allowed due to the zero in between, although (1, 101) is allowed.

Hope I've explained this clear enough without knowing the adequate terminology. P.S. since N > M makes no sense I assume a fraction to appear somewhere by sheer intuition.

2

There are 2 best solutions below

6
On BEST ANSWER

Sorry but I found that the solution presented here isn't correct. I did try to simulate and confirm the formula proposed by user @Logophobic using random trials of a M bits vector.

Here what I did:

  1. For each random realization of a vetor of size M I did a search to find at least one pattern inside the random generated vector.
  2. I repeated the procedure K times and then I counted all matching results and divided them by the number of trials I did executed. This would give for a large number of trials a near approximation of the real value of the probability of having a pattern inside a vector of size M.

I found that simulated results are not exact but have a good precision. For M>30 it would need more trials to avoid larger errors.

I also verified some results by direct brute force all possible 5 bits pattern occurences inside a M bit vector to verify my simulation for M<=15.

Simulation results

M = 5, prob = 0.031612, HitCount = 31612
M = 6, prob = 0.046956, HitCount = 46956
M = 7, prob = 0.062553, HitCount = 62553
M = 8, prob = 0.077884, HitCount = 77884
M = 9, prob = 0.094238, HitCount = 94238
M = 10, prob = 0.109726, HitCount = 109726
M = 11, prob = 0.124648, HitCount = 124648
M = 12, prob = 0.139473, HitCount = 139473
M = 13, prob = 0.153750, HitCount = 153750
M = 14, prob = 0.168943, HitCount = 168943
M = 15, prob = 0.181898, HitCount = 181898
M = 16, prob = 0.196280, HitCount = 196280
M = 17, prob = 0.210253, HitCount = 210253
M = 18, prob = 0.223612, HitCount = 223612
M = 19, prob = 0.237499, HitCount = 237499
M = 20, prob = 0.249648, HitCount = 249648
M = 21, prob = 0.263080, HitCount = 263080
M = 22, prob = 0.274135, HitCount = 274135
M = 23, prob = 0.287609, HitCount = 287609
M = 24, prob = 0.299382, HitCount = 299382
M = 25, prob = 0.311449, HitCount = 311449
M = 26, prob = 0.323036, HitCount = 323036
M = 27, prob = 0.334595, HitCount = 334595
M = 28, prob = 0.346705, HitCount = 346705
M = 29, prob = 0.356554, HitCount = 356554
M = 30, prob = 0.368210, HitCount = 368210

where $K=10^6$ is the number of trials used in each simulation. The probability is simply $p= HitCount * (\frac{1}{2})^K$, where $HitCount$ is the number of times that a pattern is found in a vector.

Formula results

(obtained with user @Logophobic formula)

$1-(1-\frac{1}{2^N})^{M-N+1}$, with $N=5$

Formula: M = 5, prob = 0.031250
Formula: M = 6, prob = 0.061523
Formula: M = 7, prob = 0.090851
Formula: M = 8, prob = 0.119262
Formula: M = 9, prob = 0.146785
Formula: M = 10, prob = 0.173448
Formula: M = 11, prob = 0.199278
Formula: M = 12, prob = 0.224300
Formula: M = 13, prob = 0.248541
Formula: M = 14, prob = 0.272024
Formula: M = 15, prob = 0.294773
Formula: M = 16, prob = 0.316811
Formula: M = 17, prob = 0.338161
Formula: M = 18, prob = 0.358844
Formula: M = 19, prob = 0.378880
Formula: M = 20, prob = 0.398290
Formula: M = 21, prob = 0.417093
Formula: M = 22, prob = 0.435309
Formula: M = 23, prob = 0.452956
Formula: M = 24, prob = 0.470051
Formula: M = 25, prob = 0.486612
Formula: M = 26, prob = 0.502655
Formula: M = 27, prob = 0.518197
Formula: M = 28, prob = 0.533253
Formula: M = 29, prob = 0.547839
Formula: M = 30, prob = 0.561969

So I would like to ask if someone has a formula that matches my simulation results (as I am confident that it is the correct answer for N=5). I also compared my simulations with some brute force simulation to try all possible combinations for M <=18 (larger values are impraticable to simulate).

Brute force

M = 5, Brute Force prob = 0.031250, HitCount = 1
M = 6, Brute Force prob = 0.046875, HitCount = 3
M = 7, Brute Force prob = 0.062500, HitCount = 8
M = 8, Brute Force prob = 0.078125, HitCount = 20
M = 9, Brute Force prob = 0.093750, HitCount = 48
M = 10, Brute Force prob = 0.109375, HitCount = 112
M = 11, Brute Force prob = 0.124512, HitCount = 255
M = 12, Brute Force prob = 0.139404, HitCount = 571
M = 13, Brute Force prob = 0.154053, HitCount = 1262
M = 14, Brute Force prob = 0.168457, HitCount = 2760
M = 15, Brute Force prob = 0.182617, HitCount = 5984
M = 16, Brute Force prob = 0.196533, HitCount = 12880
M = 17, Brute Force prob = 0.210213, HitCount = 27553
M = 18, Brute Force prob = 0.223660, HitCount = 58631

To confirm this results $prob = HitCount*p^M$ where $p=1/2$ (probability of a single bit is uniform).

All simulations are made considering N=5.

Some graphs: 1 2

1
On

The probability of a binary word of length $N$ occurring at position $i$ within another binary word of length $M\ge N$, where $1\le i\le M-N+1$, is $$P(N_i)=\frac{1}{2^N}$$ The probability of not finding word $N$ at position $i$ within word $M$ is $$P(\lnot N_i)=1-\frac{1}{2^N}$$ The probability of not finding word $N$ anywhere within word $M$ is $$P(\lnot N)=(1-\frac{1}{2^N})^{M-N+1}$$ Finally, the probability of finding word $N$ somewhere within word $M$ is $$P(N)=1-(1-\frac{1}{2^N})^{M-N+1}$$ Note that, for $N=1$, this equates to $1-(1-\frac{1}{2})^M=1-(\frac{1}{2})^M$ as you found.