How many times can a word appear in a random phrase?

438 Views Asked by At

I'm taking about finding a word in a larger set of characters.

Lets say, what is the probability/chance of finding the word 'math' in a random 8 length phrase.

(For example gjbmdlep does not have math in dhhjmath does!)

I assume it's something like (ignoring caps) 26 to power of 4? -since that's chance for m AND a AND t AND h. Then there's the several placements of math in the word (starting from 1st character, 2nd, 3rd, 4th, 5th, not 6th though etc.)

so 26^4 / 5? Would (1 over that) that be the chance in finding the word math in a random 8 letter phrase?

A follow up question, in every possible iteration of an 8 length phrase how many will contain 'math'? Is it just as simple as finding amount of possible iterations (26^8) and dividing by previous answer?

Any help would be greatly appreciated thanks!

2

There are 2 best solutions below

4
On

If you consider random words, where every letter is equally likely and independent, then your question is solvable with basic combinatorics, because every word has the same probability.

There are $26^8$ words with length $8$ over an alphabet with 26 letters.
There are $5$ positions, where your word "math" can start and the remaining ("uninteresting") $4$ letters gives $5\cdot 26^4$ possibilities. In this case, we have to take care about double-counted words. Here, the word MATHMATH is counted twice in the $5\cdot 26^4$ possibilities, because MATH starts at more then one position (the "uninteresting" letters can form the word MATH as well).

Put it together and you get the probability $\Large\frac{5\cdot 26^4-1}{26^8}$ in your scenario.

Note that in the general case (the word length is $n$ instead of $8$), the problem becomes more difficult because there are more double-counted words. For example the word length $9$ gives us $6\cdot 26^5$ possibilities in a similar approach as above, but there are words $MATHxMATH$ and $MATHMATHx$ here, which are double-counted ($x$ is an arbitrary letter in our alphabet). With alphabet size $4$, this gives $4\cdot 2=8$ double counted-words in this case. With increasing word length $n\ge 12$, words even can be counted more than twice in this approach.

0
On

You need to sum over all possible ways that a $m$-character string can appear in a $n$ character phrase, over an alphabet of size $M$. There are $M^n$ possible words in total. If the string appears at position $i$ (thus, $0 \leq i \leq n - m$, meaning there are $n-m + 1$ possible positions), then there are $i$ random letters in front of the string, followed by $m$ fixed letters, followed by $n - (i + m)$ letters at the end. Thus, independent of the position, you get to choose $n - m$ random letters.

It follows that $(n - m + 1)M^{n - m}$ out of a total of $M^n$ phrases contain the string, i.e. the probability is $$ p = \frac{(n - m + 1)M^{n - m}}{M^n} = (n - m + 1)M^{-m} \text{.} $$

In your case, $n=8$, $m=4$, $M=26$, and you get $$ p = (8 - 4 + 1)26^{-4} = \frac{5}{26^4} \text{.} $$