Finding the expected value in the given problem.


It is given that a monkey types on a 26-key keyboard whose keys are the lowercase English letters. Each letter is chosen independently and uniformly at random. If the monkey types 1,000,000 letters, what is the expected number of times the sequence "proof" appears?

Here is the suggested solution:

Let the random variable $X_i = 1$ if the word "proof" appears starting at index $i$, and $X_i = 0$ otherwise. Let $n = 1{,}000{,}000$.
Hence the expected number of appearances of the word is:

\begin{align} &E\left[ \sum\limits _{i=1}^{n-4}X_i \right] & & \text{Eqn 1} \\ &= \sum\limits _{i=1}^{n-4}E[X_i] & & \text{Eqn 2} \end{align}

Now $E[X_i] = 26^{-5}$.

Hence expected number of appearances = $(n-4)\cdot26^{-5}$

(Note that the upper limit is $n-4$ because the word "proof" has length $5$, so an occurrence can start no later than index $n-4$ if it is to finish at index $n$.)
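The formula can be sanity-checked with a short Monte Carlo run. A sketch, using a smaller alphabet and a shorter word so the simulation converges quickly (all parameter values below are my own choices, not from the problem):

```python
import random

def expected_count(n, alphabet_size, word_len):
    # Linearity of expectation: (n - word_len + 1) starting positions,
    # each matching the word with probability alphabet_size**(-word_len).
    return (n - word_len + 1) / alphabet_size**word_len

def simulate(n, alphabet, word, trials, seed=0):
    # Average occurrence count over many random texts of length n.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        text = "".join(rng.choice(alphabet) for _ in range(n))
        # Count occurrences by checking every start index.
        total += sum(text[i:i + len(word)] == word
                     for i in range(n - len(word) + 1))
    return total / trials

n, alphabet, word = 50, "prof", "pro"
print(expected_count(n, len(alphabet), len(word)))   # (50-2) / 4**3 = 0.75
print(simulate(n, alphabet, word, trials=20000))     # close to 0.75
```

The analogue of the question's answer is $(n-2)\cdot 4^{-3}$ here, and the simulated average should land close to it.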

Now here is my doubt. We know that if $X_i = 1$, then the next few random variables, $X_{i+1}$, $X_{i+2}$, etc., cannot be $1$, because you cannot have the word "proof" starting at index $i$ and another "proof" starting at index $i+1$. Where in this proof have we imposed that restriction?

There are 4 answers below.

Best answer:

To simplify the subscripts, I'm going to consider just the first few letters the monkey types. Add a variable $k$ to all my subscripts if you want to apply the reasoning at an arbitrary point in the string.

It is true that $E(X_1) = 26^{-5}$ and also that $E(X_2) = 26^{-5}$. We just have to cycle through the $26^6$ equally-likely possibilities for the first six letters the monkey types. Of these, $26$ are strings of the form "proof_" and another $26$ are of the form "_proof", where the blank is filled in with some letter from a to z.
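This counting argument can be replayed exactly in miniature, with a 3-letter alphabet and the word "ab" standing in for the 26 letters and "proof" (my own stand-ins):

```python
from itertools import product
from fractions import Fraction

alphabet, word = "abc", "ab"     # stand-ins for the 26 letters and "proof"
k = len(word)

# Mirror the answer's argument: cycle through all |alphabet|**(k+1)
# equally likely strings of length k+1 (the "first six letters").
strings = ["".join(s) for s in product(alphabet, repeat=k + 1)]
e_x1 = Fraction(sum(s[:k] == word for s in strings), len(strings))  # "ab_" strings
e_x2 = Fraction(sum(s[1:] == word for s in strings), len(strings))  # "_ab" strings

print(e_x1, e_x2)    # both equal 1/9 = 3**-k
```

Of the $3^3 = 27$ strings, $3$ have the form "ab_" and another $3$ have the form "_ab", so both expectations come out to $3^{-2}$, exactly as $26^{-5}$ does in the original setting.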

What you've observed is that when we condition the expectation of $X_2$ on the value of $X_1$, we get a result that is different from the ordinary (not conditioned) expectation. Specifically, $E(X_2 \mid X_1 = 1) = 0$, which is less than $26^{-5}$.

Since the method you used applies a theorem of probability (linearity of expectation) whose required assumptions are satisfied here (in particular, the expectation of each $X_i$ exists), something else must be going on that somehow "balances out" the fact that observing $X_1 = 1$ lowers the expected value of $X_2$.

And in fact there is something else going on.

We only observe $X_1 = 1$ once, on average, for each $26^5$ times we let a monkey type a million letters. The other $26^5 - 1$ times (on average) that we do this, we observe $X_1 = 0$. In just $26$ of those $26^5 - 1$ times (on average), the first five letters typed by the monkey will be a string of the form "_proo", where the blank is filled by a letter from a to z. In those cases, there is a $1/26$ probability (conditioned on the observed data) that the sixth letter will be f and that $X_2$ will be $1$, that is, $$P(X_2 = 1 \mid \text{letters 2 through 5 are "proo"}) = 1/26.$$ In the other $26^5 - 27$ cases, there is zero probability (conditioned on the observed data) that $X_2 = 1$.

Let $A$ be the event that the first five letters have the form "_proo", and $B$ the event that the first five letters are neither "proof" nor anything of the form "_proo". Then the expectation of $X_2$ conditioned on the observation $X_1 = 0$ is \begin{align} E(X_2 \mid X_1 = 0) &= 0 \cdot P(X_2 = 0 \mid A) \, P(A \mid X_1 = 0) \\ & \qquad {} + 1 \cdot P(X_2 = 1 \mid A) \, P(A \mid X_1 = 0) \\ & \qquad {} + 0 \cdot P(X_2 = 0 \mid B) \, P(B \mid X_1 = 0) \\ & \qquad {} + 1 \cdot P(X_2 = 1 \mid B) \, P(B \mid X_1 = 0) \\ &= P(X_2 = 1 \mid A) \, P(A \mid X_1 = 0) \\ & \qquad {} + P(X_2 = 1 \mid B) \, P(B \mid X_1 = 0) \\ &= \frac{1}{26} \left( \frac{26}{26^5 - 1} \right) + 0 \cdot P(B \mid X_1 = 0) \\ &= \frac{1}{26^5 - 1} \end{align}

This is ever so slightly greater than the unconditional expectation, $26^{-5}$. In fact it is just large enough that \begin{align} E(X_2) &= E(X_2 | X_1 = 0)\, P(X_1 = 0) + E(X_2 | X_1 = 1)\, P(X_1 = 1) \\ &= \frac{1}{26^5 - 1} \left( \frac{26^5 - 1}{26^5} \right) + 0 \cdot P(X_1 = 1) \\ &= 26^{-5}. \end{align}

In summary, the fact that $X_1 = 1$ forces $X_2 = 0$ is balanced by the fact that $X_1 = 0$ gives a tiny boost to the probability that $X_2 = 1$.
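This balancing act can be verified exactly in miniature. A sketch with a 3-letter alphabet and the word "ab" (my stand-ins; like "proof", "ab" cannot overlap itself, so the analogue of $1/(26^5-1)$ is $1/(3^2-1)$):

```python
from itertools import product
from fractions import Fraction

alphabet, word = "abc", "ab"   # "ab", like "proof", cannot overlap itself
k = len(word)
strings = ["".join(s) for s in product(alphabet, repeat=k + 1)]

x1 = [s[:k] == word for s in strings]       # word starts at index 1
x2 = [s[1:] == word for s in strings]       # word starts at index 2

# Exact conditional expectation E(X_2 | X_1 = 0)
e_x2_given = Fraction(sum(b for a, b in zip(x1, x2) if not a),
                      sum(not a for a in x1))
print(e_x2_given)                           # 1/(3**k - 1) = 1/8

# Tower property: E(X_2) = E(X_2 | X_1 = 0) P(X_1 = 0) + 0 * P(X_1 = 1)
p0 = Fraction(sum(not a for a in x1), len(strings))
print(e_x2_given * p0)                      # back to 3**-k = 1/9
```

The conditional expectation comes out slightly above the unconditional $1/9$, and multiplying by $P(X_1 = 0)$ recovers $1/9$ exactly, mirroring the computation above.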

You could do a similar analysis for the effect of $X_1$ on $X_3$, $X_4$, and $X_5$.

The total effect is that while an occurrence of "proof" at one position rules out occurrences at several nearby positions, each place where "proof" does not occur causes "proof" to be a little more likely than usual to occur in nearby positions. For example, "proof" at position $1$ rules out "proof" at position $5$, but it raises the probability from $26^{-5}$ to $26^{-4}$ that the string at position $5$ will be "fproo", which in turn gives a relatively very high probability ($1/26$) that "proof" will occur at position $6$.
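The boost described here can be checked exactly in the same miniature setting; `chain` below plays the role of "fproo" (the word's last letter followed by its first letters, which is what sits at position $k$ when the word occurred at position $1$):

```python
from itertools import product
from fractions import Fraction

alphabet, word = "abc", "ab"        # small analog of the 26 letters / "proof"
k = len(word)
# Analog of "fproo": last letter of the word followed by its first k-1 letters.
chain = word[-1] + word[:-1]        # "ba"
strings = ["".join(s) for s in product(alphabet, repeat=2 * k)]

def prob(pred, given=lambda s: True):
    # Exact conditional probability over the uniform sample space.
    matching = [s for s in strings if given(s)]
    return Fraction(sum(pred(s) for s in matching), len(matching))

at_pos_k = lambda s: s[k - 1:2 * k - 1] == chain    # chain at position k
word_at_1 = lambda s: s[:k] == word                 # word at position 1

print(prob(at_pos_k))              # unconditional: 3**-k = 1/9
print(prob(at_pos_k, word_at_1))   # conditional:  3**-(k-1) = 1/3
```

Conditioning on the word at position $1$ raises the probability of the chain pattern by a factor of the alphabet size, just as "proof" at position $1$ raises the probability of "fproo" at position $5$ from $26^{-5}$ to $26^{-4}$.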

Second answer:

Note that $E[X+Y]=E[X]+E[Y]$ holds in full generality, even if $X$ and $Y$ are not mutually independent: proofs of linearity of expectation do not assume independence of $X$ and $Y$. In other words, you do not need to impose that restriction.
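A minimal illustration of this point: two indicator variables that are as dependent as possible (one determines the other) still satisfy linearity of expectation.

```python
from fractions import Fraction

# Two maximally dependent indicators on one fair coin flip:
# Y = 1 - X, so X and Y are never both 1, yet linearity still holds.
outcomes = ["H", "T"]
X = {"H": 1, "T": 0}
Y = {"H": 0, "T": 1}

E = lambda f: sum(Fraction(f[w], len(outcomes)) for w in outcomes)
S = {w: X[w] + Y[w] for w in outcomes}    # the sum X + Y, outcome by outcome

print(E(S), E(X) + E(Y))    # both equal 1
```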

Third answer:

Where in this proof have we imposed that restriction?

Your observation that $X_i=1 \implies \sum_{n=1}^4 X_{i+n}=0$ restricts $X_{i+1},\dots,X_{i+4}$ is certainly valid. There is no need to take this restriction into consideration, because it is self-imposing: $$X_i=1 \implies \sum_{n=1}^4 X_{i-n}=0$$ Edit:

$E[X_{i+1}],\dots,E[X_{i+4}]$ are not restricted; only the values of $X_{i+1},\dots,X_{i+4}$ in the specific case $X_i=1$ are restricted. I have corrected the statement above. My point was that $$(X_i=1\implies X_{i+1}=0)\equiv(X_{i+1}=1\implies X_i=0)$$ Perhaps I should have said $$\sum_{n=1}^4 X_{i+n}\neq0\implies X_i=0$$

Fourth answer:

This is not an answer to the question, so strictly speaking it may not be welcome here, but I think any putative proof that the question's expectation formula is invalid should account for this data somehow, and so should any alternative expectation formula for this experiment. The formula in the question matches the simulation results below.

I simulated the monkey activity in laboratory conditions ($6$-letter alphabet, text length from $n=5$ to $n=7$), running $10$ rounds of $m = 1{,}000{,}000$ texts each (twice over), and got the following results (observed counts sorted, with the median position marked). The expected count is computed as $m\cdot(n - 4)\cdot(1/6)^5$, and each observed count is the number of occurrences of "proof" in the million texts of that trial (the $6$-letter alphabet contained the letters of "proof" and two others).

# With text-length n = 5 and a 6-letter alphabet:
# expected: 128.60082304526745
# observed: 103 114 120 124 128 | 128 131 140 146 146  
# observed: 114 115 118 123 124 | 127 131 133 134 136

# With text-length n = 6:
# expected: 257.2016460905349
# observed: 238 243 246 252 256 | 257 260 266 269 278  
# observed: 215 231 232 237 237 | 242 266 273 284 284

# With text-length n = 7
# expected: 385.8024691358023
# observed: 351 353 366 375 384 | 385 390 392 395 401
# observed: 340 346 369 374 384 | 403 415 417 417 421

Added simulation results with $n = 15$, where more than one "proof" can occur in a single text. Same format, same apparent result.

# With text-length n = 15
# expected: 1414.6090534979419
# observed: 1362 1371 1372 1382 1385 | 1416 1418 1433 1434 1463
# observed: 1375 1382 1393 1396 1397 | 1416 1447 1454 1464 1477
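For reference, here is a sketch of what such a simulation might look like (this is my reconstruction, not the original code; the per-round text count is reduced from $1{,}000{,}000$ so it runs quickly):

```python
import random

def count_occurrences(text, word):
    # Count occurrences at every start index (overlaps would be counted too,
    # though "proof" cannot overlap itself).
    return sum(text[i:i + len(word)] == word
               for i in range(len(text) - len(word) + 1))

def one_round(n, m, alphabet, word, rng):
    # Total count of the word over m random texts of length n.
    return sum(count_occurrences("".join(rng.choices(alphabet, k=n)), word)
               for _ in range(m))

# "Laboratory conditions": a 6-letter alphabet containing the letters of
# "proof" plus two others (my choice of extras); m reduced for speed.
alphabet, word = "profxy", "proof"
n, m, rounds = 15, 10_000, 10
rng = random.Random(1)

expected = m * (n - 4) * (1 / 6) ** 5
observed = sorted(one_round(n, m, alphabet, word, rng) for _ in range(rounds))
print("expected:", expected)
print("observed:", observed)
```

The sorted per-round counts should scatter around the expected value $m\cdot(n-4)\cdot(1/6)^5$, reproducing the pattern in the tables above at smaller scale.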