Consecutive words out of a random bank

44 Views Asked by At

A library of 20,000 distinct words is generated. Two words out of the 20,000 are "mathematics" and "rocks". What is the number of times that "mathematics" "rocks" will end up next to each other in that specific order in a sequence of 2,000,000 words chosen at random uniformly (replacements occur with each pick)?

How do we go about solving for the expected value for the occurrence of the sequence of the two words for "mathematics rocks" (E[X])?

Where I am right now: I know that this is a linearity of expectation problem, where I first need to calculate the probability of the two words occurring in that particular order, but how do I apply that probability over the scope of 2,000,000?

2

There are 2 best solutions below

0
On

For $i\in \{1,2000000-1\}$ let $X_i$ be the indicator variable for that slot. Thus $X_i=1$ if "mathematics" appears in slot $i$ and "rocks" appears in slot $i+1$, and $X_i=0$ otherwise. Of course $$E[X_i]=\left( \frac 1{20000}\right)^2$$

By Linearity the answer you want is given by $$E=E\left[\sum X_i\right]=\sum E[X_i]=1999999\times \left( \frac 1{20000}\right)^2\approx .005$$

0
On

Following on lulu's answer, if $T$ denotes the total number of occurrences of "mathematics" followed by "rocks", so $T=\sum_{i=1}^{N-1} X_i$, where $N$ is shorthand for $2000000$, we have $$\operatorname{Var}(T) = (N-1)(p^2-p^4) - 2(N-2)p^4,$$ where $p$ is shorthand for $1/20000$. One can see this by noting that if $|t-s|>1,$ then $X_s$ and $I_t$ are independent and hence uncorrelated. The remaining cases are: $\operatorname{Cov}(X_s,X_t) = -p^4$ if $|t-s|=1$ and $\operatorname{Var}(X_t) = \operatorname{Cov}(X_t,X_t) = p^2-p^4.$ The sequence of $X_t$ values is $m$-dependent (for $m=1$ or $m=2$ depending on the definition) thus $T$ has a limiting gaussian distribution, as $N\to\infty$ and $p$ remains fixed: $$ \frac{T-Np^2}{\sqrt{N(p^2-3 p^4)}} \Rightarrow N(0,1).$$