What is the expected value of the number of random letters needed to produce my string.

1.5k Views Asked by At

Consider the string $p$ with length = $n$. Suppose we have the alphabet $\sum = (a_{1}..a_{k})$ and empty string $s$. All symbols have equal probability of choosing ($p(a_{1})=...=p(a_{k})$). We could add any symbol from our alphabet to end of the string $s$. So my question is:

What is the expected value of the number of letters to be added to get our string $p$?

1

There are 1 best solutions below

0
On BEST ANSWER

Here's a concise recursive version of the main theorem proved in Nielsen's article.

Suppose $X=X_1X_2X_3\ldots$ is an infinite sequence of i.i.d. random letters from a finite alphabet $\Sigma$. Let $T_w$ denote the first-occurrence time of the word $w$ in the sequence $X$; i.e., $T_w$ is the least $t$ such that $w$ occurs as a subword in $X_1\ldots X_t$.

Theorem. For any nonempty word $w\in\Sigma^*$,

$$\mathbb{E}T_w=|\Sigma|^{|w|}+\mathbb{E}T_{w'}$$

where $w'$ denotes the longest proper prefix of $w$ that's also a suffix of $w$, and by convention $\mathbb{E}T_\text{empty string} = 0$.

NB: Here the first-occurrence time of a word is taken to be the index of the last letter in that occurrence, whereas Nielsen takes it to be the index of the first letter in that occurrence.

Example (supposing letters $a,b\in\Sigma$)

$$\begin{align} \mathbb{E}T_{abaaba} &=|\Sigma|^6 + \mathbb{E}T_{aba}\\ &=|\Sigma|^6 + |\Sigma|^3 + \mathbb{E}T_{a}\\ &=|\Sigma|^6 + |\Sigma|^3 + |\Sigma|^1 + \mathbb{E}T_\text{empty string}\\ &= |\Sigma| \times(100101)_{|\Sigma|} \end{align} $$

E.g., taking $\Sigma=\{a,b\}$, this gives $\mathbb{E}T_{abaaba}=2^6+2^3+2^1=\color{blue}{74}$. Consistent with this, I ran $10^6$ simulations of $T_{abaaba}$ (using Sage) and found a sample mean equal to $\color{blue}{74}.0$ with standard error $0.1$.