However, how long will it take to randomly generate an $n$-universal word?


Suppose $A$ is a finite alphabet, $|A| = m$. Call a word $w \in A^*$ $n$-universal iff it contains every word from $A^n$ as a (contiguous) subword. Now suppose we generate a random $n$-universal word in the following way: we start with the empty word and at each step append on the right a symbol generated independently and uniformly at random. We stop as soon as the word becomes $n$-universal (this almost surely happens eventually, by the infinite monkey theorem). However, how long will it take?

Let's denote by $Eu(n, m)$ the expected length of a word generated this way (equivalently, the expected number of steps to generate it). I would like to know the value of $Eu(n, m)$, or at least its asymptotics for large $n$ and $m$.
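Before attacking this analytically, $Eu(n, m)$ can be estimated by direct simulation. A small Python sketch (the function names are my own):

```python
import random
from collections import deque

def sample_length(n, m, rng):
    """Append uniform symbols from an m-letter alphabet until every
    length-n word has occurred as a subword; return the final length."""
    seen, target = set(), m ** n
    window = deque(maxlen=n)      # the last n symbols generated
    length = 0
    while len(seen) < target:
        window.append(rng.randrange(m))
        length += 1
        if len(window) == n:
            seen.add(tuple(window))
    return length

def estimate_Eu(n, m, trials=10_000, seed=0):
    """Monte Carlo estimate of Eu(n, m)."""
    rng = random.Random(seed)
    return sum(sample_length(n, m, rng) for _ in range(trials)) / trials
```

For instance, `estimate_Eu(n, 1)` returns exactly $n$, and `estimate_Eu(1, m)` should approach the coupon-collector value below.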

For $m = 1$ there is only one word of length $n$, so we are guaranteed to obtain it at the $n$-th step. Thus $Eu(n, 1) = n$.

For $n = 1$ this is the coupon collector's problem: we generate an arbitrary symbol, then wait until a symbol that was not generated before appears, and repeat until the alphabet is exhausted. Thus $Eu(1, m) = m\sum_{i = 1}^{m} \frac{1}{i} = m(\ln m + \gamma) + O(1)$.
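The exact harmonic-sum value and its asymptotic can be compared directly; a small Python check (the value of $\gamma$ is hard-coded):

```python
from fractions import Fraction
from math import log

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def Eu_1(m):
    """Exact Eu(1, m) = m * H_m (coupon collector expectation)."""
    return m * sum(Fraction(1, i) for i in range(1, m + 1))

print(Eu_1(2), Eu_1(3))                            # 3 11/2
print(float(Eu_1(100)), 100 * (log(100) + GAMMA))  # both near 518
```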

For $n = m = 2$, we first generate two symbols. If they are the same, we wait until the other symbol is generated and then look at what comes after it. If that is the same symbol, we only have to wait until the initial symbol appears again; otherwise, we need to wait until that other symbol appears twice in a row. If the two initial symbols are different, we generate an additional one. If it is the same as the previous one, we wait until the first two symbols come in a row; otherwise we wait until two second symbols come in a row. Thus $Eu(2, 2) = 2 + \frac{1}{2}(2 + \frac{1}{2}\cdot 6 + \frac{1}{2}\cdot 2) + \frac{1}{2}(\frac{1}{2} + 6) = \frac{33}{4}$.

However, I do not know how to calculate it for different $(n, m)$.


The binary ($m=2$) case is solved by Terry R. McConnell, "The Expected Time to Find a String in a Random Binary Sequence", 2001, p.5, in the context of how long it takes a given Markov chain to visit all of its states. This gives an explicit solution in terms of an easily-constructed doubly stochastic transition matrix $P$.
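For instance, for $n=2$ the matrix $P$ is the transition matrix of the shift map on length-$2$ binary strings; a quick numpy illustration (0-based indexing, my own construction) shows it is indeed doubly stochastic:

```python
import numpy as np

n = 2
N = 2 ** n  # one state per binary string of length n
# from state r, appending a bit leads to state 2r or 2r+1 (mod N)
P = np.zeros((N, N))
for r in range(N):
    P[r, (2 * r) % N] = P[r, (2 * r + 1) % N] = 0.5

print(P.sum(axis=1))  # rows sum to 1
print(P.sum(axis=0))  # columns sum to 1, so P is doubly stochastic
```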

Letting $T$ denote the waiting time until "all strings of a given length have been observed in the input stream", he shows that ... $$ET = n + 2^{-n}\sum_\sigma E_\sigma T\tag{1}$$ where the sum is over all length-$n$ binary strings, which can be expressed in the following form: $$\sum_\sigma E_\sigma T=\sum_{i\in \text{IndexSet}}\,\sum_{A\subseteq\text{IndexSet},\,A\neq\emptyset}(-1)^{\text{card}(A)+1}\sum_{k\in \text{IndexSet}\setminus A}{(I-P_A)^{-1}}_{i,k}.\tag{2} $$ Here $\text{IndexSet}$ is either $\{1,\dots,2^n\}$ or $\{0,\dots,2^n-1\}$ depending on the indexing scheme for matrix elements (e.g., the former in Maple, the latter in SageMath), $A$ ranges over all nonempty subsets of $\text{IndexSet}$, and the matrix $P_A$ is the result of zeroing the rows and columns of $P$ whose indices are in $A$.

He reports ...

"For $n = 1,$ we have $E T = 3.$ For $n = 2, 3,$ and $4$ the values are, respectively, $9.5, {82959\over 3640} \approx 22.79...,$ and $15196470103027446764838236318296131920851968094230950060807620630943693\over 259180013898712074394595904741652282392543237486671525526056835614400,$ which is approximately equal to $58.63287788.$ (We reproduce the exact value to discourage those who might look for a simple formula.)"

I translated McConnell's algorithm into SageMath and confirmed those numbers, except when $n=3$: the reported ${8\color{blue}{29}59\over 3640} \approx 22.79$ contains transposed digits and should instead be ${8\color{blue}{92}59\over 3640} \approx 24.52$, which I also confirmed by Monte Carlo simulation with $10^6$ samples.
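For readers without SageMath, the computation (1)-(2) translates readily into a floating-point Python/numpy sketch (assuming the 0-based shift-register indexing; the function name `ET` is mine), which reproduces these values for small $n$:

```python
import numpy as np
from itertools import combinations

def ET(n):
    """Evaluate ET = n + 2^{-n} * (triple sum (2)) in floating point."""
    N = 2 ** n
    # shift-register transition matrix: state r -> states 2r, 2r+1 (mod N)
    P = np.zeros((N, N))
    for r in range(N):
        P[r, (2 * r) % N] = P[r, (2 * r + 1) % N] = 0.5
    I = np.eye(N)
    total = 0.0
    for size in range(1, N + 1):           # all nonempty subsets A
        for A in combinations(range(N), size):
            PA = P.copy()
            PA[list(A), :] = 0.0           # zero rows indexed by A
            PA[:, list(A)] = 0.0           # zero columns indexed by A
            Q = np.linalg.inv(I - PA)
            rest = [k for k in range(N) if k not in A]
            total += (-1) ** (size + 1) * Q[:, rest].sum()
    return n + total / N

print(ET(1), ET(2), ET(3))  # approx 3.0, 9.5, 24.52
```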

(Consequently, something's wrong with the OP's calculation for $(m,n)=(2,2)$, which should give exactly $19\over 2$.)
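That value is easy to verify independently: a first-step analysis over the $2\cdot 2^4$ states (last symbol, set of length-$2$ factors seen so far), solved bottom-up with exact rationals in plain Python (the state encoding here is my own), gives exactly $19/2$:

```python
from fractions import Fraction
from itertools import combinations

PAIRS = [(a, b) for a in (0, 1) for b in (0, 1)]  # the 4 binary length-2 factors
FULL = frozenset(PAIRS)
HALF = Fraction(1, 2)

def expected_length():
    # f[(last, S)] = expected number of further symbols until all 4 factors
    # have occurred, given the last symbol and the set S of factors seen.
    f = {(last, FULL): Fraction(0) for last in (0, 1)}
    # S only grows, so process seen-sets from largest to smallest; within a
    # fixed S the two states (last = 0, 1) form a 2x2 linear system.
    for size in range(3, -1, -1):
        for combo in combinations(PAIRS, size):
            S = frozenset(combo)
            # x_last = const[last] + coef[last][0]*x_0 + coef[last][1]*x_1
            const = [Fraction(1), Fraction(1)]
            coef = [[Fraction(0), Fraction(0)], [Fraction(0), Fraction(0)]]
            for last in (0, 1):
                for nxt in (0, 1):
                    S2 = S | {(last, nxt)}  # appending nxt creates factor (last, nxt)
                    if S2 == S:
                        coef[last][nxt] += HALF
                    else:
                        const[last] += HALF * f[(nxt, S2)]
            # solve the 2x2 system by Cramer's rule
            a, b = 1 - coef[0][0], -coef[0][1]
            c, d = -coef[1][0], 1 - coef[1][1]
            det = a * d - b * c
            f[(0, S)] = (const[0] * d - b * const[1]) / det
            f[(1, S)] = (a * const[1] - c * const[0]) / det
    # the first symbol costs one step and lands in state (symbol, empty set)
    return 1 + HALF * (f[(0, frozenset())] + f[(1, frozenset())])

print(expected_length())  # 19/2
```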

McConnell's article also has some asymptotic results, including that $E\,T\sim \log(2)\,n\,2^n$ in the binary case.


EDIT:

I find a computational speed-up by a factor of about $10$ by rearranging the triple summation (2) as follows:

$$\sum_\sigma E_\sigma T=\sum_{A\subseteq\text{IndexSet},\,A\neq\emptyset}(-1)^{\text{card}(A)+1}\sum_{k\in \text{IndexSet}\setminus A}\,\sum_{i\in \text{IndexSet}}\,{(I-P_A)^{-1}}_{i,k}.\tag{3} $$

Here is SageMath code for the revised (faster) algorithm:

# SageMath code for a revised version of McConnell's algorithm
def ET(n):
    # build the 2^n-by-2^n transition matrix P
    Nrows = 2^n
    P = matrix(QQ,Nrows) # QQ specifies rational elements
    c = 0
    for r in range(Nrows):
        P[r,c:c+2] = matrix([1/2,1/2])
        c = (c+2) % Nrows
    # prepare fixed quantities for use in the triple sum    
    Identity = matrix.identity(Nrows)
    IndexSet = set(range(Nrows))
    PowerSet = iter(Subsets(IndexSet))
    next(PowerSet)  # skip the empty set
    # perform the triple summation in revised order
    Sum3 = 0 
    for A in PowerSet:
        PA = copy(P)
        for j in A:  # zero the rows & cols indexed by A
            PA[j,:] = 0
            PA[:,j] = 0
        QA = (Identity - PA).inverse() 
        IndexSetMinusA = IndexSet.difference(A) 
        Sum2 = 0
        for k in IndexSetMinusA:
            Sum1 = 0
            for i in IndexSet:
                Sum1 += QA[i,k]
            Sum2 += Sum1
        Sum3 += (-1)^(len(A)+1)*Sum2
    return( n + Sum3/Nrows )

for n in [1..4]:  # n = 5 would require iterating over 2^32 subsets, which is infeasible
    res = ET(n)
    print(f"ET({n}) = {res}  (approx {res.n()})")