How many essentially different strings are there of length $\leq n$ and over an alphabet of size $|\Sigma| = m$?

1k Views Asked by At

For example, $aaaaaabb \simeq ccccccdd$ essentially, because a smallest grammar algorithm would perform the exact same steps to reduce one as the other. So how can I phrase this in terms of formal strings, their lengths, etc?

If the length is $0$ there is 1 string.
If the length is $1$ there is $1$ string. If the length is $2$, then $t = aa$ or $t = ab$ wlog so there are $2$ strings. $n = 3 \implies t = aaa, aab, abb, aba, abc$ wlog so there are 5 strings!!

How do you expect me to count this without your help :)

2

There are 2 best solutions below

1
On BEST ANSWER

In short:

I found the following: The number of strings of length $n$ with exactly $i$ different letters equals the number of strings of length $n-1$ with exactly $i-1$ letters plus the number of strings of length $n-1$ with exactly $i$ letters times $i$.

More formally:

Let $x^{(n)}$ denote a vector such that $x^{(n)}_i$ holds the number of strings of length $n$ using exactly $i$ different letters. For $i>m$ we have that $x^{(n)}_i = 0$ because we only have $m$ different letters in our alphabet and thus no strings with exactly $i>m$ letters. So we are only interested in the first $m+1$ entries (strings with exactly $0$ different letters up to strings with exactly $m$ different letters) of $x^{(n)}$. Thus from now on I'll assume that $x^{(n)} \in \mathbb{N}^{m+1}$.

Now the following can be seen to hold: $x^{(n)}_i = x^{(n-1)}_i \cdot i + x^{(n-1)}_{i-1}$. Why is this the case? I'll give an informal agrument:

1) Consider any string of length $n-1$ with exactly $i$ different letters. Then we can append any of the $i$ letters to the string to get a string of length $n$ with exactly $i$ different letters. Consider now any string of length $n-1$ with exactly $i-1$ different letters. Then we append a letter which is not yet present in the string to get a string of length $n$ with exactly $i$ different letters. Thus $x^{(n)}_i \geq x^{(n-1)}_i \cdot i + x^{(n-1)}_{i-1}$.

2) Consider any string of length $n$ with exactly $i$ different letters. By taking away the last letter we get a string of length $n-1$ which either has exactly $i$ or exactly $i-1$ different letters. So every string of length $n$ with exactly $i$ different letters can be obtained by appending the right letter to a string of length $n-1$ with exactly $i$ or exactly $i-1$ letters. Thus $x^{(n)}_i \leq x^{(n-1)}_i \cdot i + x^{(n-1)}_{i-1}$

With 1) and 2) it is clear that the equation $x^{(n)}_i = x^{(n-1)}_i \cdot i + x^{(n-1)}_{i-1}$ holds.

So basically we found a recursive formula for computing $x_i^{(n)}$ for any $i, n \in \mathbb(N)$. We can write this using matrices: $$x^{(n)} = A^n \cdot x^{(0)}$$ where $$A = \begin{bmatrix}0 & 0 & ... & ... & ... & ... & 0\\1 & 1 & 0 & ... & ... & ... & 0\\ 0 & 1 & 2 & 0 & ... & ... & 0 \\ 0 & 0 & 1 & 3 & 0 & ... & 0 \\ ... & ... & ... & ... & ... & ... & ... \\ 0 & ... & ... & ... & 0 & 1 & m \end{bmatrix}$$ In words: A is an $(m+1) \times (m+1)$ matrix with the natural numbers from $0$ to $m$ on the diagonal and $1$'s on the subdiagonal.

Now I tried to simplify the formula: To that end we try to compute $A^n$. This can be done by using the eigenvalue-decomposition of $A$. Since $A$ is a lower triangular matrix we have that $det(A-\lambda * I) = -\lambda(1 - \lambda)(2- \lambda) ... (m - \lambda)$. Thus $A$ has $m+1$ different eigenvalues. We can conclude that $A$ admits an eigenvalue-decomposition. Our formula thus reduces to $$x^{(n)} = U \cdot \Sigma^n \cdot U^{-1} \cdot x^{(0)}$$ where $U$ is an Eigenbasis of $A$ and $\Sigma$ is the diagonal matrix which holds the eigenvalues of $A$. The eigenvalues of $A$ are the natural numbers from $0$ to $m$.

What you initially asked for was the number of strings of length $\leq n$ over an alphabet of size $m$. Now it is clear that $||x^{(n)}||_1$ (1 - norm or just sum of all elements) equals the number of strings of length $n$ over an alphabet of size $m$. This of course gives also a way to compute the number of strings of length $\leq n$ over an alphabet of size $m$.

I didn't compute an Eigenbasis for $A$ but that can be done by a computer fairly easily. Also I think this formula might be simplified further. What one could try is to find an easier formula by computing an Eigenbasis and the 1-norm of the resulting vector. If an easier explicit formula was found I think it shouldn't be too hard to prove the formula by induction (Since I didn't provide any real proof).

I hope this helps, I just wrote down what I tried and found.

2
On

The equivalence classes you are trying to count are called "restricted growth strings". The sequence of counts of all RGS of length $n$ are the "Bell numbers", after the mathematician Eric Temple Bell who studied them in the 1930s.

This corresponds to the count of your "essentially different strings" for the case $n=m$. See sequence A000110 in the Online Encyclopedia of Integer Sequences; there is a link to Bell's work in the reference section.

For the more general question, the number of RGS of length $n$ from an alphabet of size $m$ can be computed as the sum of Stirling numbers of the second kind: \begin{equation}\sum\limits_{k=1}^m\begin{Bmatrix} n \\ k \end{Bmatrix}\end{equation}