Recursive factorization of words

133 Views Asked by At

Let $\Sigma$ be an alphabet of cardinal $n$. Let $T$ be the set of ordered binary tree whose nodes are labeled by words over $\Sigma$, such that each leaf is labeled by a letter $a\in \Sigma$ and the parent of two nodes is labeled by the concatenation of their labels. Given such a tree $t\in T$, we write $W_t$ for the set of words that appear as labels in $t$, and $|W_t|$ for its cardinal. Given $u\in\Sigma^*$, let $T_u$ be the subset of $T$ consisting of trees whose root is labeled by $u$. I am interested in bounds on $$f(n,l)=\max_{u\in\Sigma^l}\min_{t\in T_u}|W_t|$$ in terms of $l$ when $n$ is constant, or in terms of both $l$ and $n$.

This quantity is useful because if we replace the tree by the corresponding directed acyclic graph, $|W_t|$ corresponds to the number of nodes in the DAG, and the quantity therefore correspond to the maximal number of nodes one may need to represent a word of length $l$. I use this DAG structure in this answer.

For now I only know that:

  • $\boxed{l\le n \Rightarrow f(n,l)\ge l}$ If $l\le n$ then by taking a word $u$ that uses $l$ distinct letters, we get $|W_t|\ge l$ for any $t\in T_u$ by counting the leaves.

  • $\boxed{l=2^k\Rightarrow f(1,l)=\log_2(l)+1}$ If $n=1$ and $l=2^k$, then we can build a tree $t\in T_u$ that only uses the labels $a^{2^i}$ for $1\le i\le k$, and we have $|W_t|=k+1=\log_2(l)+1$ words.

Ideally, I would like a bound that either proves or disproves $f(2, l)=o(l)$.

1

There are 1 best solutions below

0
On BEST ANSWER

Unless I made a mistake, I think it's easy to show $f(2,l) = o(l)$.

It suffices to prove this for $l$ a power of $2$, say $l = 2^k$.

Take $u \in \{a,b\}^l$ and just do a complete binary tree of depth $l$ with leaves being the bits of $u$.

At depth $j$, there are $2^j$ strings each of length $2^{k-j}$. Since there are at most $2^{2^{k-j}}$ strings of length $2^{k-j}$, there are at most $\max(2^j,2^{2^{k-j}})$ distinct strings at depth $j$. So there are a total of at most $\sum_{j=0}^l \max(2^j,2^{2^{k-j}})$ strings that appear as nodes. We upper bound this sum by $\sum_{j=0}^{k-\frac{1}{2}\log k} 2^j+\sum_{j=k-\frac{1}{2}\log k}^k 2^{2^{k-j}}$. The first term is basically $2^{k-\frac{1}{2}\log k} = \frac{1}{\sqrt{k}}2^k$, and the second term is basically $2^{2^{1/2\log k}} = 2^{\sqrt{k}}$. So we get $f(2,l) = O(\frac{1}{\sqrt{k}}2^k)$.

A more interesting question is whether $\log f(2,l) = o(\log l)$, i.e., whether $f(2,2^k) = 2^{o(k)}$, but I think this can be disproved by looking at random strings, since all substrings of length $> \log n$ will be distinct.