Suppose that a word of length $L \in \mathbf{N}$ is defined to be an element of $A^{L}$, where $A$ is a finite set of characters, and that a subsequence of a word is an ordered list of characters which appear consecutively in the word (i.e., a contiguous subword). Suppose I am going to choose a word of length $L$, you choose a set $X$ of words of length at most $L$, and I promise to tell you how many times each of your words appears as a subsequence of my word. What is the smallest set $X$ that you can choose and be guaranteed of being able to deduce my word? How does it scale with $L$ and the size of $A$?
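For concreteness, the counting oracle described above can be sketched in Python (a minimal illustration; the function name `count_occurrences` is my own, not part of the question):

```python
def count_occurrences(word, pattern):
    """Count (possibly overlapping) occurrences of `pattern` as a
    contiguous subword of `word`."""
    if len(pattern) > len(word):
        return 0
    return sum(word[i:i + len(pattern)] == pattern
               for i in range(len(word) - len(pattern) + 1))

# Overlapping occurrences count separately:
# count_occurrences("ababab", "ab") == 3
# count_occurrences("aaaa", "aa") == 3
```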
Thoughts: You could of course take $X$ to contain every word of length $L$. One thought was to construct words of length $L$ from words of length $L-1$, words of length $L-1$ from words of length $L-2$, and so on, so that $X$ only needs to contain small words, but I haven't been able to make that work. Words like ABABABABABAB...ABAB seem quite hard to deduce. I think knowing how many times every word of length 2 or 3 appears might be enough, but I haven't proved or disproved this yet.
Edit: Here's a strategy that uses $O(A^2 L^2)$ words, which is adaptive in the sense that the words used later depend on the answers to the questions about earlier words. I'll only describe it for a binary alphabet $\{ a, b \}$ for simplicity but hopefully the generalization is clear.
Our strategy will be to factor an arbitrary word in $\{ a, b \}^L$ into a product of consecutive runs $a^{\ell_1} b^{\ell_2} a^{\ell_3} \dots$ and attempt to figure out how long the runs are and where they're positioned relative to each other. To be clear, by "runs" I mean maximal runs: each should be as long as possible, so that it does not continue to either the left or the right.
First we ask about all words of the form $a^k$ or $b^k$; these tell us the exact lengths of the runs (by induction on the lengths, starting from the longest; we need to subtract repeatedly, and I'd rather not write out the details but can if requested). Now we ask about all words of the form $ab^k$ and $ba^k$, which again tells us the exact number of times each run of the form $b^k$ is prefixed by an $a$ and vice versa (again by induction). What this reveals is that among all the runs there is a unique one without a prefix: e.g. a unique run $b^k$ which does not extend to $ab^k$ (because the count for $ab^k$ is one less than the count for $b^k$), or vice versa. This is the run that begins our word. So far we've used at most $4L - 2$ words.
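The subtraction argument for recovering the run lengths can be made concrete as follows (a sketch under my own naming; `counts[k]` stands for the answer to the query $a^k$). If the maximal $a$-runs have lengths $m_1, \dots, m_r$, then the count for $a^k$ is $\sum_i \max(m_i - k + 1, 0)$, and working down from the largest $k$ with a nonzero residual count recovers the multiset of run lengths:

```python
def run_lengths_from_counts(counts, L):
    """Recover the multiset of maximal a-run lengths from the
    occurrence counts of a^1, ..., a^L.

    counts[k] = number of occurrences of a^k as a contiguous subword.
    Returns a dict mapping run length -> number of runs of that length.
    """
    lengths = {}
    for k in range(L, 0, -1):
        # A run of length m >= k contributes (m - k + 1) occurrences of
        # a^k; subtracting the contributions of the longer runs found so
        # far leaves exactly the number of runs of length k.
        residual = counts.get(k, 0) - sum(n * (m - k + 1)
                                          for m, n in lengths.items())
        if residual > 0:
            lengths[k] = residual
    return lengths
```

For example, for the word `aabaaab` the counts for $a^1, a^2, a^3$ are $5, 3, 1$, and the recovered multiset is one run of length 3 and one of length 2.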
I'll assume WLOG that our word starts with $a^{\ell_1}$. Now we ask about all words of the form $a^{\ell_1} b^k$; this tells us which pairs consisting of a run $a^{\ell_1}$ followed by a run $b^k$ occur, and hence what the next part of our word can be. There may be more than one possibility, corresponding to other runs of the form $a^{\ell_1}$ elsewhere in the word; so, as above, we ask about all words of the form $b a^{\ell_1} b^k$, which will reveal that exactly one $a^{\ell_1} b^k$ does not have a prefix, as before. This establishes the next run $a^{\ell_1} b^{\ell_2}$ in our word. This step uses at most $2L - 3$ words.
Continuing in this way (again, I'd rather not write out the details but can if requested) we establish the next run $a^{\ell_1} b^{\ell_2} a^{\ell_3}$, and inductively we establish the entire word. Establishing $\ell_k$ requires using at most $2L - 2k + 1$ words. Altogether, we've used at most
$$(2L-1) + \sum_{k=1}^{L} (2L - 2k + 1) = L^2 + 2L - 1$$
words. For a general alphabet of size $A$ this needs to be multiplied by about $(A - 1)^2$.
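As a quick sanity check of the arithmetic, the closed form above can be verified numerically (my own check, not part of the argument):

```python
# Verify (2L - 1) + sum_{k=1}^{L} (2L - 2k + 1) == L^2 + 2L - 1.
for L in range(1, 50):
    total = (2 * L - 1) + sum(2 * L - 2 * k + 1 for k in range(1, L + 1))
    assert total == L * L + 2 * L - 1
```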
This count is sloppy; the second factor of $L$ compared to the information-theoretic lower bound comes from the fact that we need to check whether long runs exist, but 1) long runs are rare and 2) when they exist they make the rest of the word shorter which reduces the number of remaining steps in the strategy and the number of words we need to use in those steps. A more careful analysis of the strategy should be able to reduce the bound on the number of words to closer to $O(L)$.
Old answer: Let's work with a binary alphabet for simplicity. There are $1 + 2 + \dots + 2^{\ell-1} = 2^{\ell} - 1$ words of length less than $\ell$ (including the empty word), and a word can appear as a subword at most $L$ times, so if I ask you about each of them I get at most $L+1$ possible responses per question and hence at most $(L + 1)^{2^{\ell} - 1}$ possible responses in total. So to guarantee recovery we need
$$(L + 1)^{2^{\ell} - 1} \ge 2^L$$
which shows that it doesn't suffice to take $\ell$ to be a constant. Taking logarithms twice gives
$$\ell \ge \log_2 \left( \frac{L}{\log_2(L+1)} + 1 \right).$$
This is the "information-theoretic lower bound," telling us what $\ell$ has to be so that we are even in principle getting enough bits of information to determine an arbitrary word of length $L$. If instead of using short words we're content to minimize the size of the set of words we ask about, say by asking about $W$ words, then the lower bound is $(L + 1)^W \ge 2^L$, and taking logarithms as above gives
$$W \ge \frac{L}{\log_2(L+1)}.$$
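To get a feel for how this bound grows, here are a few values of $L / \log_2(L+1)$ (my own numerical illustration; the helper name `lower_bound` is mine):

```python
import math

def lower_bound(L):
    """Information-theoretic lower bound on the number of queried words W."""
    return L / math.log2(L + 1)

for L in (10, 100, 1000, 10000):
    print(L, round(lower_bound(L), 2))
```

So the number of questions needed grows almost linearly in $L$, slowed only by the logarithmic factor.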