Number of binary sequences with exactly $n$ distinct subsequences.

189 Views Asked by At

Prove that number of binary sequences that have exactly $n$ distinct subsequences (including empty one) is $\varphi(n+1)$(where $\varphi(n)$ is Euler's totient function). For example, There are $\varphi(8)=4$ sequences with $7$ distinct subsequences: $$\{000000,111111,010,101\}$$ I came up with this question while solving some other problem and wrote a simple program to calculate it for small numbers and when I searched for the sequence on OEIS, to my surprise it was same as Euler's totient function, but I couldn't prove this result.

1

There are 1 best solutions below

8
On BEST ANSWER

Here is a sketch of the proof.

First of all, the following construction is well-known: starting from the interval $\begin{bmatrix}\frac{0}{1}&\frac{1}{1}\end{bmatrix}$, we divide it into two (by placing $\frac{0+1}{1+1}=\frac{1}{2}$ in the middle), then divide those two intervals (by placing $\frac{0+1}{1+2}=\frac{1}{3}$ and $\frac{1+1}{1+2}=\frac{2}{3}$ in the middle(s) etc. At each step, the sub-interval bounded by $\frac{p}{q}$ and $\frac{r}{s}$ is split by a midpoint $\frac{p+r}{q+s}$. We would say that the fractions $\frac{p}{q}$ and $\frac{r}{s}$ are "parents" of the fraction $\frac{p+r}{q+s}$.

One can prove that all those fractions are reduced, and that sooner or later they will enumerate all rational numbers between $0$ and $1$.

Here is what the interval will look like after $5$ divisions:

$$\begin{bmatrix}\frac{0}{1}&\frac{1}{6}&\frac{1}{5}&\frac{2}{9}&\frac{1}{4}&\frac{3}{11}&\frac{2}{7}&\frac{3}{10}&\frac{1}{3}&\frac{4}{11}&\frac{3}{8}&\frac{5}{13}&\frac{2}{5}&\frac{5}{12}&\frac{3}{7}&\frac{4}{9}&\frac{1}{2}&\frac{5}{9}&\frac{4}{7}&\frac{7}{12}&\frac{3}{5}&\frac{8}{13}&\frac{5}{8}&\frac{7}{11}&\frac{2}{3}&\frac{7}{10}&\frac{5}{7}&\frac{8}{11}&\frac{3}{4}&\frac{7}{9}&\frac{4}{5}&\frac{5}{6}&\frac{1}{1}\end{bmatrix}$$

What has this got to do with sequences of $0$'s and $1$'s?

We can label the rational numbers on this interval (all except $0/1$ and $1/1$) using the following algorithm: label the midpoints on each "level $n$" (i.e. added at the division step $n$) using all the words of length $n-1$ in lexicographic order, left-to-right. This means, we label $\frac{1}{2}$ with the empty word $\epsilon$, and then we label $\frac{1}{3}$ and $\frac{2}{3}$ with $0$ and $1$, respectively, etc. We get a correspondence like this one:

$$\begin{bmatrix}\frac{0}{1}&\frac{1}{6}&\frac{1}{5}&\frac{2}{9}&\frac{1}{4}&\frac{3}{11}&\frac{2}{7}&\frac{3}{10}&\frac{1}{3}&\frac{4}{11}&\frac{3}{8}&\frac{5}{13}&\frac{2}{5}&\frac{5}{12}&\frac{3}{7}&\frac{4}{9}&\frac{1}{2}&\frac{5}{9}&\frac{4}{7}&\frac{7}{12}&\frac{3}{5}&\frac{8}{13}&\frac{5}{8}&\frac{7}{11}&\frac{2}{3}&\frac{7}{10}&\frac{5}{7}&\frac{8}{11}&\frac{3}{4}&\frac{7}{9}&\frac{4}{5}&\frac{5}{6}&\frac{1}{1}\\\text{n/a}&0000&000&0001&00&0010&001&0011&0&0100&010&0101&01&0110&011&0111&\epsilon&1000&100&1001&10&1010&101&1011&1&1100&110&1101&11&1110&111&1111&\text{n/a}\end{bmatrix}$$

We can extend the "parent" terminology, and for each sequence of $0$'s and $1$'s call the sequences corresponding to the parents of the fraction corresponding to the given sequence - the "parents" of the sequence. For example, the parents of $0101$ (corresponding to $\frac{5}{13}$) are $010$ (corresponding to $\frac{3}{8}$) and $01$ (corresponding to $\frac{2}{5}$). As we did not associate any sequences to the fractions $\frac{0}{1}$ and $\frac{1}{1}$, the left-most sequence ($000\ldots 0$) and the right-most sequence ($111\ldots 1$) has only one parent.

What is, I think, the crucial observation here is that the number of different subsequences exactly matches the denominator minus $1$ (!)

For example, the sequence $0101$ has the following $12$ subsequences: $\epsilon, 0, 1, 00, 01, 10, 11, 001, 010, 011, 101, 0101$. On the other hand, the fraction that corresponds to it is $\frac{5}{13}$ and $12=13-1$.

Let's take it on faith that this correspondence can be proven. The consequence is that, in the "grand" table (with all rational numbers between $0$ and $1$) the number of binary sequences with exactly $n$ different subsequences will be the same as the number of reduced fractions between $0$ and $1$ with denominator $n+1$. However, all reduced fractions will feature in the "grand" table somewhere, and so the number of those equals the number of potential numerators (coprime to $n+1$) - which is $\varphi(n+1)$. $\Box$


Now, what is left is to prove the correspondence. I have a sketch of a proof below, which goes by induction on the "level" of the particular fraction/binary sequence, which is the same as the induction on the length of the binary sequence itself. A hint that it is possible comes from the aforementioned sequence $0101$ which corresponds to the fraction $\frac{5}{13}$. The parents of the fraction $\frac{5}{13}$ are $\frac{3}{8}$ and $\frac{2}{5}$, which correspond to the two parents of our original binary sequences: $010$ and $01$. Note also the denominator $13=5+8$ (as per construction), i.e. $(13-1)=(5-1)+(8-1)+1$. This hints that we should be able to prove that:

The number of different subsequences corresponding to a sequence is the sum of the numbers of different subsequences corresponding the parents of that sequence, plus $1$.

If we could prove that, then the full correspondence would follow by induction.

To demonstrate, look at the different subsequences of $0101$ - again:

$$\begin{array}{l}\epsilon\\\color{red}{0}\\\color{blue}{1}\\0\color{red}{0}\\0\color{blue}{1}\\1\color{red}{0}\\1\color{blue}{1}\\00\color{blue}{1}\\01\color{red}{0}\\01\color{blue}{1}\\10\color{blue}{1}\\010\color{blue}{1}\end{array}$$

and also note the subsequences for $01$: $\epsilon, 0, 1, 01$ and for $010$: $\epsilon, 0, 1, 00, 01, 10, 010$.

What we get is: either $\epsilon$ (which will give us "plus $1$" in the general statement), or a subsequence of $01$ followed by $\color{red}{0}$, or a subsequence of $010$ followed by $\color{blue}{1}$.

Now for the (sketch) proof of the general correspondence. As said before, we only need to prove the inductive step outlined above.

If the sequence is $\underbrace{000\ldots 0}_n$, its only parent is $\underbrace{000\ldots 0}_{n-1}$, and the number of different subsequences in the given sequence is $n+1$ and in the parent it is $n$, so the inductive step is valid. This works similarly for the sequence $\underbrace{111\ldots 1}_n$.

Suppose that, now, $s_1s_2\ldots s_n$ is a sequence which has both $0$'s and $1$'s in it. Here is another observation (which I will still leave without proof, which is why the whole proof here is sketchy; I am, however, convinced enough that it is true):

Observation: If $s=s_1s_2\ldots s_n$ is a binary sequence containing both $0$'s and $1$'s, then its parents are precisely:

  • $s_1s_2\ldots s_{n-1}$
  • $s_1s_2\ldots s_{k-1}$ where $s_k$ is the last occurrence of the symbol different from $s_{k+1}=\ldots=s_n$.$\Box$

In other words, to get the parents, you: (a) remove the last symbol; (b) remove the last symbol, and all the symbols preceding it which are the same as it, and additionally remove the last symbol different from it.

Now, one can see that the number of different subsequences of $s$ is given by the sum of the numbers of the different subsequences of the parents, plus $1$. It is because $s$ has the following subsequences:

  • The empty sequence $\epsilon$
  • The sequences obtained by augmenting subsequences of $s_1s_2\ldots s_{n-1}$ (the first parent above) with $s_n$
  • The sequences obtained by augmenting subsequences of $s_1s_2\ldots s_{k-1}$ (the second parent above) with $s_k\ne s_n$ - the last occurence of the symbol $\ne s_n$.

This is true because every nonempty subsequence of $s$ ends either with $s_n$ or with the opposite symbol, and we can always assume that this symbol has actually come from $s_n$ or from $s_k$.$\Box$