How many binary strings (with a given number of occurrences of 0 and 1) are there that do not contain a given substring?

1.8k Views Asked by At

I know my binary string is composed of exactly $n$ $1$s and $m$ $0$s. How many such strings are possible, if we add the constraint that they must not contain a specific given substring $S$ (whose length is $\leq n+m$)?

I am specifically interested in the answer in the case that $S=010$.

Note: I know how to determine the answer programatically / via dynamic programming. I'm looking for a more closed form / combinatoric solution.

For example, if $n=3$, $m=2$, and $S=010$, then the following would be all $7$ relevant ways: $$00111$$ $$01101$$ $$01110$$ $$10011$$ $$10110$$ $$11001$$ $$11100$$

3

There are 3 best solutions below

6
On BEST ANSWER

This answer is based upon the Goulden-Jackson Cluster Method.

We consider the set words of length $n\geq 0$ built from an alphabet $$\mathcal{V}=\{0,1\}$$ and the set $B=\{010\}$ of bad words, which are not allowed to be part of the words we are looking for. We derive a generating function $f(s)$ with the coefficient of $s^n$ being the number of searched words of length $n$.

According to the paper (p.7) the generating function $f(s)$ is \begin{align*} f(s)=\frac{1}{1-ds-\text{weight}(\mathcal{C})}\tag{1} \end{align*} with $d=|\mathcal{V}|=2$, the size of the alphabet and $\mathcal{C}$ the weight-numerator of bad words with \begin{align*} \text{weight}(\mathcal{C})=\text{weight}(\mathcal{C}[010]) \end{align*}

We calculate according to the paper \begin{align*} \text{weight}(\mathcal{C}[010])&=-s^3-s^2\text{weight}(\mathcal{C}[010])\tag{1}\\ \end{align*}

and get \begin{align*} \text{weight}(\mathcal{C})=-\frac{s^3}{1+s^2} \end{align*}

It follows

\begin{align*} f(s)&=\frac{1}{1-ds-\text{weight}(\mathcal{C})}\\ &=\frac{1}{1-2s+\frac{s^3}{1+s^2}}\tag{2}\\ &=\frac{1+s^2}{1-2s+s^2-s^3}\\ &=1+2s+4s^2+7s^3+12s^4+\color{blue}{21}s^5\\ &\qquad+37s^6+65s^7+114s^8+200s^9+351s^{10}+\cdots \end{align*}

The last line was calculated with the help of Wolfram Alpha. The coefficient of $s^5$ shows there are $\color{blue}{21}$ valid words of length $5$ from the alphabet $\{0,1\}$ which do not contain the word $010$.

But we want to also keep track of the number of $0$'s and $1$'s. We get a refinement of $f(s)$ by marking the $0$ with $x$ and the $1$'s with $y$. We obtain from (1) \begin{align*} \text{weight}(\mathcal{C}[010])&=-x^2ys^3-xys^2\text{weight}(\mathcal{C}[010]) \end{align*} and get \begin{align*} \text{weight}(\mathcal{C})=-\frac{x^2ys^3}{1+xys^2} \end{align*}

Using this generalized weight we obtain from (2) a generating function $g(s;x,y)$ \begin{align*} g(s;x,y)&=\frac{1}{1-(x+y)s+\frac{x^2ys^3}{1+xys^2}}\\ &=\frac{1+xys^2}{1-(x+y)s+xys^2-xy^2s^3}\\ &=1+(x+y)s+(x^2+2xy+y^2)s^2+(x^3+2x^2y+3xy^2+y^3)\\ &\qquad+(x^4+2x^3y+4x^2y^2+4xy^3+y^4)s^4\\ &\qquad+(x^5+2x^4y+5x^3y^2+\color{red}{7}x^2y^3+5xy^4+y^5)s^5+\cdots \end{align*}

So, e.g. out of $2^5=32$ binary words of length $5$ there are $\color{blue}{21}$ valid words which do not contain $010$ and $\color{red}{7}$ of them contain $n=2$ zeros and $m=3$ ones:

\begin{array}{cccc} \qquad\color{blue}{00000}\qquad&\qquad01000\qquad&\qquad\color{blue}{10000}\qquad&\qquad\color{blue}{11000}\qquad\\ \color{blue}{00001}&01001&\color{blue}{10001}&\color{red}{11001}\\ 00010&01010&10010&11010\\ \color{blue}{00011}&01011&\color{red}{10011}&\color{blue}{11011}\\ 00100&\color{blue}{01100}&10100&\color{red}{11100}\\ 00101&\color{red}{01101}&10101&\color{blue}{11101}\\ \color{blue}{00110}&\color{red}{01110}&\color{red}{10110}&\color{blue}{11110}\\ \color{red}{00111}&\color{blue}{01111}&\color{blue}{10111}&\color{blue}{11111}\\ \end{array}

0
On

A recurrence and generating function.

For the case $S=010$, let $T_{n,m}$ be the number of ways.

$$T_{n,0}=1, T_{n,1}=n+1, T_{0,m}=1, T_{1,m}=2$$

The total with $n,m>1$, can be computed by recursion.

There are $T_{n-1,m}$ cases that end with $1$.

There are $T_{n,m-2}$ cases that end with $00$.

There are $T_{n-2,m-1}$ cases that end with $110$.

So $$T_{m,n} = T_{n-1,m}+T_{n,m-2} + T_{n-2,m-1}$$

From there, it is going to take some work to get a closed formula. You can probably use generating functions to get one. Letting:

$$f(x,y)=\sum_{m,n} T_{n,m}x^ny^m$$

The recurrence should give us something simple for:

$$f(x,y)(1-x-y^2-x^2y)=\sum_{m,n} a_{n,m}x^ny^m$$

The recurrense means that if $m,n>1$, then $a_{n,m}=0$.

My brain is having trouble getting the rest of this.

6
On

There are a total of $\frac{(m+n)!}{m!n!}$ strings of binary sequences with $m$ ones and $n$ zeros. There are $m+n-k*(S_m-S_n)+1$ spots where you could have $k$ strings $S$, where $S_n$ is the number of zeros in the string and $S_m$ is the number of ones in the string. The full formula is then $$\frac{(m+n)!}{m!n!}-\sum_{k=1}^G \frac{(m-kS_m+n-kS_n+k)!}{(m-kS_m)!(n-kS_n)!k!}$$ where $G=\left\lfloor\frac{m+n}{S_m+S_n}\right\rfloor$.