Combinatorics with multiple design rules (e.g. no more than X instances, no more than X contiguous instances, etc.)

238 Views Asked by At

I am struggling with a biochemistry question that when broken down is just a mathematics combinatorics problem. Hopefully this type of question is allowed on this Stack Exchange (apologies if it is not!).

I am trying to figure out every possible combination of a sequence of characters (chemical modifications on nucleotides) given a set of design constraints (to minimize toxicity and stability). To remove the requirement of domain expertise, I will simplify the problem to a combination of a characters in a string.

The string sequence must have 23 character combinations of [A, B, C, D, E]. The order of the characters in the sequence matter. The distribution of each character is not important, but the sequence must follow all the rules below:

  • There can be at most 6 instances of A anywhere in the sequence.
  • There can be at most 13 instances of A or B anywhere in the sequence.
  • There cannot be more than 2 consecutive instances of C (e.g. ..ACCD.. is allowed, while ..ACCCD.. is not allowed) -- this rule only applies to C.

How many combinations can be designed?

I know without any design rules, there are 5^23 combinations. I think I can figure out the number of combinations given one rule by subtracting all sequences of that rule from 5^23. However, I am stumped when there are multiple rules in play. I suspect I will have to make a Venn diagram and calculate all the overlaps, but I wonder if there is a different way to tackle this problem? I feel like this can be solved exactly, but if the problem is too specific (or complex to solve), is there at least a way I can make a rough estimate with some logical justification? I appreciate any input on this problem!

2

There are 2 best solutions below

3
On BEST ANSWER

You can count the number of valid sequences exactly, with the help of a computer.

For each integer $k\in \{1,2,\dots,16\}$, we will find the number of sequences which have exactly $k$ C's, then sum over $k$. Our strategy is to first place the C's, then to place the other letters.

To place the C's, we need to choose a subset of $\{1,\dots,23\}$ with $k$ elements which has no three consecutive elements. In general, let $f(n,k)$ be the number of ways to choose a subset of $\{1,\dots,n\}$ with $k$ elements, no $3$ consecutive. You can prove that $$ f(n,k)=f(n-1,k)+f(n-2,k-1)+f(n-3,k-2) \qquad (n\ge 3,k\ge 2) $$ This recurrence allows you to compute $f(23,k)$ with a computer, using dynamic programming.

Once the C's are placed, there are $23-k$ slots remaining, to be filled with A's, B's, D's, and E's. If the number of A's is $i$, then the A's can be placed in $\binom{23-k}i$ ways. Then, if the number of B's is $j$, the B's can be placed in the remaining spots in $\binom{23-k-i}{j}$ ways. Finally, the remaining $23-k-i-j$ spots can be filled with D and E in $2^{23-k-i-j}$ ways. You then need to sum over all possible $i$ and $j$ which satisfy $0\le i\le 6$ and $0\le i+j\le 13$.

Putting this all together, the number of valid sequences is $$ \sum_{k=0}^{16}\sum_{i=0}^6\sum_{j=0}^{13-i}f(23,k)\binom {23-k}i\binom{23-k-i}j2^{23-k-i-j} $$ I found that there were about $8.47\times 10^{15}$ valid sequences, which is about $71\%$ of the total number, using this Python code

4
On

We consider a $5$-ary alphabet built from letters $\{A,B,C,D,E\}$. Words which do not have any consecutive equal letters are called Smirnov words. A generating function for Smirnov words is given as \begin{align*} \color{blue}{\left(1-\frac{5z}{1+5z}\right)^{-1}}\tag{1} \end{align*} The coefficient $[z^n]$ of $z^n$ in the series (1) gives the number of $5$-ary words of length $n$ which do not have any consecutive letters.

In the current example we are looking for words which do not contain a bad word from $\{CCC\}$ and where the number of occurrences of $A$ is at most $6$ and the number of occurrences with $A$ together with $B$ is at most $13$. The letters $D$ and $E$ do not have any restriction at all. We use (1) as basis and perform following substitutions:

\begin{align*} z&\quad\to\quad sz+s^2z^2+\cdots=\color{blue}{\frac{sz}{1-sz}}\tag{$\to\ A$}\\ z&\quad\to\quad tz+t^2z^2+\cdots=\color{blue}{\frac{tz}{1-tz}}\tag{$\to\ B$}\\ z&\quad\to\quad z+z^2=\color{blue}{z(1+z)}\tag{$\to\ C$}\\ z&\quad\to\quad z+z^2+\cdots=\color{blue}{\frac{z}{1-z}}\tag{$\to\ D, E$}\\ \end{align*}

  • The number of occurrences of $A$ is allowed to be at most $6$. We mark each occurrence of $A$ with $s$ to respect this in the calculation.

  • We also want no more than $13$ occurrences of $A$ and $B$. Therefore we mark occurrences of $B$ with $t$.

  • We substitute occurrences of $C$ by $z+z^2$ avoiding this way runs of $C$ of length $3$.

  • Since there are no restrictions for $D$'s and $E$'s, we substitute $z$ with $z+z^2+\cdots=\frac{1}{1-z}$ for these two cases.

We transform (1) performing the substitutions given above and obtain with some help of Wolfram Alpha \begin{align*} \color{blue}{A(z;s,t)}&=\left(1-\frac{\frac{sz}{1-sz}}{1+\frac{sz}{1-sz}}-\frac{\frac{tz}{1-tz}}{1+\frac{tz}{1-tz}}\right.\\ &\qquad\qquad\left.-\frac{z(1+z)}{1+z(1+z)}-\frac{2\frac{z}{1-z}}{1+\frac{z}{1-z}}\right)^{-1}\\ &\,\,\color{blue}{=\left(1-(s+t)z-\frac{3z+3z^2+2z^3}{1+z+z^2}\right)^{-1}}\tag{2} \end{align*} The wanted number is \begin{align*} \color{blue}{\sum_{k=0}^{6}\sum_{j=0}^{13-k}[z^{23}s^kt^j]A(z;s,t)} &\color{blue}{= 8\,472\,450\,413\,729\,761}\\ &\color{blue}{\doteq 8.472\cdot 10^{15}}\tag{3} \end{align*} in accordance with the result of @MikeEarnest. The calculation was done thanks to the support of @MikeEarnest with this Mathematica code.

Note: Smirnow words can be found for instance in example III.24 in Analytic Combinatorics by P. Flajolet and R. Sedgewick.