I am struggling with a biochemistry question that when broken down is just a mathematics combinatorics problem. Hopefully this type of question is allowed on this Stack Exchange (apologies if it is not!).
I am trying to figure out every possible combination of a sequence of characters (chemical modifications on nucleotides) given a set of design constraints (to minimize toxicity and stability). To remove the requirement of domain expertise, I will simplify the problem to a combination of a characters in a string.
The string sequence must have 23 character combinations of [A, B, C, D, E]. The order of the characters in the sequence matter. The distribution of each character is not important, but the sequence must follow all the rules below:
- There can be at most 6 instances of
Aanywhere in the sequence. - There can be at most 13 instances of
AorBanywhere in the sequence. - There cannot be more than 2 consecutive instances of
C(e.g...ACCD..is allowed, while..ACCCD..is not allowed) -- this rule only applies toC.
How many combinations can be designed?
I know without any design rules, there are 5^23 combinations. I think I can figure out the number of combinations given one rule by subtracting all sequences of that rule from 5^23. However, I am stumped when there are multiple rules in play. I suspect I will have to make a Venn diagram and calculate all the overlaps, but I wonder if there is a different way to tackle this problem? I feel like this can be solved exactly, but if the problem is too specific (or complex to solve), is there at least a way I can make a rough estimate with some logical justification? I appreciate any input on this problem!
You can count the number of valid sequences exactly, with the help of a computer.
For each integer $k\in \{1,2,\dots,16\}$, we will find the number of sequences which have exactly $k$
C's, then sum over $k$. Our strategy is to first place theC's, then to place the other letters.To place the
C's, we need to choose a subset of $\{1,\dots,23\}$ with $k$ elements which has no three consecutive elements. In general, let $f(n,k)$ be the number of ways to choose a subset of $\{1,\dots,n\}$ with $k$ elements, no $3$ consecutive. You can prove that $$ f(n,k)=f(n-1,k)+f(n-2,k-1)+f(n-3,k-2) \qquad (n\ge 3,k\ge 2) $$ This recurrence allows you to compute $f(23,k)$ with a computer, using dynamic programming.Once the
C's are placed, there are $23-k$ slots remaining, to be filled withA's,B's,D's, andE's. If the number ofA's is $i$, then theA's can be placed in $\binom{23-k}i$ ways. Then, if the number ofB's is $j$, theB's can be placed in the remaining spots in $\binom{23-k-i}{j}$ ways. Finally, the remaining $23-k-i-j$ spots can be filled withDandEin $2^{23-k-i-j}$ ways. You then need to sum over all possible $i$ and $j$ which satisfy $0\le i\le 6$ and $0\le i+j\le 13$.Putting this all together, the number of valid sequences is $$ \sum_{k=0}^{16}\sum_{i=0}^6\sum_{j=0}^{13-i}f(23,k)\binom {23-k}i\binom{23-k-i}j2^{23-k-i-j} $$ I found that there were about $8.47\times 10^{15}$ valid sequences, which is about $71\%$ of the total number, using this Python code