Total number of DNA string alignments with spaces

339 Views Asked by At

I'm trying to create formula for the approximation of the total number of alignments between two DNA strings before beginning experimentation. I couldn't find literature that derives a formula to calculate all possible alignments including spaces for strings of unequal length.

Suppose I'm trying to align two strings: AAAAA, CCCCC

In the most extreme case, I can have an alignment of

 AAAAA-----
 -----CCCCC

I do not allow for aligned spaces. I do not have to have the maximum number of spaces in any individual alignment. Some other valid alignment cases:

 -AAAAA   --AAA---A   A-----AAAA    AA-AAA
 C-CCCC   CC---CCC-   -CCCCC----    C-CCCC

From what I could find about equal-length strings, for string of length n I can have at most r=n spaces, and n+r CHOOSE r ways to arrange the spaces in that string. The second string will have at most n CHOOSE r alignments as its spaces cannot be aligned to spaces on the first string. So the total number of alignments would be: $\sum\limits_{r=0}^{n}{n+r \choose r}{n \choose r}$.

Is it possible to derive a similar equation for strings of differing length? On the shorter string of length $m$ there would be a minimum of $n-m$ spaces to make up for the insufficient characters.

1

There are 1 best solutions below

0
On BEST ANSWER

One straightforward way to count these is to organize by the number of positions where both strings are non-spaces, call it $k$. The minimum value of $k$ is $0$ and the maximum is $m$ (retaining the assumption from the OP that $m \le n$). The total length of the alignment will be $m + n - k$. From these we can choose the $m$ positions where the second string has a non-space: there are $\binom{m+n-k}{m}$ ways to do this, and this fully determines the $n-k$ places where only the first string has a non-space.

Within the $m$ positions we selected, $k$ of them have two non-spaces and $m-k$ have a space in the first string. There are $\binom{m}{k}$ ways to arrange these, so the total number of alignments is:

$$\sum_{k=0}^m \binom{m+n-k}{m} \binom{m}{k}$$

(if you're familiar with multinomial coefficients, the summand can also be written as $\binom{n+m-k}{k,n-k,m-k}$).

This agrees with your analysis when $m=n$, making the simple observation that in that case $k = n-r$ so that $m+n-k = n + r$ (and of course using the horizontal symmetry of Pascal's triangle).