Modeling the probability of $k$-mer collisions between DNA sequences

59 Views Asked by At

Let's imagine that I have a DNA sequence of known origin. Such a sequence can simply be thought of as a string of characters $(A|C|G|T)^l$ where $l$ is the length of the sequence. For purposes of this problem, we can also convert any sequence into a set of distinct $k$-mers where $k$ is the length of a substring (e.g. 'ATGGT' is a $5$-mer). Such a set could be defined as $\{x\in (A|C|G|T)^k |~x$ is a substring in the sequence $\}$

Now, imagine that we generate a completely random string of characters $(A|C|G|T)^m$ where $m$ is the length of this random sequence and $m<<l$. For simplicity, we can assume that the chance of each character is equal. Once again, I can convert this random string of characters into a set of distinct $k$-mers with the same $k$ as before.

Finally, to the problem:

I'd like to model the probability that I observe a particular number of shared $k$-mers or more between the random sequence and the known one.

For example, let's say that I am using $k=20$ and have a known sequence with $4,500,000$ distinct $20$-mers. Next, I generate a random sequence of $m=500$. Such a sequence has at most $500-20+1=481$ distinct $20$-mers (although it could be less, let's assume that it actually has 481) and I observe that $10$ of these $481$ $20$-mers are in the known sequence. What is the probability that $10$ or more $k$-mers are shared with the known sequence?

My original intuition:

Much like a classic experiment to test for the fairness of a coin, this could be modeled with a Binomial distribution. The known sequence has $4,500,000$ out of the total $4^{20}$ possible $20$-mers. This would give us our value of $p$ ($p=\frac{4500000}{4^{20}}$). The "number of trials" $n$ in this case is the number of distinct $20$-mers in the random sequence. Therefore, this could be modeled with $B(n=481, p\approx 0.00000409272)$. The probability of observing $10$ or more collisions is more or less obvious after that point.

The issue:

For 1 extremely obvious reason I can think of, I cannot assume that the trials within my Binomial model are independent:

  • Logically, each distinct $k$-mer overlaps substantially with $k$-mers on either side of it within the string. (e.g. If 'ATGGT' is within either sequence 'TGGTX' where $X \in \{A, C, G, T\}$ must be there as well, assuming that the first $5$-mer was not at the end of my sequence)

This leads me at last to my actual question: Is there a better way of modeling this problem statistically? (E.g a better distribution to use or a different way entirely)

1

There are 1 best solutions below

1
On

I expect that the approach you outline will be a perfectly reasonable one, and I doubt that you'll need a "better" approximation. Independence does not actually hold, but in my experience the effect of the deviation from independence does not affect the final answer too much (except in particularly extreme corner cases).

The heuristic, which is likely to work well enough in practice in nearly all cases, is to assume independence. In other words, treat "is the $i$th $k$-mer of the long sequence equal to the $j$th $k$-mer of the short sequence?" as iid Bernoulli random variables. In other words, the probability of a collision (match) for $i,j$ is independent of a match for $i',j'$, for all other $i',j'$.

With this heuristic, we can compute that the probability of a collision in positions $i,j$ is $1/4^k$. Therefore, under this heuristic, the number of collisions overall is Bernoulli$(n,p)$ where $n$ is the number of such position-pairs and $p=1/4^k$. In your specific example, there are $n= 4.5 \cdot 10^6 \times 481$ such position-pairs, so $n \approx 2.16 \cdot 10^9$. Now if $X$ is such a Bernoulli r.v., then

$$\Pr[X \ge 10] = \sum_{i=10}^{n} {n \choose i} p^i (1-p)^{n-i}.$$

For the particular parameters you have chosen, this is well-approximated by

$$\Pr[X \ge 10] \approx {n \choose 10} p^{10} (1-p)^{n-10} \approx 6.22 \cdot 10^{86} \times 4^{-200} \times 0.998 \approx 0.$$

You should be able to generalize to any other number you care about.

This is very similar to the approach you used, or will give very similar results.

Some caveats: The above heuristic assumes that repeated $k$-mers in the long sequence are rare (or, at least, similar to what would be observed if the long sequence were chosen randomly). If that assumption is not valid, then the heuristic might not apply. Also, in the above, what matters is not the number of distinct $k$-mers in the long sequence, but the total number of $k$-mers in the long sequence (basically, the length of the long sequence, minus $k-1$).