How to measure local density of A's in a sequence of A's and B's?


Let me preface this by stating that I am not a mathmagician. I have a DNA sequencing problem that boils down to a math problem, but I don't have the math skills to describe it adequately, which makes it hard to google.

Let's assume we have a random DNA sequence of length $N$. We are interested in the A's, but there are also G's, C's, and T's. Let's simplify it to A or Not A, where Not A = B. There are then $2^N$ possible sequences of length $N$, so a sequence of 5 bases has $2^5 = 32$ possibilities. I want a method of scoring those possible sequences according to the number of A's and their spacing in the sequence. I want this score to estimate how well that sequence will bind to a sequence of all T's.

The 32 Possible Sequences for a 5-mer
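For reference, here is a minimal sketch of how that list could be generated. The helper names `to_ab` and `all_sequences` are made up for illustration, and the sketch assumes the input contains only the letters A, C, G, and T.

```python
from itertools import product


def to_ab(dna):
    """Collapse a DNA string to the two-letter alphabet: A stays A, anything else becomes B."""
    return "".join("A" if base == "A" else "B" for base in dna.upper())


def all_sequences(n, alphabet="AB"):
    """Return every sequence of length n over the alphabet (2**n of them for 'AB')."""
    return ["".join(letters) for letters in product(alphabet, repeat=n)]


five_mers = all_sequences(5)
print(len(five_mers))    # number of 5-mers over {A, B}
print(to_ab("GATTACA"))  # collapsed form of an example DNA string
```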

The number of A's in the sequence is easy. More A's should give a higher score. But spacing is harder to quantify. A segment of consecutive A's should score higher than separated A's. A sequence of separated A's with small gaps should score higher than separated A's with large gaps.

Some Examples:

  • AAAAA - Highest possible score
  • BBBBB - Lowest possible score
  • ABBBB, BABBB, BBABB, BBBAB, and BBBBA should have the same score.
  • AAAAB should score higher than AAABA, which should score higher than AABAA, though all are 80% A.
  • BAAAB should score higher than ABABA, though both are 60% A.
  • BABAB should score higher than ABBBA, though both are 40% A.
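To make these requirements concrete, here is a rough sketch of three candidate scores. None of them is an established binding-affinity measure. In particular, the $1/d$ weighting in `pair_proximity` (each pair of A's that are $d$ positions apart contributes $1/d$) is an arbitrary assumption chosen only because it decreases with distance; any other decreasing weight could be swapped in.

```python
def count_a(seq):
    """Number of A's in the sequence."""
    return seq.count("A")


def longest_run(seq):
    """Length of the longest block of consecutive A's."""
    best = current = 0
    for base in seq:
        current = current + 1 if base == "A" else 0
        best = max(best, current)
    return best


def pair_proximity(seq):
    """Count of A's plus, for every pair of A's, 1/(distance between them).

    The 1/d weight is an illustrative assumption, not a physical model.
    """
    positions = [i for i, base in enumerate(seq) if base == "A"]
    pair_sum = sum(
        1.0 / (q - p)
        for k, p in enumerate(positions)
        for q in positions[k + 1:]
    )
    return len(positions) + pair_sum


for s in ["AAAAA", "AAAAB", "AAABA", "AABAA", "BAAAB", "ABABA", "BABAB", "ABBBA", "BBBBB"]:
    print(s, count_a(s), longest_run(s), round(pair_proximity(s), 3))
```

With this weighting, `pair_proximity` reproduces every ordering in the list above, and all sequences with a single A get the same score because they contain no pairs. `longest_run` on its own cannot separate BABAB from ABBBA (both have a longest run of 1), so it would need to be combined with another score.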

I don't need a single metric either; I can calculate multiple scores and see which correlates best with binding affinity.
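A minimal sketch of that comparison step, assuming the measured affinities are available as a list in the same order as the sequences. The function names `pearson` and `rank_metrics` are hypothetical, and plain Pearson correlation is just one option; a rank-based correlation may suit the data better.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant input; reported as 0 here
    return cov / sqrt(var_x * var_y)


def rank_metrics(scores_by_metric, affinities):
    """Sort metric names by |correlation| with the measured affinities, strongest first.

    scores_by_metric maps a metric name to its list of scores, one per sequence.
    """
    results = {name: pearson(vals, affinities) for name, vals in scores_by_metric.items()}
    return sorted(results.items(), key=lambda item: abs(item[1]), reverse=True)
```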