Let me preface this by stating that I am not a mathmagician. I have a DNA sequencing problem that boils down to a math problem, but I don't have the math skills to describe it adequately, which makes it hard to google.
Let's assume we have a random DNA sequence of length N. We are interested in the A's, but there are also G, C, and T. Let's simplify it to A or Not A, where Not A = B. There are then $2^N$ possible sequences of length N, so a sequence of 5 bases has 32 possibilities. I want a method of scoring those possible sequences according to the number of A's and their spacing in the sequence. I want this score to estimate how well that sequence will bind to a sequence of all T's.
[Image: the 32 possible sequences for a 5-mer]
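For reference, the same list can be regenerated in a few lines of Python. This is only a sketch of the A/B simplification described above; the function name `all_sequences` is made up for illustration.

```python
from itertools import product

def all_sequences(n, alphabet="AB"):
    """Return every length-n string over the alphabet (2**n strings for 'AB')."""
    return ["".join(p) for p in product(alphabet, repeat=n)]

seqs = all_sequences(5)
print(len(seqs))  # number of 5-mers over {A, B}
for s in seqs:
    print(s)
```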
Counting the A's is the easy part: more A's should give a higher score. Spacing is harder to quantify. A run of consecutive A's should score higher than the same number of A's separated by B's, and separated A's with small gaps should score higher than separated A's with large gaps.
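To make "runs" and "gaps" concrete, here is a rough sketch of how they could be pulled out of a sequence. The helper names are hypothetical, and ignoring leading and trailing B's when measuring gaps is my own assumption, chosen so that a single A scores the same wherever it sits.

```python
from itertools import groupby

def runs(seq):
    """Split a sequence into (symbol, length) runs,
    e.g. 'AABAA' -> [('A', 2), ('B', 1), ('A', 2)]."""
    return [(sym, len(list(grp))) for sym, grp in groupby(seq)]

def a_run_lengths(seq):
    """Lengths of the consecutive stretches of A."""
    return [n for sym, n in runs(seq) if sym == "A"]

def internal_gaps(seq):
    """Lengths of the B stretches that sit between two A's.
    Assumption: leading and trailing B's are not counted as gaps."""
    core = seq.strip("B")
    return [n for sym, n in runs(core) if sym == "B"]

for s in ["AAAAB", "AAABA", "AABAA", "BABAB", "ABBBA"]:
    print(s, a_run_lengths(s), internal_gaps(s))
```

Features such as the longest run, the number of runs, or the total gap length can then be computed from these lists and compared against binding data one at a time.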
Some Examples:
- AAAAA - Highest possible score
- BBBBB - Lowest possible score
- ABBBB, BABBB, BBABB, BBBAB, and BBBBA should have the same score.
- AAAAB should score higher than AAABA, which should score higher than AABAA, though all are 80% A.
- BAAAB should score higher than ABABA, though both are 60% A.
- BABAB should score higher than ABBBA, though both are 40% A.
I don't need a single metric, either. I can calculate multiple scores and see which one correlates best with binding affinity.
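As one possible candidate, and not a validated binding model, here is a sketch of a score that adds the number of A's to a proximity term: every pair of A's contributes the reciprocal of the distance between them. The function name `proximity_score` and the 1/distance weighting are assumptions picked for illustration; the loop at the end only checks the orderings listed in the examples.

```python
def proximity_score(seq, target="A"):
    """Hypothetical score: count of targets, plus 1/distance for every
    pair of targets (adjacent pairs add 1, pairs one base apart add 1/2, ...)."""
    positions = [i for i, base in enumerate(seq) if base == target]
    score = float(len(positions))
    for k, i in enumerate(positions):
        for j in positions[k + 1:]:
            score += 1.0 / (j - i)
    return score

# Orderings taken from the examples: each tuple should be strictly decreasing.
expected_orderings = [
    ("AAAAB", "AAABA", "AABAA"),
    ("BAAAB", "ABABA"),
    ("BABAB", "ABBBA"),
    ("AAAAA", "AAAAB"),
    ("BBBBA", "BBBBB"),
]
for group in expected_orderings:
    scores = [proximity_score(s) for s in group]
    ok = all(a > b for a, b in zip(scores, scores[1:]))
    print(group, [round(x, 3) for x in scores], "ok" if ok else "violated")
```

The 1/distance weighting is arbitrary. An exponential decay, or a cutoff beyond some distance, would be equally easy to try, and the count term can be dropped if composition and spacing should be kept as separate scores.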