Biology: How to find the probability of randomly generating multiple, sequentially identical sets

1k Views Asked by At

If I randomly generate a substring (example "ATGCAGC") with equal probability (1/X where X=4) for each digit with length (L) digits: What is the formula for finding the probability (P) of randomly generating that sequence (T) times, given a total string length (N)?

Example: Given "ATGCAGC" string length L=7, number of possible characters X=4 with equal probability of being randomly generated 1/X.

In a case where N characters are generated, what is the probability that an exact substring with length L will occur T times?

If I have randomly, sequentially generated N=7000 characters, what is the probability that any exact substring length L=7 "ATGCAGC" will occur T=2,3,4... times?

P is my dependent variable. L, T, N, X are independent.

In terms of dice:

Example: If I sequentially roll a X=6 sided die N=7000 times: What is the P=probability I will roll the die sequentially the same (1,4,6,5,3,2,3) with sequence length L=7 for T=2 sequentially identical occurrences in the N=7000 sequential rolls of a single die?

What is the probability in 7000 rolls I will have any 2 runs of 7 throws that have an exact sequential match? Example: (1,4,6,5,3,2,3 on rolls 201-207) and (1,4,6,5,3,2,3) on rolls 5001-5007. It could be any number of (T) occurrences, on any roll numbers in (N) total die rolls.

I am specifically solving for the probability, given any values for the independent variables. Overlapping or non-overlapping substrings or both are great.

My question is related to (How many times will a consecutive sequence of throws randomly appear if I throw a four-sided die N times?)

2

There are 2 best solutions below

3
On BEST ANSWER

The question of a probability of a pattern repeating exactly $T$ times does not have a simple formula. I have obtained the probability for the pattern ``ATGCAGC''

A directed graph was used to keep track of the patterns. For $T=1$ itself it's big:

\begin{align*} A_1 &= \left(\begin{array}{rrrrrrrrrrrrrr} 3 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 2 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 2 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 2 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 3 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 2 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 0 & 0 & 0 & 0 & 0 \end{array}\right) \end{align*}

for $T=2$,

\begin{align*} A_2 &= \left(\begin{array}{rrrrrrrrrrrrrrrrrrrrr} 3 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 2 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 2 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 2 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 3 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 2 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 3 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 0 & 0 & 0 & 0 & 0 \end{array}\right) \end{align*}

That pattern of matrix continues for $T=3,4,\ldots$, and the entries of interest are the last 7 entries of the first row in $A^{7000}$. Adding those and dividing by $4^{7000}$ gives the required probabilities. Below are the probabilities obtained on computing:

\begin{align*} \begin{array}{|l|c|}\hline T & P \\\hline 1 & 0.278730925452149 \\ 2 & 0.059428898295207 \\ 3 & 0.008431600808656 \\ 4 & 0.000895516555326\\ \hline \end{array} \end{align*}

0
On

Very Crude method:

If there are X characters, there can be $X^L$ different number of ways you can generate a sequence with each X have $\frac{1}{X}$ probability. Then each such sequence is equally likely.

Thus the probability that a sequence would appear out of the $X^L$ different ways is $\frac{1}{x^L}$.

Then the total number of characters N that are generated will be split into $[\frac{N}{L}]$ blocks of those substrings.

The probability that a substring other than the desired happens in any block is $(1-\frac{1}{X^L})$.

Now define the the random variable T, the number of times such substring should occur where t = 0,1,... [N/L].

Take a case when T = 2, this substring should occur twice and the rest of the $[\frac{N}{L}] - T$ blocks should be other substrings of length L other than the desired for which we calculated the probability a couple of steps above. For N = 100 and L = 5, you will have 20 blocks with 5 strings each. You check the desired substring in each of these blocks. Then shift 1 character and check from 2-6,7-11,... and again shift 1 character from 3-8, 9-12,.... This you do it till you shift 5 times then you get to the original sequence 6-10,11-15,... Now you have in a way scanned all the N characters in 20 blocks five times to cover all consecutive appearances. Once way you could accomodate is to multiply (the probability of finding substring) by L which in this case is 5.

Now we can safely assume these appearances of substrings to be a Binomial Distribution.

Thus

$$P( T = t) =\left({[\frac{N}{L}]\choose t} {(L*\frac{1}{x^L})}^t {(1-L*\frac{1}{X^L})}^{[\frac{N}{L}]-t}\right)$$

I hope this gives a real good approximation for the task that you have laid out.

Thanks

Satish