Expectation and variance of occurrences of patterns in a random text

604 Views Asked by At

I'm not sure how to complete the following problem from the book An Introduction to Bioinformatics Algorithms:

Problem 9.1

Derive the expectation and the variance of the number of occurrences of patterns AA and AT in a random text of length n. Are the expectations and variances for AA and AT the same?

For AA, I think the expectation would be $\frac{n-1}{16}$ ($n-1$ possible 2-base positions times the probability of AA, $\frac{1}{16}$). But for AT, I'm not sure because this sequence cannot overlap (i.e. it can occurs at most $\frac{n}{2}$ times).

Also, I don't know how to calculate the variance.

Can anyone help me?

1

There are 1 best solutions below

1
On BEST ANSWER

For every $xy$ and every $k$, let $U_k(xy)$ denote the indicator of the event that the portion $\{k,k+1\}$ of the sequence is $xy$. One is interested in $$ N=\sum_{k=1}^{n-1}U_k(AA),\qquad M=\sum_{k=1}^{n-1}U_k(AT). $$ Obviously, $E[U_k(xy)]=\frac1{16}$ for every $k$ and $xy$, hence $E[N]=E[M]=\frac1{16}(n-1)$.

To compute the variances, start with $N^2$, say, which is $$ N^2=\sum_{k=1}^{n-1}U_k(AA)+2\sum_{k=1}^{n-2}U_k(AA)U_{k+1}(AA)+2\sum_{k=1}^{n-3}\sum_{\ell=k+2}^{n-1}U_k(AA)U_\ell(AA). $$ Surely you can compute the expectation of each term, sum them, and substract the square of the expectation. This will yield $\mathrm{var}(N)$ (which might be something like $\frac1{16}(5n-7)$).

Likewise for $M$, with the same technique but a different end result because $U_k(AT)U_{k+1}(AT)$ behaves differently than $U_k(AA)U_{k+1}(AA)$. This will yield $\mathrm{var}(M)$ (which might be something like $\frac3{16}(n-1)$).