I have the following problem from my homework set: The human genome is 3 billion bp in length, if we only consider one strand. The complement strand has the same information content, so for many purposes this is a reasonable and very common simplification. When looking for motifs and short sequence matches, however, we need to consider both strands of DNA, so here the genome is 6 billion bp long. For this problem we will assume that the frequency of all four nucleotides is the same $(p(A)=p(T)=p(C)=p(G)=0.25)$.
A) What is the probability that the nucleotide at a random position i is a C? (1 point)
B) If you have the sequence AATTG (a "5mer"), what is the probability that you will find that sequence starting at the same randomly chosen position i? (2 points)
C) What is the probability of NOT finding this sequence at all in the entire human genome? (5 points)
D) You are given sequences ranging from 5bp to 20bp in length. For each of these sequences, what is the probability of finding that sequence once and only once in the human genome (use the same assumptions about genome composition)? (15 points)
I have been able to find the answers for the first three parts.
A) $.25$
B) $.25^5$
C) $(1-.25^5)^{6,000,000,000}$
I am having difficulty determining the probability of finding a kmer once and only once. I thought I might use the binompdf command on my TI-89, but I don't know if that would be correct. Any help is appreciated, thank you.
You have the right idea using a binomial distribution:
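For each block of five nucleotides, flip a coin that comes up heads (success) with probability $p$. We define success as finding our specific sequence, so $p=0.25^5$. There are $n = 6\cdot10^9-5+1$ possible starting positions, and we require exactly one success in those $n$ flips, so the probability is $$\binom{n}{1}\times p(1-p)^{n-1} = (6\cdot10^9-4)\times p(1-p)^{6\cdot10^9-5}\approx 6\cdot10^9 \times p(1-p)^{6\cdot10^9}\approx 0.$$ This treats the flips as independent, which is only approximately true because neighbouring blocks overlap. For a sequence of length $k$, replace 5 with $k$ throughout.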
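For part D, the same formula can be evaluated for each length from 5 to 20. The following is a minimal Python sketch, not a definitive solution: it assumes the binomial model above (independent positions, equal base frequencies, the 6 billion bp of both strands treated as one sequence), and the names `GENOME_LENGTH` and `prob_exactly_once` are made up for illustration. It works in log space because $(1-p)^{n-1}$ is too small to compute directly for short sequences.

```python
import math

# Both strands, as stated in the problem (assumed to be one linear sequence).
GENOME_LENGTH = 6_000_000_000


def prob_exactly_once(k, genome_length=GENOME_LENGTH):
    """Binomial probability of exactly one match for a given k-mer.

    Assumes independent start positions and p(A)=p(T)=p(C)=p(G)=0.25.
    """
    n = genome_length - k + 1  # number of possible start positions
    p = 0.25 ** k              # chance of a match at one position
    # log of n * p * (1 - p)^(n - 1); log1p keeps precision for tiny p
    log_prob = math.log(n) + math.log(p) + (n - 1) * math.log1p(-p)
    return math.exp(log_prob)  # underflows to 0.0 for short k-mers


for k in range(5, 21):
    print(f"k = {k:2d}: P(exactly once) = {prob_exactly_once(k):.4g}")
```

Under these assumptions the probability is effectively zero for short sequences, peaks at $k=16$ (where $np$ is closest to 1), and falls off again for longer sequences as a match anywhere becomes unlikely.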