Question:
DNA is composed of two strands carrying a sequence of nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T). Nucleotides along the two strands pair (bond) with one another: A with T and G with C.
A linear DNA fragment has $400000$ nucleotide pairs, and we can assume that the share of A-T pairings equals that of G-C pairings. If a probe is held at one end of the fragment and launched to scan for a particular sequence "GAATTC", how many successful scans do you expect by the time the probe reaches the other end?
My approach:
Our scan sequence has $6$ nucleotides. $4^{6}$ different six-nucleotide sequences are possible. Of these, ONLY one sequence matches our probe. Therefore, for every $4^{6}\times6 = 24576$ nucleotides, we will get one successful scan. Thus, the expected number of successful scans = $\frac{400000}{24576} = 16.27 \approx 16$.
However, the given answer is 97, so I appear to be undercounting by a factor of six. I worked through some more examples, and I always undercount by the number of nucleotides in the recognition sequence. I would appreciate some insight into where I am going wrong.
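A quick way to compare the two numbers is to compute the expected count under two different reading schemes. This is only a sketch: it assumes each position is independently and uniformly one of A, C, G, T, and the function names `expected_sliding` and `expected_blocks` are illustrative, not taken from the source problem.

```python
from fractions import Fraction

def expected_sliding(n, k, alphabet=4):
    """Expected matches when every one of the n - k + 1 start positions is tested."""
    return Fraction(n - k + 1, alphabet ** k)

def expected_blocks(n, k, alphabet=4):
    """Expected matches when the strand is read only as n // k non-overlapping blocks."""
    return Fraction(n // k, alphabet ** k)

n, k = 400_000, 6
print(float(expected_sliding(n, k)))  # roughly 97.7
print(float(expected_blocks(n, k)))   # roughly 16.3
```

The block-based figure reproduces the 16 from my approach, while the sliding figure is close to the given answer of 97. Under these assumptions, the factor of six seems to come from testing only every sixth start position.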
The (proved) source of error:
A comment below asks whether the multiplication by $6$ would not have been an obvious place to look. Let's take a shorter example: a recognition tag that is 2 nucleotides long (say, AG).
$4^{2} = 16$ two-nucleotide sequences are possible, of which one matches!
(AA)(GG)(TT)(CC)(AG)(AT)(AC)(GA)(GT)(GC)(CA)(CG)(CT)(TA)(TG)(TC)
One match every $32$ $(4^{2} \times 2)$ nucleotides on average. [You might rearrange the order of the bracketed pairs, but the probability of a match remains the same.]
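The bracketed reading can be checked by brute force on a strand short enough to enumerate completely. The sketch below assumes a hypothetical strand length of 6 and counts AG matches in two ways over all $4^{6}$ possible strands: at any start position, and only at the start of each bracketed pair (positions 0, 2, 4).

```python
from itertools import product

TAG = "AG"
LENGTH = 6  # hypothetical short strand, small enough to enumerate fully

sliding_total = 0  # matches at any start position
block_total = 0    # matches only at start positions 0, 2, 4
strands = 0
for letters in product("ACGT", repeat=LENGTH):
    strand = "".join(letters)
    strands += 1
    for start in range(LENGTH - len(TAG) + 1):
        if strand[start:start + len(TAG)] == TAG:
            sliding_total += 1
            if start % len(TAG) == 0:
                block_total += 1

print(sliding_total / strands)  # 5/16 = 0.3125
print(block_total / strands)    # 3/16 = 0.1875
```

The bracketed reading sees 3 start positions and the sliding reading sees 5, so the averages are $3/16$ and $5/16$. For a long strand of $n$ nucleotides the corresponding counts are about $n/2$ and $n-1$ start positions, so the ratio approaches 2, the length of the tag.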