Simple Statistics/Probability Problem

I have used a python script to identify target sequences in a DNA sequence file.

There are two classes of sequence: coding and non-coding. I have identified $728$ sequences of interest. $597$ of these fall into the coding regions and $131$ fall into the non-coding regions. That is $18\%$ non-coding, whereas the non-coding regions make up only $13\%$ of the sequence file.

Is there a statistical tool to demonstrate that the Python script identified target sequences in a non-random way?

If the script identified sequences that were randomly distributed, then about $13\%$ of them would have been found in the non-coding region. With a total of $728$ sequences, it seems that a comparison against that figure should be reliable.

I hope my question is clear.

BEST ANSWER

Your null hypothesis is $H_0: p = 0.13$ against the alternative $H_a: p \ne 0.13,$ where $p = P(\text{Non-coding}).$ You observe $X = 131$ non-coding sequences among $n = 728$ observed, which gives $\hat p = 131/728 \approx 0.180$ as the observed proportion. (The software output in this answer was run with $N = 723,$ which is why it reports $\hat p = 0.1812$; the discrepancy is too small to affect the conclusion.) Because the observed proportion is substantially different from $p = 0.13,$ you wonder whether this might have been an 'unlucky' draw, or whether you have statistically significant evidence that the script does not pick sequences at random.

This is called a "one-sample binomial test". Often this test is done by using a normal approximation to the binomial distribution. You can find that method in elementary statistics textbooks. The output below from Minitab statistical software uses the binomial distribution to give an exact P-value. [SciPy also appears to implement a version of this test, but I have not tried it.]
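For reference, the textbook normal-approximation version of the test can be sketched in a few lines of Python. This is only an illustration, not the Minitab computation: the function name `one_sample_z_test` is made up for this example, and the counts are the ones from the question ($X = 131,$ $n = 728$).

```python
import math

def one_sample_z_test(x, n, p0):
    """Two-sided normal-approximation test of H0: p = p0 (hypothetical helper)."""
    p_hat = x / n
    se = math.sqrt(p0 * (1 - p0) / n)            # standard error under H0
    z = (p_hat - p0) / se                        # standardized test statistic
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return p_hat, z, p_value

p_hat, z, p_value = one_sample_z_test(131, 728, 0.13)
print(f"p_hat = {p_hat:.4f}, z = {z:.2f}, two-sided P-value = {p_value:.2g}")
```

The statistic comes out at about four standard errors above $0.13,$ so the approximate P-value is far below 5%.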

If the P-value is less than 5%, one says that the null hypothesis is rejected at the 5% level of significance. Here the P-value is printed as 0.000, which means that the P-value is smaller than 0.0005. So it is extremely unlikely that an unbiased draw would give an observed proportion of non-coding sequences so far from $p = 0.13.$
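To see the size of an exact P-value without Minitab, here is a rough Python sketch that uses only the standard library. It assumes one common definition of the two-sided exact P-value (add up the null probabilities of every outcome that is no more likely than the observed one); Minitab's exact method may be defined differently. The helper names are invented for this example. In SciPy 1.7 or later, `scipy.stats.binomtest(131, 728, p=0.13)` should give a comparable result.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed on the log scale."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def exact_two_sided_p(x, n, p0):
    """Sum the null probabilities of all outcomes no more likely than x."""
    observed = binom_pmf(x, n, p0)
    cutoff = observed * (1 + 1e-7)   # small slack for floating-point ties
    probs = (binom_pmf(k, n, p0) for k in range(n + 1))
    return min(1.0, sum(pr for pr in probs if pr <= cutoff))

exact_p = exact_two_sided_p(131, 728, 0.13)
print(f"exact two-sided P-value = {exact_p:.2g}")
print("prints as 0.000 to three places:", exact_p < 0.0005)
```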

Test and CI for One Proportion 

Test of p = 0.13 vs p ≠ 0.13

                                                    Exact
Sample    X    N  Sample p         95% CI         P-Value
1       131  723  0.181189  (0.153769, 0.211239)    0.000

Another way to interpret the output is that a 95% confidence interval for $p$ is $(0.154, 0.211),$ which contains the sample proportion shown in the output but does not contain $p = 0.13.$ Thus it is difficult to believe that the script picks non-coding sequences at the rate $p = 0.13$ that random selection would give.
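An exact interval of this kind can be approximated in Python as follows. This sketch assumes the Clopper-Pearson construction, which I believe is what "exact" refers to in the output, though that is an assumption. It finds each endpoint by bisection on the binomial tail probability, and the function names are made up for this example.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        total += math.exp(math.lgamma(n + 1) - math.lgamma(i + 1)
                          - math.lgamma(n - i + 1)
                          + i * math.log(p) + (n - i) * math.log(1 - p))
    return min(1.0, total)

def solve_increasing(f, target, lo=1e-12, hi=1 - 1e-12, iters=60):
    """Bisection for f(p) = target, where f is increasing in p."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, conf=0.95):
    """Exact (Clopper-Pearson) interval; assumes 0 < x < n."""
    alpha = 1 - conf
    # Lower endpoint: P(X >= x | p) = alpha / 2 (increasing in p).
    lower = solve_increasing(lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper endpoint: P(X <= x | p) = alpha / 2 (decreasing in p, so negate).
    upper = solve_increasing(lambda p: -binom_cdf(x, n, p), -alpha / 2)
    return lower, upper

lower, upper = clopper_pearson(131, 728)
print(f"95% CI for p: ({lower:.4f}, {upper:.4f})")
```

With $n = 728$ the endpoints shift slightly from the ones in the output, but the interval still lies entirely above $0.13.$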


Note: Yet another approach is to note that quantiles .025 and .975 of the 'null distribution' $\mathsf{Binom}(n = 723, p = 0.13)$ are 77 and 112, respectively. Thus the observed value $X = 131$ falls considerably above the upper 'critical value' of the null distribution for a two-sided test at the 5% level. (Computation in R.)

 qbinom(c(.025,.975), 723, .13)
 [1]  77 112
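For readers working in Python rather than R, the same critical values can be sketched with the standard library alone. The function `binom_quantile` is a made-up name for this example. It returns the smallest $k$ with $P(X \le k) \ge q,$ which is the convention `qbinom` follows, and it uses the same $n = 723$ as the R call so that the results are comparable.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed on the log scale."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def binom_quantile(q, n, p):
    """Smallest k such that P(X <= k) >= q (hypothetical qbinom analogue)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += binom_pmf(k, n, p)
        if cumulative >= q:
            return k
    return n

lower_crit = binom_quantile(0.025, 723, 0.13)
upper_crit = binom_quantile(0.975, 723, 0.13)
print(lower_crit, upper_crit)
print("observed 131 exceeds upper critical value:", 131 > upper_crit)
```

Changing the second argument to 728 gives the critical values for the sample size stated in the question.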