This is a problem from a bioinformatics class, and I believe it shouldn't be too difficult probability-wise, but I'm a novice in this area. I think I have roughly what I need, but I'm very unsure.
I have a table of species' nucleotide frequencies and a DNA sample ACCGGAGCTC, and I need to calculate some probabilities regarding how likely it is that my sample belongs to them.
Species | %A | %C | %G | %T
--------|----|----|----|----
A | 25 | 25 | 25 | 25
B | 20 | 30 | 30 | 20
C | 30 | 20 | 20 | 30
In addition, I know the relative abundances of the three species:
Species | % abundance
--------|------------
A | 50
B | 25
C | 25
Let $x = [2, 4, 3, 1]$ be the vector of counts of A, C, G, and T, respectively, in the sequence given.
Species | %A | %C | %G | %T
--------|----|----|----|---
sample | 20 | 40 | 30 | 10
With $X$ as the random variable representing the vector of base counts within a sequence fragment and $Y$ as the random variable representing the species, I need to compute the probability of $x$ given that it's species A, given that it's species B, given that it's species C, and in general. I also need to calculate the probability that it's species B given the sample (which I imagine will be trivial once I have the rest).
A
$$P(X = x\ |\ Y = A) = \frac{P(A\ |\ x)\cdot P(x)}{P(A)} = 0.0120163\cdot\frac{12600\cdot0.25^2\cdot0.25^4\cdot0.25^3\cdot0.25^1}{0.50} = 0.0001443914223$$
B
$$P(X = x\ |\ Y = B) = \frac{P(B\ |\ x)\cdot P(x)}{P(B)} = 0.0120163\cdot\frac{12600\cdot0.20^2\cdot0.30^4\cdot0.30^3\cdot0.20^1}{0.25} = 0.0002648988528$$
C
$$P(X = x\ |\ Y = C) = \frac{P(C\ |\ x)\cdot P(x)}{P(C)} = 0.0120163\cdot\frac{12600\cdot0.30^2\cdot0.20^4\cdot0.20^3\cdot0.30^1}{0.25} = 0.0000523256993$$
D -- The average frequency of each nucleotide (A, C, G, T) across the species turns out to be 0.25 here, so
$$P(X = x) = \frac{10!}{2!\cdot4!\cdot3!\cdot1!}\cdot0.25^{10} = 0.0120163$$
E
$$P(Y = B\ |\ X = x) = \frac{P(X = x\ |\ Y = B)\cdot P(Y = B)}{P(X = x)} = \frac{0.0002648988528\cdot 0.25}{0.0120163} = 0.005511239999$$
You have been told the proportions of the four bases in each species ($25\%$ for each base in species $\rm A$). Modelling the count of bases in a sample from a species as a multinomially distributed random vector, the conditional probability that the data (a count of $[2,4,3,1]$) would be generated from species $A$ is approximately: $$\mathsf P(X{=}x\mid Y{=}A)=\binom{10}{2,4,3,1} (0.25)^2 (0.25)^4 (0.25)^3 (0.25)^1$$
That is all, and similarly for the others. Bayes' Rule is not used yet.
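As a rough numerical check, the three conditional probabilities can be sketched in Python under the multinomial model described above. The helper `multinomial_pmf` and the dictionary `freqs` are illustrative names introduced here, not part of the original problem; the frequencies are copied from the question's table.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Multinomial probability of `counts` given cell probabilities `probs`."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)  # each step divides exactly
    return coeff * prod(p ** c for p, c in zip(probs, counts))

# Counts of A, C, G, T in the sample ACCGGAGCTC
sample = "ACCGGAGCTC"
x = [sample.count(base) for base in "ACGT"]  # [2, 4, 3, 1]

# Base frequencies (A, C, G, T) per species, from the question's table
freqs = {
    "A": [0.25, 0.25, 0.25, 0.25],
    "B": [0.20, 0.30, 0.30, 0.20],
    "C": [0.30, 0.20, 0.20, 0.30],
}

# P(X = x | Y = s) for each species s
likelihoods = {s: multinomial_pmf(x, p) for s, p in freqs.items()}
for s, value in likelihoods.items():
    print(f"P(X=x | Y={s}) = {value:.6f}")
```

Under these assumptions the values come out to roughly 0.0120 for species A, 0.0220 for B, and 0.0044 for C.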
Tip: Don't worry about evaluating the multinomial coefficients until the final answer.
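For $\mathsf P(X{=}x)$: no, you cannot average the base frequencies across species and plug them into a single multinomial. All bases in the sample are selected from the same species. The data are only (approximately) multinomially distributed for a given species.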
Use the Law of Total Probability. $$\mathsf P(X{=}x)=\mathsf P(X{=}x\mid Y{=}A)\mathsf P(Y{=}A)+\mathsf P(X{=}x\mid Y{=}B)\mathsf P(Y{=}B)+\mathsf P(X{=}x\mid Y{=}C)\mathsf P(Y{=}C)$$
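The total-probability sum can be evaluated with a short Python sketch. It assumes the multinomial model and the abundances from the question as priors; `freqs`, `prior`, and `likelihood` are illustrative names chosen here.

```python
from math import prod

x = [2, 4, 3, 1]   # counts of A, C, G, T in the sample
coeff = 12600      # multinomial coefficient 10! / (2! 4! 3! 1!)

# Base frequencies (A, C, G, T) and species abundances from the question
freqs = {
    "A": [0.25, 0.25, 0.25, 0.25],
    "B": [0.20, 0.30, 0.30, 0.20],
    "C": [0.30, 0.20, 0.20, 0.30],
}
prior = {"A": 0.50, "B": 0.25, "C": 0.25}

def likelihood(probs):
    """P(X = x | Y = species) under the multinomial model."""
    return coeff * prod(p ** c for p, c in zip(probs, x))

# Law of Total Probability: weight each likelihood by the species abundance
p_x = sum(likelihood(freqs[s]) * prior[s] for s in prior)
print(f"P(X=x) = {p_x:.6f}")
```

This gives about 0.0126, which is close to but not the same as the 0.0120 obtained by averaging the frequencies first.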
Now use Bayes' Rule: $$\mathsf P(Y{=}B\mid X{=}x)=\dfrac{\mathsf P(X{=}x\mid Y{=}B)\mathsf P(Y{=}B)}{\mathsf P(X{=}x)}$$
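A sketch of this last step in Python, again assuming the multinomial model and using the abundances as priors (all variable names are illustrative). The multinomial coefficient is the same for every species, so it cancels in the ratio and is left out here.

```python
from math import prod

x = [2, 4, 3, 1]   # counts of A, C, G, T in the sample

# Base frequencies (A, C, G, T) and species abundances from the question
freqs = {
    "A": [0.25, 0.25, 0.25, 0.25],
    "B": [0.20, 0.30, 0.30, 0.20],
    "C": [0.30, 0.20, 0.20, 0.30],
}
prior = {"A": 0.50, "B": 0.25, "C": 0.25}

# Unnormalised posterior weights: prior times the product of p_i ** x_i.
# The shared multinomial coefficient is omitted because it cancels below.
weights = {
    s: prior[s] * prod(p ** c for p, c in zip(freqs[s], x))
    for s in prior
}

# Bayes' Rule: normalise so the posterior probabilities sum to 1
total = sum(weights.values())
posterior = {s: w / total for s, w in weights.items()}

for s, value in posterior.items():
    print(f"P(Y={s} | X=x) = {value:.4f}")
```

With these inputs the posterior probability of species B is about 0.437.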