Emission probabilities in a hidden Markov model with 2 states and an alphabet of 4 characters


I'm reading through a text that describes how to use hidden Markov models to identify regions of biological sequences that correspond to specific biological features. It starts with a simple example: a DNA sequence that may or may not have an increased concentration of "CG" dinucleotides (also known as "CpG islands").

This site has notes from the text I am using. I'm asking about the section entitled "The Hidden Markov Model used as the model".

I'm confused by the part of the text that talks about emission probabilities. Emission probabilities are defined as: $$e_k(b) = P(x_i=b|\pi_i=k)$$ that is, the emission probability of letter $b$ in state $k$ is the probability that position $i$ of the sequence is $b$, given that the hidden state at position $i$ is $k$. The text says that this will always be 0 or 1. Why?

My intuition is that, since there are 4 possible nucleotides in a sequence, the emission probabilities for each state $k$ should only need to satisfy: $$\sum\limits_{b}{e_k(b)} = 1$$

Can anyone help me understand what I'm misunderstanding here?

Thanks!


Best answer

I think I figured this out:

If you are considering the states as CpG+ or CpG-, then the emission probabilities will not always be 0 or 1.

However, if you are considering them to be A+, C+, G+, T+, A-, C-, G-, or T-, then the probability of emitting an "A" in an "A+" or "A-" state is always 1, and the probability of emitting any other character (C, G, or T) from those states is 0. The probabilities for each state still sum to 1; all of the mass just sits on a single letter.
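To make the difference concrete, here is a minimal Python sketch of both ways of setting up the states. The numbers in the two-state table are made up purely for illustration (they are not from the text), and the helper names `rows_sum_to_one` and `is_deterministic` are my own.

```python
NUCLEOTIDES = "ACGT"

# Two-state model: each state can emit any nucleotide.
# These values are hypothetical, chosen only so that each row sums to 1.
two_state = {
    "CpG+": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "CpG-": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}

# Eight-state model: the state name already fixes the nucleotide,
# so state "A+" (or "A-") emits "A" with probability 1 and everything else with 0.
eight_state = {
    f"{n}{sign}": {b: 1.0 if b == n else 0.0 for b in NUCLEOTIDES}
    for sign in "+-"
    for n in NUCLEOTIDES
}


def rows_sum_to_one(emissions, tol=1e-9):
    """Check that the emission probabilities of every state sum to 1."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in emissions.values())


def is_deterministic(emissions):
    """Check that every emission probability is exactly 0 or 1."""
    return all(p in (0.0, 1.0) for row in emissions.values() for p in row.values())


print(rows_sum_to_one(two_state), is_deterministic(two_state))
print(rows_sum_to_one(eight_state), is_deterministic(eight_state))
```

Both tables satisfy the sum-to-1 requirement from the question; only the eight-state one has every entry equal to 0 or 1, which I believe is the case the text is describing.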