So this may seem like a bioinformatics question but it is the math part that is giving me trouble. I'm using a Python package called YAHMM to model DNA sequences. I created a model with two states (state1 and state2) and two silent states (start and end)
Transitions: state1= s1, state2= s2
s1 > s1= 0.4
s1 > s2= 0.4
s1 > end= 0.2
s2 > s2= 0.4
s2 > s1= 0.4
s2 > end= 0.2
State 1 Probabilities:
A= 0.4
T= 0.4
C= 0.1
G= 0.1
State 2 Probabilities:
A = T = C = G = 0.25
My sequence X= TTA and the program gave me this matrix as a result from the forward algorithm computation using natural log probabilities:
[[ -inf -inf 0. -inf]
[-1.60943791 -2.07944154 -inf -2.73336801]
[-2.95651156 -3.42651519 -inf -4.08044166]
[-4.30358521 -4.77358884 -inf -5.42751531]]
Where the columns and rows are labeled:
state 1 | state 2 | start | end
position 1
position 2
position 3
So I can get the first row correct, M(i,j) where i= observation and j= state M(1,1)= ln([transition from start to state1][emission of T])= (ln(0.5*0.4)= -1.609 M(1,2)= ln([transition from start to state1][emission of T])= (ln(0.5*0.25)= -2.079
HERE'S THE PART I CAN'T FIGURE OUT:
In the forward algorithm each entry in the matrix is supposed to be the probability of all paths that could lead to that position in the matrix, i.e. that state for that position. So for M(2,1) it should be a T being emitted from state1 for position 2.
There are two possible paths for this,
path1= start > state 1 emits T > state 1 emits T = [ln(0.5*0.4) + ln(0.4*0.4)]
path2= start > state 2 emits T > state 1 emits T = [ln(0.5*0.25) + ln(0.4*0.4)]
so my computational approach was to add these paths together (since its log space)
[ln(0.5*0.4) + ln(0.4*0.4)] + [ln(0.5*0.25) + ln(0.4*0.4)] = -7.354
This is way off from -2.9565 so can somebody tell me where i'm going wrong? I really want to understand the way this algorithm works.
