I would like to use gradient descent to fit the parameters of a simple 2-state HMM. This paper
- Levinson, S. E., Rabiner, L. R. and Sondhi, M. M. (1983), An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition. Bell System Technical Journal, 62: 1035–1074. doi: 10.1002/j.1538-7305.1983.tb03114.x
shows derivations of the partial derivatives required for gradient descent, but I am having trouble following the steps. Specifically, the paper starts by stating that:
$$P(\mathbf{O}|\mathbf{M})=\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{t}(i)a_{ij}b_{j}(O_{t+1})\beta_{t+1}(j)$$
for any $t$ such that $1 \leq t \leq T-1$, where $\mathbf{O} = O_1 \dots O_T$ is the observation sequence and $\mathbf{M}$ represents the transition and emission matrices. $\alpha$ is the forward probability and $\beta$ is the backward probability, $a_{ij}$ is the probability of transitioning from state $i$ to state $j$, and $b_j(O_{t+1})$ is the probability of observing $O_{t+1}$ given state $j$. Finally, $N$ is the number of states.
To run gradient descent, one needs to calculate the partial derivatives with respect to model parameters. The paper derives:
$$\frac{\partial{P}}{\partial{a_{ij}}}=\sum_{t=1}^{T-1}\alpha_t(i)b_j(O_{t+1})\beta_{t+1}(j)$$
But the exact steps for reaching this formula are not shown. Can anyone elaborate on how this result was obtained?
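In case it helps to make the question concrete, here is a small NumPy sketch (the 2-state transition/emission matrices and the observation sequence are made up for illustration) that verifies both the quoted identity and the derivative formula numerically, the latter by central finite differences with each $a_{ij}$ treated as a free, unconstrained parameter. So the formula does check out numerically; what I am missing is the algebraic route to it.

```python
import numpy as np

# Toy 2-state HMM with discrete emissions. All parameter values and
# the observation sequence are made up purely for illustration.
N = 2                                    # number of states
A = np.array([[0.7, 0.3],                # a_ij: transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],                # b_j(k): emission probabilities
              [0.1, 0.9]])
pi = np.array([0.6, 0.4])                # initial state distribution
O = [0, 1, 1, 0, 1]                      # observations O_1..O_T, here T = 5
T = len(O)

def forward_backward(A):
    """Standard forward/backward recursions; returns (alpha, beta)."""
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return alpha, beta

def likelihood(A):
    """P(O|M) = sum_i alpha_T(i)."""
    alpha, _ = forward_backward(A)
    return alpha[-1].sum()

alpha, beta = forward_backward(A)
P = likelihood(A)

# The quoted identity holds for every t = 1..T-1 (0-based: 0..T-2):
# P = sum_ij alpha_t(i) * a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
for t in range(T - 1):
    assert np.isclose(P, np.sum(np.outer(alpha[t], B[:, O[t + 1]] * beta[t + 1]) * A))

# Analytic gradient from the paper's formula:
# dP/da_ij = sum_{t=1}^{T-1} alpha_t(i) * b_j(O_{t+1}) * beta_{t+1}(j)
dP = np.zeros((N, N))
for t in range(T - 1):
    dP += np.outer(alpha[t], B[:, O[t + 1]] * beta[t + 1])

# Central finite differences, treating each a_ij as unconstrained
eps, fd = 1e-6, np.zeros((N, N))
for i in range(N):
    for j in range(N):
        Ap, Am = A.copy(), A.copy()
        Ap[i, j] += eps
        Am[i, j] -= eps
        fd[i, j] = (likelihood(Ap) - likelihood(Am)) / (2 * eps)

print(np.max(np.abs(dP - fd)))           # only finite-difference noise remains
```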
Thank you,