I am interested in finding the distribution $p^*$ closest to an empirical distribution $\hat{p}$, where $p^*$ is a mixture of a first-order and a zeroth-order Markov model. That is, I want to find $$ p^* = \arg\min_p \sum_{i} D\left(\,\hat{p}(\cdot \mid i)\, \| \, p(\cdot \mid i) \,\right) $$ subject to the following constraints:
- $p(j|i) = \alpha \cdot r(j) + (1 - \alpha) \cdot q(j|i)$
- $r(j) \ge 0$ for all $j$
- $\sum_j r(j) = 1$
- $q(j|i) \ge 0$ for all $i$ and $j$
- $\sum_j q(j|i) = 1$ for all $i$
where $\alpha$ is a mixture parameter in $[0,1]$ that is given and fixed.
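To make this concrete, here is a rough NumPy sketch of the mixture and the objective I have in mind (toy numbers, my own variable names, and taking $D$ to be the KL divergence):

```python
import numpy as np

def mixture(r, q, alpha):
    # p(j|i) = alpha * r(j) + (1 - alpha) * q(j|i); rows index the conditioning state i.
    return alpha * r[None, :] + (1.0 - alpha) * q

def kl_objective(p_hat, r, q, alpha, eps=1e-12):
    # Sum over i of KL( p_hat(.|i) || p(.|i) ), taking D to be the KL divergence.
    p = mixture(r, q, alpha)
    return float(np.sum(p_hat * (np.log(p_hat + eps) - np.log(p + eps))))

# Toy example with a vocabulary of size m = 3 (numbers are made up).
m = 3
rng = np.random.default_rng(0)
p_hat = rng.dirichlet(np.ones(m), size=m)   # empirical conditionals, each row sums to 1
r = np.ones(m) / m                          # zeroth-order component
q = rng.dirichlet(np.ones(m), size=m)       # first-order component, each row sums to 1
alpha = 0.3                                 # given and fixed
print(kl_objective(p_hat, r, q, alpha))
```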
I know I can hand this off to a solver, but I am actually interested in deriving the updates and writing the optimization procedure myself. Therefore, any assistance in this endeavor is greatly appreciated.
The first thing that comes to mind is good old stochastic gradient descent. With a vocabulary of size $m$, you have $m^2+m+1$ parameters to learn: all $q_{ij}$, all $r_j$, and $\alpha$. (You said $\alpha$ is given and fixed; I treat it as learnable for generality, and if it really is fixed you can simply skip its update below.)
I will use a softmax reparametrization to avoid the constraints: $r_j = \frac{\exp(a_j)}{\sum_{k=1}^{m}\exp(a_k)}$, $q_{ij} = \frac{\exp(b_{ij})}{\sum_{k=1}^{m}\exp(b_{ik})}$, $\alpha=\frac{1}{1+\exp(-c)}$. This makes all probabilities strictly positive, but I think that is more a feature than a bug.
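A minimal sketch of that reparametrization in NumPy (variable names and the zero initialization are mine):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def sigmoid(c):
    return 1.0 / (1.0 + np.exp(-c))

# Unconstrained parameters (zeros are just an arbitrary initialization).
m = 3
a = np.zeros(m)        # r_j = softmax(a)_j
b = np.zeros((m, m))   # q_ij = softmax of row i of b
c = 0.0                # alpha = sigmoid(c)

r = softmax(a)          # shape (m,), sums to 1, strictly positive
q = softmax(b, axis=1)  # shape (m, m), each row sums to 1
alpha = sigmoid(c)
```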
For each training pair $ij$, the log-likelihood is $$ ll_{ij} = \log\left(\alpha r_j + (1-\alpha) q_{ij}\right). $$ We can randomly choose a pair $ij$ and update the active parameters by adding $\lambda$ times the gradient of the log-likelihood:
$$c += \lambda\frac{r_j - q_{ij}}{\alpha r_j + (1-\alpha) q_{ij}}\alpha(1-\alpha)$$
$$a_j += \lambda\frac{\alpha}{\alpha r_j + (1-\alpha) q_{ij}}r_j(1-r_j)$$
$$a_l += \lambda\frac{-\alpha}{\alpha r_j + (1-\alpha) q_{ij}}r_j r_l, \quad l\neq j$$
$$b_{ij} += \lambda\frac{1-\alpha}{\alpha r_j + (1-\alpha) q_{ij}}q_{ij}(1-q_{ij})$$
$$b_{il} += \lambda\frac{-(1-\alpha)}{\alpha r_j + (1-\alpha) q_{ij}}q_{ij}q_{il}, \quad l \neq j$$
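Here is a rough sketch of one such stochastic update, reusing the `softmax` and `sigmoid` helpers from the previous snippet; take it as an illustration of the formulas rather than tested code:

```python
def sgd_update(a, b, c, i, j, lr_ab, lr_c):
    # One stochastic update for an observed transition i -> j,
    # applying the gradients above; a and b are modified in place.
    r = softmax(a)
    q_i = softmax(b[i])                        # only row i of q is active
    alpha = sigmoid(c)
    p = alpha * r[j] + (1.0 - alpha) * q_i[j]  # mixture probability of the pair

    # c (i.e. alpha)
    c = c + lr_c * (r[j] - q_i[j]) / p * alpha * (1.0 - alpha)

    # a: the l != j case first, then overwrite the j-th entry
    grad_a = -alpha / p * r[j] * r
    grad_a[j] = alpha / p * r[j] * (1.0 - r[j])
    a += lr_ab * grad_a

    # row i of b: same pattern for q
    grad_b = -(1.0 - alpha) / p * q_i[j] * q_i
    grad_b[j] = (1.0 - alpha) / p * q_i[j] * (1.0 - q_i[j])
    b[i] += lr_ab * grad_b

    return a, b, c
```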
I would advise optimizing $a$ and $b$ separately from $c$, since they will probably need different learning rates $\lambda$.
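For example, a toy training loop along those lines (made-up counts and learning rates), sampling pairs in proportion to their empirical frequencies and reusing `sgd_update`, `m`, `a`, `b`, `c` from above:

```python
# Sample pairs (i, j) in proportion to empirical transition counts and use
# separate learning rates for (a, b) and for c.
rng = np.random.default_rng(1)
counts = rng.integers(1, 10, size=(m, m))      # made-up empirical counts
pair_probs = (counts / counts.sum()).ravel()

for step in range(10_000):
    idx = rng.choice(m * m, p=pair_probs)
    i, j = divmod(int(idx), m)
    a, b, c = sgd_update(a, b, c, i, j, lr_ab=0.1, lr_c=0.01)

r, q, alpha = softmax(a), softmax(b, axis=1), sigmoid(c)
```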