Finding a mixture of first- and zeroth-order Markov models that is closest to an empirical distribution


I am interested in finding the distribution $p^*$ closest to an empirical distribution $\hat{p}$, where $p^*$ is a mixture of a first-order and a zeroth-order Markov model. That is, I want to find $$ p^* = \arg\min_p \sum_{i,j} D\left(\,\hat{p}(j|i)\, \| \, p(j|i) \,\right) $$ subject to the following constraints:

  1. $p(j|i) = \alpha \cdot r(j) + (1 - \alpha) \cdot q(j|i)$
  2. $r(j) \ge 0$ for all $j$
  3. $\sum_j r(j) = 1$
  4. $q(j|i) \ge 0$ for all $i$ and $j$
  5. $\sum_j q(j|i) = 1$ for all $i$

where $\alpha$ is a mixture parameter in $[0,1]$ that is given and fixed.
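To make the setup concrete, here is a minimal sketch of constraint 1 (the names and the Dirichlet initialization are mine, purely for illustration), with the states indexed $0,\dots,m-1$ and $q$ stored as a row-stochastic array:

```python
import numpy as np

m = 4                     # state-space / vocabulary size (illustrative)
alpha = 0.3               # given, fixed mixture weight

rng = np.random.default_rng(0)
r = rng.dirichlet(np.ones(m))            # zeroth-order model: r(j) >= 0, sums to 1
q = rng.dirichlet(np.ones(m), size=m)    # first-order model: each row q(.|i) sums to 1

# Constraint 1: p(j|i) = alpha * r(j) + (1 - alpha) * q(j|i), row-stochastic by construction
p = alpha * r[np.newaxis, :] + (1 - alpha) * q
assert np.allclose(p.sum(axis=1), 1.0)
```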

I know I can hand this off to a solver, but I am actually interested in deriving the updates and writing the optimization procedure myself. Therefore, any assistance in this endeavor is greatly appreciated.

The first thing that comes to mind is good old stochastic gradient descent. With a vocabulary of size $m$, you have $m^2+m+1$ parameters to learn: all the $q_{ij}$, all the $r_j$, and $\alpha$.

I will use a softmax reparametrization to avoid the constraints: $r_j = \frac{\exp(a_j)}{\sum_{k=1}^{m}\exp(a_k)}$, $q_{ij} = \frac{\exp(b_{ij})}{\sum_{k=1}^{m}\exp(b_{ik})}$, $\alpha=\frac{1}{1+\exp(-c)}$. This makes all probabilities strictly positive, but I think that is more of a feature than a bug.
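As a sketch of what this reparametrization looks like in code (function and variable names are my own; the max-subtraction is only for numerical stability and does not change the formulas):

```python
import numpy as np

def to_probabilities(a, b, c):
    """Map unconstrained parameters (a, b, c) to (r, q, alpha) via softmax / sigmoid."""
    r = np.exp(a - a.max())
    r /= r.sum()                                   # r_j = exp(a_j) / sum_k exp(a_k)
    q = np.exp(b - b.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)              # q_ij = exp(b_ij) / sum_k exp(b_ik)
    alpha = 1.0 / (1.0 + np.exp(-c))               # alpha = sigmoid(c)
    return r, q, alpha
```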

For each training pair $(i,j)$, the log-likelihood is $$ \ell_{ij} = \log\left(\alpha r_j + (1-\alpha) q_{ij}\right). $$ We can randomly choose a pair $(i,j)$ and update the active parameters by the gradient of the log-likelihood times a learning rate $\lambda$: $$c \mathrel{+}= \lambda\,\frac{r_j - q_{ij}}{\alpha r_j + (1-\alpha) q_{ij}}\,\alpha(1-\alpha)$$ $$a_j \mathrel{+}= \lambda\,\frac{\alpha}{\alpha r_j + (1-\alpha) q_{ij}}\,r_j(1-r_j)$$ $$a_l \mathrel{+}= \lambda\,\frac{-\alpha}{\alpha r_j + (1-\alpha) q_{ij}}\,r_j r_l, \quad l\neq j$$ $$b_{ij} \mathrel{+}= \lambda\,\frac{1-\alpha}{\alpha r_j + (1-\alpha) q_{ij}}\,q_{ij}(1-q_{ij})$$ $$b_{il} \mathrel{+}= \lambda\,\frac{-(1-\alpha)}{\alpha r_j + (1-\alpha) q_{ij}}\,q_{ij}q_{il}, \quad l \neq j$$

I would advise optimizing $a$ and $b$ separately from $c$, because they will probably need different learning rates $\lambda$.
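Putting the pieces together, here is a minimal SGD sketch of the updates above (all names are mine; it assumes the empirical distribution is given as an $m \times m$ array of transition counts from which training pairs $(i,j)$ are sampled, and it uses separate learning rates for $(a,b)$ and $c$ as suggested):

```python
import numpy as np

def fit_mixture(counts, n_steps=100_000, lr_ab=0.1, lr_c=0.01, seed=0):
    """SGD on the softmax-reparametrized mixture p(j|i) = alpha*r(j) + (1-alpha)*q(j|i)."""
    rng = np.random.default_rng(seed)
    m = counts.shape[0]
    a = np.zeros(m)          # logits for r
    b = np.zeros((m, m))     # logits for q
    c = 0.0                  # logit for alpha

    # sample training pairs (i, j) in proportion to the empirical counts
    flat = counts.ravel() / counts.sum()
    pairs = rng.choice(m * m, size=n_steps, p=flat)

    for idx in pairs:
        i, j = divmod(idx, m)
        r = np.exp(a - a.max()); r /= r.sum()
        q_i = np.exp(b[i] - b[i].max()); q_i /= q_i.sum()
        alpha = 1.0 / (1.0 + np.exp(-c))
        denom = alpha * r[j] + (1.0 - alpha) * q_i[j]

        # gradient-ascent steps, exactly the updates derived above
        c += lr_c * (r[j] - q_i[j]) / denom * alpha * (1.0 - alpha)
        grad_a = -(alpha / denom) * r[j] * r                    # components a_l, l != j
        grad_a[j] = (alpha / denom) * r[j] * (1.0 - r[j])       # component a_j
        a += lr_ab * grad_a
        grad_b = -((1.0 - alpha) / denom) * q_i[j] * q_i        # components b_il, l != j
        grad_b[j] = ((1.0 - alpha) / denom) * q_i[j] * (1.0 - q_i[j])  # component b_ij
        b[i] += lr_ab * grad_b

    r = np.exp(a - a.max()); r /= r.sum()
    q = np.exp(b - b.max(axis=1, keepdims=True)); q /= q.sum(axis=1, keepdims=True)
    return r, q, 1.0 / (1.0 + np.exp(-c))
```

For example, `r, q, alpha = fit_mixture(counts)` with `counts[i, j]` holding the number of observed $i \to j$ transitions returns the fitted $r$, $q$, and $\alpha$.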