I'm currently trying to understand the math behind the publication of Xu et al. (Neural Image Caption Generation with Visual Attention).
In this publication they define an objective function $L_s$ which is a variational lower bound on the marginal log-likelihood $\log p(y|a)$ of observing a sequence of words $y$ given image features $a$.
$L_s = \sum_s p(s|a)\log p(y|s,a) \leq \log\left[\sum_s p(s|a)\,p(y|s,a)\right] = \log p(y|a)$ (equation 10)
The objective is to learn parameters $W$ (which are not specified further) by directly optimizing $L_s$.
Can someone explain how the following equation was obtained? $p(y|s,a)$ and $p(s|a)$ are not specified further.
$\frac{\partial L_s}{\partial W} = \sum_s p(s|a)\left[ \frac{\partial log[p(y|s,a)]}{\partial W} + log[p(y|s,a)] \frac{\partial log[p(s|a)]}{\partial W} \right] $
Thanks in advance!
Published literature is not always the best place to find all the terms fully explained.
I haven't read the paper, but presumably the authors aim to perform inference on the distribution $p(y|a)$, which for whatever reason is intractable in its given form, so latent variables $s \in \mathcal{S}$ are introduced; from the question, $W$ is the set of parameters on which the distributions $p(s|a)$ and $p(y|s,a)$ depend.
Since $a$ is kept fixed throughout, I am going to drop it from my notation and reintroduce it at the end. Conventional variational inference tells us that \begin{align*} \ln p(y) = \mathcal{L} + D_{KL}, \end{align*} where \begin{align*} \mathcal{L} = \sum_s q(s) \log \frac{ p(y ,s) }{q(s)} \end{align*} and \begin{align*} D_{KL} = -\sum_s q(s) \log \frac{ p(s|y) }{q(s)}. \end{align*}

Choosing the variational distribution to be the prior, $q(s) = p(s)$ (in the paper's notation, $q(s) = p(s|a)$), we have \begin{align*} p(y,s) = p(y|s)p(s) = p(y|s)q(s), \end{align*} and therefore \begin{align*} \ln p(y) &= \sum_s q(s) \log \frac{p(y,s)}{q(s)}+ D_{KL} \\ &= \sum_s q(s) \log \frac{ p(y|s)q(s) }{q(s)} + D_{KL}\\ &= \sum_s q(s) \log p(y | s) + D_{KL} \\ &= L_s + D_{KL}. \end{align*}

Since $D_{KL} \geq 0$, we have $L_s \leq \log p(y)$, which is the sense in which $L_s$ is a "lower bound" on the log probability. To complete the conversion to their notation, just add the additional conditional dependence on $a$ everywhere.
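The bound can be sanity-checked numerically on a toy discrete model. The sketch below uses made-up distributions (my own choice, not from the paper): `p_s` plays the role of $p(s|a)$ and `p_y_given_s` the role of $p(y|s,a)$ for one fixed observation $y$; Jensen's inequality then guarantees $L_s \leq \log p(y|a)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete model with 4 latent states (illustrative, not from the paper):
# p_s[i] stands in for p(s|a); p_y_given_s[i] for p(y|s,a) at the observed y.
p_s = rng.dirichlet(np.ones(4))           # valid distribution over s
p_y_given_s = rng.uniform(0.05, 0.95, 4)  # likelihood of the observed y under each s

# Lower bound L_s = sum_s p(s|a) log p(y|s,a), i.e. q(s) = p(s|a) as above
L_s = np.sum(p_s * np.log(p_y_given_s))

# Marginal log-likelihood: log p(y|a) = log sum_s p(s|a) p(y|s,a)
log_p_y = np.log(np.sum(p_s * p_y_given_s))

# Jensen's inequality: the bound holds (strictly, unless p(y|s,a) is constant in s)
assert L_s <= log_p_y
```

The gap between the two quantities is exactly the KL divergence $D_{KL}$ from the decomposition above.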
Now, to maximise the marginal log-likelihood for a fixed value of $a$, we can proceed to make $L_s$ as large as possible. Their notation is a bit clunky, but from the question $W$ is the set of parameters on which both $p(s|a)$ and $p(y|s,a)$ depend. Applying the product rule, and then the log-derivative identity $\frac{\partial p}{\partial W} = p\,\frac{\partial \log p}{\partial W}$ to the first term, we have \begin{align*} \frac{\partial}{\partial W}L_s &= \sum_s \frac{\partial p(s|a)}{\partial W}\log p(y | s , a) + p(s|a) \frac{\partial \log p(y | s ,a)}{\partial W} \\ &= \sum_s \left( p(s | a)\frac{\partial \log p(s|a)}{\partial W} \right)\log p(y|s , a) + p(s |a)\frac{\partial \log p(y |s,a)}{\partial W} \\ &= \sum_s p(s|a) \left[ \frac{\partial \log p(y| s, a )}{\partial W} + \log p(y| s,a)\,\frac{\partial \log p(s|a)}{\partial W} \right], \end{align*} which is exactly the expression you quoted.
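The gradient identity can also be verified numerically against finite differences. The parameterization below is entirely my own toy choice (the paper uses neural networks): a scalar $W$ with $p(s|a) = \mathrm{softmax}(Wc)_s$ and $p(y|s,a) = \sigma(W d_s)$ for fixed vectors $c$, $d$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy parameterization (an assumption for illustration, not the paper's model):
# p(s|a)   = softmax(W * c)_s   over 4 latent states
# p(y|s,a) = sigmoid(W * d_s)   likelihood of the observed y
c = np.array([0.5, -1.0, 2.0, 0.3])
d = np.array([1.0, 0.2, -0.5, 1.5])
W = 0.7

def L(W):
    """L_s(W) = sum_s p(s|a) log p(y|s,a)."""
    return np.sum(softmax(W * c) * np.log(sigmoid(W * d)))

# Analytic gradient via the quoted identity:
# dL/dW = sum_s p(s|a) [ d log p(y|s,a)/dW + log p(y|s,a) * d log p(s|a)/dW ]
p_s = softmax(W * c)
dlog_py = d * (1.0 - sigmoid(W * d))   # d/dW of log sigmoid(W * d_s)
dlog_ps = c - np.sum(p_s * c)          # d/dW of log softmax(W * c)_s
grad = np.sum(p_s * (dlog_py + np.log(sigmoid(W * d)) * dlog_ps))

# Central finite-difference check
eps = 1e-6
grad_fd = (L(W + eps) - L(W - eps)) / (2 * eps)
assert abs(grad - grad_fd) < 1e-5
```

The second term inside the brackets is the score-function (REINFORCE-style) part of the gradient, which is why the paper can estimate it by sampling $s \sim p(s|a)$.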