I'm currently trying to understand the math in the paper by Xu et al. (Show, Attend and Tell: Neural Image Caption Generation with Visual Attention).
From the paper:
There is an objective function $L_s$, which is a variational lower bound on the marginal log-likelihood $\log p(y|a)$:
\begin{align*} L_s = \sum_s p(s|a) \log p(y|s,a) \leq \log \sum_s p(s|a)\, p(y|s,a) = \log p(y|a) \end{align*}
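For what it's worth, the inequality here appears to be Jensen's inequality applied to the concave $\log$, writing the sum over $s$ as an expectation under $p(s|a)$:

\begin{align*} \sum_s p(s|a) \log p(y|s,a) = \mathbb{E}_{p(s|a)}\!\left[\log p(y|s,a)\right] \leq \log \mathbb{E}_{p(s|a)}\!\left[p(y|s,a)\right] = \log \sum_s p(s|a)\, p(y|s,a) \end{align*}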
The objective is to learn the parameters $W$ by optimising $L_s$:
\begin{align*} \frac{\partial L_s}{\partial W} = \sum_s p(s|a)\left[ \frac{\partial \log p(y|s,a)}{\partial W} + \log p(y|s,a)\, \frac{\partial \log p(s|a)}{\partial W} \right] \end{align*}
I found an answer (2) which derives the gradient of this variational lower bound on the marginal log-likelihood $\log p(y|a)$:
From (2): \begin{align*} \frac{\partial}{\partial W}L_s &= \sum_s \frac{\partial p(s|a)}{\partial W}\log p(y | s , a) + p(s|a) \frac{\partial p(y | s ,a)}{\partial W} \\ &= \sum_s \left( p(s | a)\frac{\partial \log p(s|a)}{\partial W} \right)\log p(y|s , a) + p(s |a)\frac{\partial \log p(y |s,a)}{\partial W} \\ &= \sum_s p(s|a) \left[ \frac{\partial \log p(s|a)}{\partial W}\log p(y| s,a) + \frac{\partial \log p(y| s, a )}{\partial W} \right] \end{align*}
I don't understand how the derivative is formed. As I understand it, differentiating $L_s$ with the product rule gives:
\begin{align*} f &= p(s|a) \\ f' &= \frac{\partial p(s|a)}{\partial W} \\ g &= \log p(y|s,a) \end{align*}
Why, in the first line of the answer's derivation, is the derivative of $g$ this:
\begin{align*} g' &= \frac{\partial p(y| s, a )}{\partial W} \end{align*}
and not this:
\begin{align*} g' &= \frac{1}{p(y| s, a )} \frac{\partial p(y| s, a )}{\partial W} \end{align*}
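For reference, the rule I'm applying for the derivative of a log (just the chain rule, sometimes rearranged as the "log-derivative trick") is:

\begin{align*} \frac{\partial \log f}{\partial W} = \frac{1}{f}\,\frac{\partial f}{\partial W} \quad\Longleftrightarrow\quad \frac{\partial f}{\partial W} = f\,\frac{\partial \log f}{\partial W} \end{align*}

with $f = p(y|s,a)$ or $f = p(s|a)$.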
Secondly, I don't follow the transition from the first line of the derivative to the second. When the first term is multiplied by the log, it ends up with $p(s|a)$ out front: \begin{align*} p(s|a) \frac{\partial \log p(s|a)}{\partial W} \end{align*}
Whereas the second term only gets a log introduced: \begin{align*} \frac{\partial \log p(y |s,a)}{\partial W} \end{align*}
Why is this?
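In case it clarifies my question, here is a small numerical sketch (entirely my own; `p` is a made-up positive scalar function of $W$, not anything from the paper) checking the identity $\frac{\partial p}{\partial W} = p\,\frac{\partial \log p}{\partial W}$ that the rewriting seems to rely on:

```python
import math

# Hypothetical smooth, positive "probability" as a function of a
# scalar parameter W; any such function works for checking the identity.
def p(W):
    return math.exp(-W * W) / (1.0 + W * W)

def finite_diff(f, W, h=1e-6):
    # Central finite-difference approximation of df/dW.
    return (f(W + h) - f(W - h)) / (2.0 * h)

W = 0.7
dp_dW = finite_diff(p, W)                             # dp/dW directly
dlogp_dW = finite_diff(lambda w: math.log(p(w)), W)   # d(log p)/dW

# Log-derivative identity: dp/dW = p * d(log p)/dW
print(dp_dW, p(W) * dlogp_dW)
```

The two printed values agree up to finite-difference error, which is what I'd expect from the chain rule.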
Thanks!