Variational Lower Bound on marginal log-likelihood


I'm currently trying to understand the math behind the publication by Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention".

In this publication they define an objective function $L_s$ which is a variational lower bound on the marginal log-likelihood $\log p(y|a)$ of observing a sequence of words $y$ given image features $a$:

$L_s = \sum_s p(s|a) \log p(y|s,a) \leq \log\left[\sum_s p(s|a)\,p(y|s,a)\right] = \log p(y|a)$ (equation 10 in the paper)
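
To make the inequality concrete, here is a quick numerical check with made-up toy numbers (a discrete $s$ with three states; none of these values come from the paper):

```python
import numpy as np

# Toy discrete example (values assumed for illustration, not from the paper):
# s ranges over 3 attention locations.
p_s_given_a = np.array([0.5, 0.3, 0.2])   # p(s|a), sums to 1
p_y_given_sa = np.array([0.7, 0.1, 0.4])  # p(y|s,a) for each s

# L_s = sum_s p(s|a) log p(y|s,a)
L_s = np.sum(p_s_given_a * np.log(p_y_given_sa))

# log p(y|a) = log sum_s p(s|a) p(y|s,a)
log_p_y_given_a = np.log(np.sum(p_s_given_a * p_y_given_sa))

print(f"L_s        = {L_s:.4f}")         # -1.0524
print(f"log p(y|a) = {log_p_y_given_a:.4f}")  # -0.7765
assert L_s <= log_p_y_given_a  # Jensen's inequality guarantees this
```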

The objective is to learn the parameters $W$ (which are not further specified) by directly optimizing $L_s$.

Can someone explain how the following equation was obtained? $p(y|s,a)$ and $p(s|a)$ are not further specified.

$\frac{\partial L_s}{\partial W} = \sum_s p(s|a)\left[ \frac{\partial \log p(y|s,a)}{\partial W} + \log p(y|s,a)\, \frac{\partial \log p(s|a)}{\partial W} \right]$

Thanks in advance!

Best answer:

Published literature is not always the best place to find all the terms fully explained.

I haven't read the paper, but presumably the authors are modelling the distribution $p(y|a)$, which for whatever reason is intractable in its given form, so they introduce latent variables $s \in \mathcal{S}$ and marginalise over them; $W$ seems best read as the vector of model parameters, on which both $p(s|a)$ and $p(y|s,a)$ depend.

Now since $a$ is kept fixed throughout, I am going to drop it from my notation and reintroduce it at the end. Conventional variational inference tells us that
\begin{align*}
\ln p(y) = \mathcal{L} + D_{KL},
\end{align*}
where
\begin{align*}
\mathcal{L} = \sum_s q(s) \log \frac{p(y,s)}{q(s)}
\quad \text{and} \quad
D_{KL} = -\sum_s q(s) \log \frac{p(s|y)}{q(s)},
\end{align*}
and this identity holds for any distribution $q(s)$ over the latent variables. The step the paper leaves implicit is the choice of $q$: take $q(s) = p(s)$, the prior over the latent variables (in the paper's notation, $q(s) = p(s|a)$). Then
\begin{align*}
p(y,s) = p(y|s)\,p(s) = p(y|s)\,q(s),
\end{align*}
so that
\begin{align*}
\ln p(y) &= \sum_s q(s) \log \frac{p(y,s)}{q(s)} + D_{KL} \\
&= \sum_s q(s) \log \frac{p(y|s)\,q(s)}{q(s)} + D_{KL} \\
&= \sum_s q(s) \log p(y|s) + D_{KL} \\
&= L_s + D_{KL}.
\end{align*}
Now since $D_{KL} \geq 0$ we have $L_s \leq \log p(y)$, which is the sense in which it is a "lower bound" on the log probability. To complete the conversion to their notation, just add the additional conditional dependence on $a$ everywhere.
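
To sanity-check the decomposition $\ln p(y) = L_s + D_{KL}$, here is a quick numerical verification; the three-state $s$ and the values of $p(s)$ and $p(y|s)$ are made up for illustration:

```python
import numpy as np

# Toy numbers (assumed for illustration): prior p(s) and likelihood p(y|s).
p_s = np.array([0.5, 0.3, 0.2])          # plays the role of p(s|a)
p_y_given_s = np.array([0.7, 0.1, 0.4])  # plays the role of p(y|s,a)

q = p_s  # the choice q(s) = p(s), which makes p(y,s) = p(y|s) q(s)

p_ys = p_y_given_s * p_s   # joint p(y,s)
p_y = p_ys.sum()           # marginal p(y)
p_s_given_y = p_ys / p_y   # posterior p(s|y)

L_s = np.sum(q * np.log(p_y_given_s))        # the lower bound with q = p(s)
D_KL = -np.sum(q * np.log(p_s_given_y / q))  # KL(q || p(s|y)) >= 0

print(np.log(p_y), L_s + D_KL)  # both print -0.7765: ln p(y) = L_s + D_KL
```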

Now to maximise the marginal log-likelihood for a fixed value of $a$ we can try to make $L_s$ as large as possible. I'm finding their notation a bit clunky (perhaps it is clearer in the article), but as noted above $W$ is best read as the vector of model parameters rather than an instance of $s$. Using the product rule, and switching to logarithmic derivatives via $\frac{\partial p}{\partial W} = p\,\frac{\partial \log p}{\partial W}$ on the first term, we have
\begin{align*}
\frac{\partial L_s}{\partial W} &= \sum_s \frac{\partial p(s|a)}{\partial W}\log p(y|s,a) + p(s|a)\frac{\partial \log p(y|s,a)}{\partial W} \\
&= \sum_s \left( p(s|a)\frac{\partial \log p(s|a)}{\partial W} \right)\log p(y|s,a) + p(s|a)\frac{\partial \log p(y|s,a)}{\partial W} \\
&= \sum_s p(s|a)\left[ \frac{\partial \log p(y|s,a)}{\partial W} + \log p(y|s,a)\,\frac{\partial \log p(s|a)}{\partial W} \right],
\end{align*}
which is exactly the expression in the question. (Note that the second summand in the first line already differentiates $\log p(y|s,a)$ directly, since $L_s$ contains the log; only the first summand needs the log-derivative trick.)
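
In practice the sum over $s$ is intractable and the paper approximates this gradient by Monte Carlo, sampling $s \sim p(s|a)$ (their equation 11). Here is a minimal sketch of that estimator; the softmax parameterisation of $p(s|a)$ and the fixed vector standing in for $\log p(y|s,a)$ are assumptions for illustration, not the paper's network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch, not the paper's model: p(s|a) is a softmax over three
# locations with logits W (our stand-in for the parameters), and
# log p(y|s,a) is a fixed vector r that does not depend on W, so the
# d log p(y|s,a)/dW term of the gradient vanishes here.
W = np.array([0.2, -0.1, 0.3])         # logits of p(s|a)
r = np.log(np.array([0.7, 0.1, 0.4]))  # stand-in for log p(y|s,a)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(W)

# Exact gradient of L_s = sum_s p(s|a) r[s] w.r.t. W, using the softmax
# score function d log p(s|a)/dW[k] = 1{s=k} - p[k]:
exact = sum(p[s] * r[s] * (np.eye(3)[s] - p) for s in range(3))

# Monte Carlo estimate: sample s ~ p(s|a) and average
# r[s] * d log p(s|a)/dW over the samples.
N = 200_000
samples = rng.choice(3, size=N, p=p)
score = np.eye(3)[samples] - p                    # (N, 3) score functions
mc = (r[samples][:, None] * score).mean(axis=0)

print("exact:      ", exact)
print("monte carlo:", mc)  # agrees with exact up to sampling noise
```

This kind of score-function estimator has high variance; if I recall the paper correctly, they stabilise it with a moving-average baseline and an entropy term.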