Gradient of the loglikelihood for the RSM (contrastive divergence)


I'm implementing an RSM (Replicated Softmax Model) in TensorFlow, and I've realized that when I use the energy function and let TensorFlow compute the gradient, the output of my RSM is pure nonsense. On the contrary, when I update the weights by hand with the update mentioned everywhere, the outputs of my RSM make sense, yet it seems (to me) that the update I see everywhere is wrong. For example, in this code (line 70-72) or in this code (line 45-47), the update is similar except that one uses the sum and the other uses the mean to update the biases $b_v$ and $b_h$.

So actually I don't understand anything. What is the correct update, and how can I make it work in TensorFlow just by letting TensorFlow compute the gradient of the difference of the energy function?

Here is what I have so far:

According to the paper by Hinton et al. we have that:

$P(\mathbf{V}) = \dfrac{1}{\mathbf{Z}} \sum\limits_{h} exp(-E(\mathbf{V}, \mathbf{h}))$

where:

$E(\mathbf{V}, \mathbf{h}) = - \sum\limits_{j=1}^F \sum\limits_{k=1}^K W_{j}^k h_j \widehat{v}^k - \sum\limits_{k=1}^K \widehat{v}^k b_v^k - D \sum\limits_{j=1}^F h_j b_{h,j}$

respectively equations (2) and (5) from the mentioned paper. So let me derive the gradient of the log-likelihood of a general RSM model that follows equation (2):

$\dfrac{\partial}{\partial \mathbf{\theta}} log\; p(\mathbf{V}; \mathbf{\theta}) \\ = \dfrac{\partial}{\partial \mathbf{\theta}} log\left( \dfrac{\sum_{\mathbf{h'}} exp(-E(\mathbf{V}, \mathbf{h'}, \mathbf{\theta})) }{Z(\mathbf{\theta})} \right) \\ = \dfrac{\partial}{\partial \mathbf{\theta}} log \sum_{\mathbf{h'}} exp(-E(\mathbf{V}, \mathbf{h'}, \mathbf{\theta})) - \dfrac{\partial}{\partial \mathbf{\theta}} log \; Z(\mathbf{\theta}) \\ = \dfrac{ \dfrac{\partial}{\partial \mathbf{\theta}} \sum_{\mathbf{h'}} exp(-E(\mathbf{V}, \mathbf{h'}, \mathbf{\theta}))}{\sum_{\mathbf{h'}} exp(-E(\mathbf{V}, \mathbf{h'}, \mathbf{\theta}))} - \dfrac{\dfrac{\partial}{\partial \mathbf{\theta}} Z(\mathbf{\theta})}{Z(\mathbf{\theta})} \\ = \dfrac{\sum_{\mathbf{h'}} exp(-E(\mathbf{V}, \mathbf{h'}, \mathbf{\theta})) \dfrac{\partial}{\partial \mathbf{\theta}}(-E(\mathbf{V}, \mathbf{h'}, \mathbf{\theta})) }{\sum_{\mathbf{h'}} exp(-E(\mathbf{V}, \mathbf{h'}, \mathbf{\theta}))} - \dfrac{\sum_{\mathbf{V'}, \mathbf{h'}} exp(-E(\mathbf{V'}, \mathbf{h'}, \mathbf{\theta})) \dfrac{\partial}{\partial \mathbf{\theta}}(-E(\mathbf{V'}, \mathbf{h'}, \mathbf{\theta}))}{Z(\mathbf{\theta})} \\ = \sum\limits_{\mathbf{h'}} \dfrac{exp(-E(\mathbf{V}, \mathbf{h'}, \mathbf{\theta}))}{\sum_{\mathbf{h''}} exp(-E(\mathbf{V}, \mathbf{h''}, \mathbf{\theta}))} \dfrac{\partial}{\partial \mathbf{\theta}}\left( -E(\mathbf{V}, \mathbf{h'}, \mathbf{\theta}) \right) - \sum\limits_{\mathbf{V'}, \mathbf{h'}} \dfrac{exp(-E(\mathbf{V'}, \mathbf{h'}, \mathbf{\theta}))}{Z(\mathbf{\theta})} \dfrac{\partial}{\partial \mathbf{\theta}}\left( -E(\mathbf{V'}, \mathbf{h'}, \mathbf{\theta}) \right) \\ = \sum\limits_{\mathbf{h}} p(\mathbf{h}|\mathbf{V}; \mathbf{\theta}) \dfrac{\partial}{\partial \mathbf{\theta}}\left( -E(\mathbf{V}, \mathbf{h}, \mathbf{\theta}) \right) - \sum\limits_{\mathbf{V'}, \mathbf{h'}} p(\mathbf{V'}, \mathbf{h'}; \mathbf{\theta}) \dfrac{\partial}{\partial \mathbf{\theta}}\left( -E(\mathbf{V'}, \mathbf{h'}, \mathbf{\theta}) \right) \\ 
= \sum\limits_{\mathbf{h}} p(\mathbf{h}|\mathbf{V}; \mathbf{\theta}) \dfrac{\partial}{\partial \mathbf{\theta}}\left( -E(\mathbf{V}, \mathbf{h}, \mathbf{\theta}) \right) - \sum\limits_{\mathbf{V'}} p(\mathbf{V'}; \mathbf{\theta}) \sum\limits_{\mathbf{h'}} p(\mathbf{h'}| \mathbf{V'}; \mathbf{\theta}) \dfrac{\partial}{\partial \mathbf{\theta}}\left( -E(\mathbf{V'}, \mathbf{h'}, \mathbf{\theta}) \right) \\ = F_{\mathbf{\theta}}(\mathbf{V}, \mathbf{\theta}) - \mathbb{E}_{p(\mathbf{V}; \mathbf{\theta})}\left[ F_{\mathbf{\theta}}(\mathbf{V}, \mathbf{\theta}) \right]$

with

$F_{\mathbf{\theta}}(\mathbf{u}, \mathbf{\theta}) \triangleq \sum\limits_{\mathbf{h}} p(\mathbf{h}| \mathbf{u}; \mathbf{\theta}) \dfrac{\partial}{\partial \theta}\left( -E(\mathbf{u}, \mathbf{h}; \mathbf{\theta}) \right)$

So we can rewrite it:

$\dfrac{\partial}{\partial \mathbf{\theta}} log\; p(\mathbf{V}; \mathbf{\theta}) = F_{\mathbf{\theta}}(\mathbf{V}, \mathbf{\theta}) - \mathbb{E}_{p(\mathbf{V}; \mathbf{\theta})}\left[ F_{\mathbf{\theta}}(\mathbf{V}, \mathbf{\theta}) \right] \\ = \sum\limits_{\mathbf{h}} p(\mathbf{h}|\mathbf{V}; \mathbf{\theta}) \dfrac{\partial}{\partial \mathbf{\theta}}\left( -E(\mathbf{V}, \mathbf{h}, \mathbf{\theta}) \right) - \mathbb{E}_{p(\mathbf{V}; \mathbf{\theta})} \left[\sum\limits_{\mathbf{h}} p(\mathbf{h}| \mathbf{V}; \mathbf{\theta}) \dfrac{\partial}{\partial \mathbf{\theta}}\left( -E(\mathbf{V}, \mathbf{h}, \mathbf{\theta}) \right) \right] \\ \approx \sum\limits_{\mathbf{h}} p(\mathbf{h}|\mathbf{V}^{(0)}; \mathbf{\theta}) \dfrac{\partial}{\partial \mathbf{\theta}}\left( -E(\mathbf{V}^{(0)}, \mathbf{h}, \mathbf{\theta}) \right) - \sum\limits_{\mathbf{h}} p(\mathbf{h}| \mathbf{V}^{(k)}; \mathbf{\theta}) \dfrac{\partial}{\partial \mathbf{\theta}}\left( -E(\mathbf{V}^{(k)}, \mathbf{h}, \mathbf{\theta}) \right)$

where the approximation comes from contrastive divergence, which uses Gibbs sampling, and $k$ refers to the number of Gibbs sampling iterations.

So, all in all, the gradient is just the difference between the expectation taken over the input data and the expectation taken after $k$ iterations of Gibbs sampling.
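To make the CD-$k$ picture concrete, here is a minimal NumPy sketch of one Gibbs step for the RSM, under my reading of the conditionals implied by the energy above, namely $p(h_j = 1 | \mathbf{V}) = \sigma(D b_{h,j} + (W\widehat{v})_j)$ and a shared softmax over the vocabulary for the down-pass (all dimensions and parameter values are toy placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
K, F = 6, 4                               # vocabulary size, hidden units (toy values)
W  = rng.normal(scale=0.1, size=(F, K))   # W[j, k]
bv = rng.normal(scale=0.1, size=K)        # visible biases b_v
bh = rng.normal(scale=0.1, size=F)        # hidden biases b_h

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gibbs_step(v, D):
    # up-pass: sample h_j ~ Bernoulli(sigma(D * b_hj + (W v)_j))
    h = (rng.random(F) < sigma(D * bh + W @ v)).astype(float)
    # down-pass: redraw all D words from the shared softmax over the vocabulary
    p = softmax(W.T @ h + bv)
    return rng.multinomial(int(D), p).astype(float)

v0 = np.array([3., 0., 1., 0., 2., 1.])   # observed word counts, D_n = 7
vk = v0
for _ in range(1):                        # CD-1: a single Gibbs step
    vk = gibbs_step(vk, v0.sum())

print(vk.sum() == v0.sum())               # the document length D_n is preserved: True
```

The key RSM-specific detail is that the reconstruction redraws exactly $D_n$ words, so the document length is held fixed across Gibbs steps.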

Moreover, we can prove that the marginalized energy (free energy) of the RSM can be written as:

$E(\mathbf{V_n}) = -\sum\limits_{k=1}^K \widehat{v}_n^{k} b_v^{k} - \sum\limits_{j=1}^F log \left(1 + exp \Big( D_nb_{h,j} + \sum\limits_{k=1}^K \widehat{v}_n^{k} W_j^k \Big) \right)$

Indeed, we have:

$E(\mathbf{V}_n) = -log \left[ \sum\limits_{h_n} exp \left( - E(\mathbf{V}_n, h_n) \right) \right] \\ = -log \left[ \sum\limits_{h_n} exp \Big( \sum\limits_{k=1}^K \widehat{v}_n^{k} b_v^k \Big) exp \left( \sum\limits_{j=1}^F h_{n,j} \Big( D_n b_{h,j} + \sum\limits_{k=1}^K W_j^k \widehat{v}_n^{k} \Big) \right)\right] \\ \stackrel{1}{=} - \sum\limits_{k=1}^K \widehat{v}_n^{k} b_v^k - log \left( \sum\limits_{h_n} exp \left( \sum\limits_{j=1}^F h_{n,j} \Big( D_n b_{h,j} + \sum\limits_{k=1}^K W_j^k \widehat{v}_n^{k} \Big) \right) \right) \\ = - \sum\limits_{k=1}^K \widehat{v}_n^{k} b_v^k - log\left( \sum\limits_{h_n} \prod\limits_{j=1}^F exp \Big[ h_{n,j} \Big( D_n b_{h,j} + \sum\limits_{k=1}^K W_j^k \widehat{v}_n^{k} \Big) \Big] \right) \\ \stackrel{2}{=} - \sum\limits_{k=1}^K \widehat{v}_n^{k} b_v^k - log\left( \prod\limits_{j=1}^F \sum\limits_{h_{n,j} \in \{0,1\}} exp \Big[ h_{n,j} \Big( D_n b_{h,j} + \sum\limits_{k=1}^K W_j^k \widehat{v}_n^{k} \Big) \Big] \right) \\ = - \sum\limits_{k=1}^K \widehat{v}_n^{k} b_v^k - \sum\limits_{j=1}^F log \left( \sum\limits_{h_{n,j} \in \{0,1\}} exp \left( h_{n,j} \Big( D_n b_{h,j} + \sum\limits_{k=1}^K W_j^k \widehat{v}_n^{k} \Big) \right) \right) \\ \stackrel{3}{=} -\sum\limits_{k=1}^K \widehat{v}_n^{k} b_v^{k} - \sum\limits_{j=1}^F log \left(1 + exp \Big( D_n b_{h,j} + \sum\limits_{k=1}^K \widehat{v}_n^{k} W_j^k \Big) \right)$

where in $1$ we use the fact that $\sum\limits_{k=1}^K \widehat{v}_n^{k} b_v^k$ is independent of the sum over $h_n$, in $2$ we swap the product and the sum, and in $3$ we use the fact that $h_{n,j} \in \{0,1\}$.
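As a sanity check on this closed form, a small NumPy sketch can compare it against the brute-force marginalization $-log \sum_{h_n} exp(-E(\mathbf{V}_n, h_n))$ over all $2^F$ hidden configurations (toy dimensions and random parameters; `np.logaddexp(0, x)` is a stable $log(1 + exp(x))$):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
K, F = 5, 3                               # vocabulary size, hidden units (toy values)
W  = rng.normal(scale=0.1, size=(F, K))   # W[j, k]
bv = rng.normal(scale=0.1, size=K)        # visible biases b_v
bh = rng.normal(scale=0.1, size=F)        # hidden biases b_h
v  = np.array([2., 0., 1., 3., 1.])       # word-count vector \hat v_n
D  = v.sum()                              # document length D_n

def joint_energy(v, h):
    # E(V_n, h_n) from equation (5): -h^T W v - v^T b_v - D * h^T b_h
    return -(h @ W @ v) - v @ bv - D * (h @ bh)

def free_energy(v):
    # closed form: -v^T b_v - sum_j softplus(D * b_hj + (W v)_j)
    return -(v @ bv) - np.sum(np.logaddexp(0.0, D * bh + W @ v))

# brute force: marginalize over all 2^F hidden configurations explicitly
brute = -np.log(sum(np.exp(-joint_energy(v, np.array(h, dtype=float)))
                    for h in product([0, 1], repeat=F)))
print(np.isclose(brute, free_energy(v)))  # True
```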

So finally, when we consider $N$ observations (here $N$ documents), the gradient of the negative log-likelihood is given approximately by:

$\dfrac{1}{N} \sum\limits_{n=1}^N \left( \dfrac{\partial \mathbf{E}(V_n^{(0)})}{\partial \mathbf{\theta}} - \dfrac{\partial \mathbf{E}(V_n^{(k)})}{\partial \mathbf{\theta}} \right)$

because actually $\sum\limits_{\mathbf{h}} p(\mathbf{h}|\mathbf{V}^{(0)}; \mathbf{\theta}) \dfrac{\partial}{\partial \theta} E(\mathbf{V}^{(0)}, \mathbf{h}, \mathbf{\theta}) = <\dfrac{\partial \mathbf{E}(V^{(0)})}{\partial \mathbf{\theta}}>$

Note: here the exponent $(0)$ means we consider the energy on the input data and the exponent $(k)$ means we consider the energy after $k$ steps of Gibbs sampling; $\mathbf{E}$ is the (marginalized) energy function and $\langle \cdot \rangle$ is the expectation.

So finally, when we differentiate $E(\mathbf{V_n}) = -\sum\limits_{k=1}^K \widehat{v}_n^{k} b_v^{k} - \sum\limits_{j=1}^F log \left(1 + exp \Big( D_nb_{h,j} + \sum\limits_{k=1}^K \widehat{v}_n^{k} W_j^k \Big) \right)$

with respect to $\mathbf{W}$, $\mathbf{b}_h$ and $\mathbf{b}_v$, we actually get the updates we have seen in the Python code, except for $\dfrac{\partial E(\mathbf{V}_n^{(0)})}{\partial \mathbf{b}_h}$, where the result is multiplied by $D_n$ due to the derivative of $x \rightarrow exp(ax + b)$.
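That extra factor $D_n$ on the $\mathbf{b}_h$ gradient can be checked numerically with central finite differences on the closed-form energy above (a NumPy sketch with toy dimensions and random parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
K, F = 4, 3                               # vocabulary size, hidden units (toy values)
W  = rng.normal(scale=0.1, size=(F, K))
bv = rng.normal(scale=0.1, size=K)
bh = rng.normal(scale=0.1, size=F)
v  = np.array([1., 2., 0., 2.])           # word counts \hat v_n
D  = v.sum()                              # document length D_n

def free_energy(bh_):
    # E(V_n) as a function of b_h only, everything else held fixed
    return -(v @ bv) - np.sum(np.logaddexp(0.0, D * bh_ + W @ v))

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

# analytic gradient: dE/db_hj = -D_n * sigma(D_n * b_hj + (W v)_j)
analytic = -D * sigma(D * bh + W @ v)

# central finite differences, one hidden unit at a time
eps = 1e-6
numeric = np.array([(free_energy(bh + eps * np.eye(F)[j])
                     - free_energy(bh - eps * np.eye(F)[j])) / (2 * eps)
                    for j in range(F)])
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

Dropping the $D$ from `analytic` makes the check fail, which is consistent with the chain rule argument above.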

So my questions are the following: did I make a mistake somewhere in my derivation? Actually, I don't think so, because the gradient computed by TensorFlow and the gradient I get from my updates agree. Here are the gradients I got (so you just need to subtract the gradient taken at $v_n^{(0)}$, where $v_n^{(0)}$ is the original data, from the gradient computed at $v_n^{(\star)}$, where $v_n^{(\star)}$ is the data we get after $k$ iterations of Gibbs sampling):

$\Delta \mathbf{W} = - \begin{pmatrix} \widehat{v}_n^{1} \\ \vdots \\ \widehat{v}_n^{K} \\ \end{pmatrix} \begin{pmatrix} \sigma(\mathbf{W}_1 \widehat{v}_n + D_n b_{h,1}) & \dots & \sigma(\mathbf{W}_F \widehat{v}_n + D_n b_{h,F}) \end{pmatrix}$

$\Delta \mathbf{b}_h = D_n \begin{pmatrix} \sigma(\mathbf{W}_1 \widehat{v}_n + D_n b_{h,1}) & \dots & \sigma(\mathbf{W}_F \widehat{v}_n + D_n b_{h,F}) \end{pmatrix}$

$\Delta \mathbf{b}_v = \widehat{v}_n$

So my result differs by a factor $D_n$ compared to what I've found in the literature. However, if I use the update I've just written, my RSM actually outputs very bad results (the same words for all the topics). On the contrary, if I take the code from the same Python file again, the results are great! I've tried to replace np.mean by np.sum, to multiply by $D_n$, and so on, but the results are always nonsense unless I don't multiply the gradient of $\mathbf{b}_h$ by $D_n$.

Note: in the code we actually have to sum up the contributions of each gradient because we are working on a batch of $N$ observations (here $N$ documents). The thing is, here again, if I actually use a sum, my gradients match the gradients that TensorFlow derives from the difference of my energy function taken at $v_n^{(0)}$ and at $v_n^{(\star)}$, but the results are really, really bad. On the contrary, if I use a mean ONLY on $\mathbf{b}_h$ and $\mathbf{b}_v$, the results are great...
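For what it's worth, on the sum-versus-mean question alone: the gradient of a summed batch loss is exactly $N$ times the gradient of the mean batch loss, so switching between them should only rescale the effective learning rate, not change the direction of the update (a trivial NumPy check with made-up per-document gradients):

```python
import numpy as np

# hypothetical per-document gradients stacked along axis 0 (N = 3 documents)
grads = np.array([[1., 2.], [3., 4.], [5., 6.]])
N = grads.shape[0]

# summing over the batch equals N times the batch mean
print(np.allclose(grads.sum(axis=0), N * grads.mean(axis=0)))  # True
```

So if sum and mean give qualitatively different models at the same learning rate, the discrepancy has to come from applying them inconsistently across parameters, as happens here with $\mathbf{b}_h$ and $\mathbf{b}_v$.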

I know it's a long post, but I've been struggling for days and I don't understand what's wrong...

Also, I've come up with a different energy function in TensorFlow that allows me to retrieve good results, but I don't understand why it works. Here is the energy function:

def energy(x, D, W, bv, bh):
    # x: rows of word counts \hat v_n; D: per-document lengths D_n
    batch_size = tf.cast(tf.shape(D)[0], dtype=tf.float32)
    # hidden-bias term divided by the batch size here, scaled back below
    u = tf.add(bh / batch_size, tf.matmul(x, W))
    # one energy per document: -x b_v^T - batch_size * sum_j softplus(u_j)
    return -tf.matmul(x, tf.transpose(bv)) - batch_size * tf.reduce_sum(tf.nn.softplus(u), 1, keepdims=True)

I actually seek to derive an energy function that will give me the proper gradient updates so I don't have to hard-code the updates. I need this because my RSM is only part of a larger model, and if I start hard-coding all the gradient updates of my RSM by hand in TensorFlow, I will also have to hand-code all the gradients flowing through the layers before my RSM, which I don't want.

Thank you in advance.