Gibbs sampling truncation for contrastive divergence


I am following Yoshua Bengio's *Learning Deep Architectures for AI*, and on page 31 there is a phrase that confuses me.

Starting from Lemma 7.1 on the same page:

Lemma 7.1. Consider the Gibbs chain $x_1 \rightarrow h_1 \rightarrow x_2 \rightarrow h_2 ...$ starting at data point $x_1$. The log-likelihood can be expanded as follows for any path of the chain:

$\log P(x_1) = \log P(x_t) + \sum\limits_{s=1}^{t-1} \left(\log \frac{P(x_s|h_s)}{P(h_s|x_s)} + \log \frac{P(h_s|x_{s+1})}{P(x_{s+1}|h_s)}\right)$

and consequently, since this is true for any path:

$\log P(x_1) = E\left(\log P(x_t)\right) + \sum\limits_{s=1}^{t-1} E\left(\log \frac{P(x_s|h_s)}{P(h_s|x_s)} + \log \frac{P(h_s|x_{s+1})}{P(x_{s+1}|h_s)}\right)$

where the expectation is over the Markov chain, conditional on $x_1$.
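As a sanity check, the lemma's path-wise identity can be verified numerically. The sketch below uses a hypothetical toy joint distribution $P(x,h)$ over binary $x$ and $h$ (the 2×2 `joint` matrix is made up for illustration, not taken from the book), runs one Gibbs path, and confirms that the right-hand side equals $\log P(x_1)$ exactly for that path:

```python
import numpy as np

# Hypothetical toy joint P(x, h) over binary x, h (rows: x, cols: h).
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
Px = joint.sum(axis=1)          # marginal P(x)
Ph = joint.sum(axis=0)          # marginal P(h)
P_x_given_h = joint / Ph        # P(x|h); each column sums to 1
P_h_given_x = (joint.T / Px).T  # P(h|x); each row sums to 1

rng = np.random.default_rng(0)
t = 5
xs, hs = [0], []                # fixed starting point x_1 = 0
for _ in range(t - 1):
    h = rng.choice(2, p=P_h_given_x[xs[-1]])   # sample h_s ~ P(h|x_s)
    hs.append(h)
    xs.append(rng.choice(2, p=P_x_given_h[:, h]))  # x_{s+1} ~ P(x|h_s)

# Right-hand side of Lemma 7.1 along this particular path.
rhs = np.log(Px[xs[-1]])
for s in range(t - 1):
    rhs += np.log(P_x_given_h[xs[s], hs[s]] / P_h_given_x[xs[s], hs[s]])
    rhs += np.log(P_h_given_x[xs[s + 1], hs[s]] / P_x_given_h[xs[s + 1], hs[s]])

print(np.isclose(rhs, np.log(Px[0])))  # True for every sampled path
```

Since the identity holds for every individual path, taking the expectation over the chain (conditional on $x_1$) gives the second equation directly.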

$P$ here is the distribution from which we want to sample. This is ok for me, the problem is what comes afterwards:

In the limit $t \rightarrow \infty$, the last term is just the entropy of distribution $P(x)$. **Note that the terms do not vanish as $t \rightarrow \infty$**, so as such this expansion does not justify truncating the series to approximate the log-likelihood.

The bold part is the weird one for me. I've made the following simplifications:

$E\left(\log \frac{P(x_s|h_s)}{P(h_s|x_s)} + \log \frac{P(h_s|x_{s+1})}{P(x_{s+1}|h_s)}\right)$

$= E\left(\log \left(\frac{P(x_s|h_s)}{P(h_s|x_s)} \frac{P(h_s|x_{s+1})}{P(x_{s+1}|h_s)}\right)\right)$

(by applying sums of logarithms property)

$= E\left(\log \frac{P(x_s)}{P(x_{s+1})}\right)$

(by "unconditioning" via Bayes' rule: $\frac{P(x_s|h_s)}{P(h_s|x_s)} = \frac{P(x_s)}{P(h_s)}$ and $\frac{P(h_s|x_{s+1})}{P(x_{s+1}|h_s)} = \frac{P(h_s)}{P(x_{s+1})}$, so the $P(h_s)$ factors cancel)

$= E\left(\log P(x_s)\right) - E\left(\log P(x_{s+1})\right)$

(by reapplying logarithm's sum property in the opposite way and expectation's linearity)
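The "unconditioning" step above is a pointwise identity that follows from Bayes' rule, so it can be checked exhaustively on a small example. The sketch below reuses a hypothetical toy joint $P(x,h)$ (the `joint` matrix is invented for illustration) and verifies the identity for every triple $(x_s, h_s, x_{s+1})$:

```python
import numpy as np

# Hypothetical toy joint P(x, h) over binary x, h (rows: x, cols: h).
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
Px = joint.sum(axis=1)          # marginal P(x)
Ph = joint.sum(axis=0)          # marginal P(h)
P_x_given_h = joint / Ph        # P(x|h)
P_h_given_x = (joint.T / Px).T  # P(h|x)

# Check P(x_s|h_s)/P(h_s|x_s) * P(h_s|x_{s+1})/P(x_{s+1}|h_s) = P(x_s)/P(x_{s+1})
# for every possible triple (x_s, h_s, x_{s+1}).
for a in range(2):          # x_s
    for h in range(2):      # h_s
        for b in range(2):  # x_{s+1}
            lhs = (P_x_given_h[a, h] / P_h_given_x[a, h]) \
                * (P_h_given_x[b, h] / P_x_given_h[b, h])
            assert np.isclose(lhs, Px[a] / Px[b])
print("identity holds pointwise")
```

Because the identity holds pointwise, it also holds inside the expectation over the chain, which is exactly how it is used above.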

As $s \rightarrow \infty$ the chain converges:

$E(\log P(x_s)) = E(\log P(x_{s+1}) ) = E(\log P(x))$

and the term inside the summation becomes 0. So, isn't it vanishing? Have I made any mistakes in my equations, or have I misinterpreted the document?

Update: I think I was misinterpreting the truncation. The objective is to approximate $\log P(x_1)$ using an early sample from the Gibbs chain, and this sampling takes the form of the expectation inside the summation. The problem is that the term $E\left(\log P(x_t)\right)$ does not vanish, but I still don't understand why we cannot also estimate this term with a sample from the chain.
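To illustrate that $E\left(\log P(x_t)\right)$ does not vanish: as the quoted passage says, in the limit it equals $E_{P}\left(\log P(x)\right) = -H(P(x))$, which is strictly negative for any non-degenerate $P$. A Monte Carlo sketch on a hypothetical toy joint (the `joint` matrix is invented for illustration) shows the estimate settling near the negative entropy rather than near 0:

```python
import numpy as np

# Hypothetical toy joint P(x, h) over binary x, h (rows: x, cols: h).
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
Px = joint.sum(axis=1)                  # marginal P(x)
P_x_given_h = joint / joint.sum(axis=0) # P(x|h)
P_h_given_x = (joint.T / Px).T          # P(h|x)

rng = np.random.default_rng(0)
n_chains, t = 20000, 10
x = np.zeros(n_chains, dtype=int)       # all chains start at x_1 = 0
for _ in range(t - 1):
    # One Gibbs step, vectorized over all chains: h_s ~ P(h|x_s), x_{s+1} ~ P(x|h_s).
    h = (rng.random(n_chains) < P_h_given_x[x, 1]).astype(int)
    x = (rng.random(n_chains) < P_x_given_h[1, h]).astype(int)

estimate = np.log(Px[x]).mean()         # Monte Carlo estimate of E[log P(x_t)]
neg_entropy = (Px * np.log(Px)).sum()   # -H(P(x))
print(estimate, neg_entropy)            # close to each other, clearly nonzero
```

This only illustrates the book's claim that the term converges to the (negative) entropy rather than to 0; it does not by itself answer whether the term could be estimated from a chain sample.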