I am following Yoshua Bengio's *Learning Deep Architectures for AI*, and on page 31 there is a phrase that confuses me.
Starting with Lemma 7.1 on the same page:
Lemma 7.1. Consider the Gibbs chain $x_1 \rightarrow h_1 \rightarrow x_2 \rightarrow h_2 ...$ starting at data point $x_1$. The log-likelihood can be expanded as follows for any path of the chain:
$\log P(x_1) = \log P(x_t) + \sum\limits_{s=1}^{t-1} \left( \log \frac{P(x_s|h_s)}{P(h_s|x_s)} + \log \frac{P(h_s|x_{s+1})}{P(x_{s+1}|h_s)} \right)$
and consequently, since this is true for any path:
$\log P(x_1) = E\left(\log P(x_t)\right) + \sum\limits_{s=1}^{t-1} E\left(\log \frac{P(x_s|h_s)}{P(h_s|x_s)} + \log \frac{P(h_s|x_{s+1})}{P(x_{s+1}|h_s)}\right)$
where the expectation is over the Markov chain, conditional on $x_1$.
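Since the identity in Lemma 7.1 holds pointwise for any path of the chain, it can be checked numerically on a toy model. Below is a minimal sketch (the random $3 \times 2$ joint distribution, the path, and all names are my own invented example, not from the paper): it builds a joint $P(x, h)$, fixes an arbitrary path, and verifies that the telescoping sum recovers $\log P(x_1)$.

```python
import math, random

# Hypothetical toy joint distribution P(x, h), x in {0,1,2}, h in {0,1}
random.seed(0)
w = [[random.random() for _ in range(2)] for _ in range(3)]
Z = sum(map(sum, w))
P = [[v / Z for v in row] for row in w]      # joint P[x][h]
Px = [sum(row) for row in P]                 # marginal P(x)
Ph = [sum(P[x][h] for x in range(3)) for h in range(2)]  # marginal P(h)

def Px_h(x, h): return P[x][h] / Ph[h]       # P(x | h)
def Ph_x(h, x): return P[x][h] / Px[x]       # P(h | x)

# An arbitrary path x_1 -> h_1 -> x_2 -> h_2 -> x_3 -> h_3 -> x_4
xs = [0, 2, 1, 0]                            # x_1 .. x_t with t = 4
hs = [1, 0, 1]                               # h_1 .. h_{t-1}

rhs = math.log(Px[xs[-1]])                   # log P(x_t)
for s in range(len(hs)):
    rhs += math.log(Px_h(xs[s], hs[s]) / Ph_x(hs[s], xs[s]))
    rhs += math.log(Ph_x(hs[s], xs[s + 1]) / Px_h(xs[s + 1], hs[s]))

print(abs(rhs - math.log(Px[xs[0]])))        # ~0: the expansion recovers log P(x_1)
```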
$P$ here is the distribution we want to sample from. This part is fine for me; the problem is what comes afterwards:
In the limit $t \rightarrow \infty$, the last term is just the entropy of distribution $P(x)$. **Note that the terms do not vanish as $t \rightarrow \infty$, so as such this expansion does not justify truncating the series to approximate the log-likelihood.**
The bold part is the weird one for me. I've made the following simplifications:
$E\left(\log \frac{P(x_s|h_s)}{P(h_s|x_s)} + \log \frac{P(h_s|x_{s+1})}{P(x_{s+1}|h_s)}\right)$
$= E\left(\log \left(\frac{P(x_s|h_s)}{P(h_s|x_s)} \frac{P(h_s|x_{s+1})}{P(x_{s+1}|h_s)}\right)\right)$
(by the sum-of-logarithms property)
$= E\left(\log \frac{P(x_s)}{P(x_{s+1})}\right)$
(by "unconditioning" via Bayes' rule: $\frac{P(x_s|h_s)}{P(h_s|x_s)} = \frac{P(x_s)}{P(h_s)}$ and $\frac{P(h_s|x_{s+1})}{P(x_{s+1}|h_s)} = \frac{P(h_s)}{P(x_{s+1})}$, so the $P(h_s)$ factors cancel)
$= E\left(\log P(x_s)\right) - E\left(\log P(x_{s+1})\right)$
(by reapplying logarithm's sum property in the opposite way and expectation's linearity)
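Each step above is an identity that holds pointwise, not only in expectation, so it can be checked exhaustively on a small discrete model. A minimal sketch, with a made-up random $3 \times 2$ joint distribution:

```python
import math, random

# Hypothetical random joint P(x, h) over x in {0,1,2}, h in {0,1}
random.seed(0)
w = [[random.random() for _ in range(2)] for _ in range(3)]
Z = sum(map(sum, w))
P = [[v / Z for v in row] for row in w]      # joint P[x][h]
Px = [sum(row) for row in P]                 # marginal P(x)
Ph = [sum(P[x][h] for x in range(3)) for h in range(2)]  # marginal P(h)

# For every triple (x_s, h_s, x_{s+1}), the product of conditional ratios
# equals P(x_s)/P(x_{s+1}): the P(h_s) factors cancel by Bayes' rule.
for a in range(3):                           # a plays the role of x_s
    for h in range(2):                       # h plays the role of h_s
        for b in range(3):                   # b plays the role of x_{s+1}
            ratio = (P[a][h] / Ph[h]) / (P[a][h] / Px[a]) \
                  * (P[b][h] / Px[b]) / (P[b][h] / Ph[h])
            assert math.isclose(ratio, Px[a] / Px[b])

print("pointwise identity holds for every (x_s, h_s, x_{s+1})")
```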
As $s \rightarrow \infty$ the chain converges to its stationary distribution, so:
$E(\log P(x_s)) = E(\log P(x_{s+1}) ) = E(\log P(x))$
and the term inside the summation becomes $0$. So, isn't it vanishing? Have I made any mistakes in my equations, or have I misinterpreted the document?
**Update:** I think I was misinterpreting the truncation. The objective here would be to approximate $\log P(x_1)$ by an early sample from the Gibbs chain, and this sampling takes the form of the expectation inside the summation. The problem is that the term $E\left(\log P(x_t)\right)$ does not vanish, but I still don't understand why we cannot estimate this term as well by a sample from the chain.
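To see both effects concretely, here is a small Monte Carlo sketch (the $2 \times 2$ joint distribution and all names are my own invented example, not from the paper): it runs many Gibbs chains from a fixed data point $x_1$, estimates $E(\log P(x_s))$ at each step, and shows that the successive differences inside the summation shrink toward $0$ while $E(\log P(x_t))$ settles near $-H(P(x))$, which does not vanish.

```python
import math, random

random.seed(1)
# Hypothetical 2x2 joint distribution P(x, h), invented for illustration
P = [[0.3, 0.1], [0.2, 0.4]]                 # P[x][h], entries sum to 1
Px = [sum(row) for row in P]                 # marginal P(x) = [0.4, 0.6]
Ph = [P[0][h] + P[1][h] for h in range(2)]   # marginal P(h) = [0.5, 0.5]

def bern(p):
    """Draw 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

T, N = 6, 100_000                            # chain length, number of chains
acc = [0.0] * T                              # running sums of log P(x_s)
for _ in range(N):
    x = 0                                    # every chain starts at the data point x_1 = 0
    for s in range(T):
        acc[s] += math.log(Px[x])
        h = bern(P[x][1] / Px[x])            # h_s ~ P(h | x_s)
        x = bern(P[1][h] / Ph[h])            # x_{s+1} ~ P(x | h_s)

E = [a / N for a in acc]                     # Monte Carlo E[log P(x_s)]
diffs = [E[s] - E[s + 1] for s in range(T - 1)]
H = -sum(p * math.log(p) for p in Px)        # entropy of the marginal P(x)
print([round(d, 3) for d in diffs])          # terms inside the sum shrink toward 0
print(round(E[-1], 3), round(-H, 3))         # E[log P(x_t)] stays near -H(P(x))
```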