The objective is $\log P(X) - \mathcal{D}[Q(z|X)\Vert P(z|X)]=E_{z\sim Q}[\log P(X|z)] - \mathcal{D}[Q(z|X)\Vert P(z)]$.
Obviously, $\log P(X) \geq E_{z\sim Q}[\log P(X|z)]$ (by Jensen's inequality).
So we immediately get $\mathcal{D}[Q(z|X)\Vert P(z|X)] \geq \mathcal{D}[Q(z|X)\Vert P(z)]$, which says that $P(z|X)$ is further from $Q(z|X)$ than $P(z)$ is, under the usual interpretation of the KL divergence $\mathcal{D}$. This seems to go against intuition.
Why is that? How can it be explained?
Thank you.
The reference is the VAE tutorial, C. Doersch, "Tutorial on Variational Autoencoders" (2016).
Note that $P(X) = \int P(X\vert z) P(z)\, dz$, per Eq. 1. Thus, while $$ E_{z \sim Q}[\log P(X\vert z)]= \int Q(z) \log P(X\vert z)\, dz \le \log \left[\int Q(z) P(X\vert z)\, dz\right]$$ by Jensen's inequality, the right-hand side is $\log E_{z\sim Q}[P(X\vert z)]$, not $\log P(X)$: the expectation is taken under $Q(z)$, not under the prior $P(z)$. So the premise $\log P(X) \geq E_{z\sim Q}[\log P(X\vert z)]$ does not follow from Jensen's inequality, and it is in fact false in general.
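A quick numerical sanity check on a made-up two-state model (all numbers below are illustrative, not from the tutorial): if $Q$ concentrates on the $z$ with high likelihood $P(X|z)$, then $E_{z\sim Q}[\log P(X|z)]$ comes out *larger* than $\log P(X)$, so the claimed inequality fails:

```python
import math

# Toy discrete model: z in {0, 1}, observation X = 1 (illustrative numbers).
P_z = {0: 0.99, 1: 0.01}          # prior P(z): most mass on z = 0
P_X_given_z = {0: 0.01, 1: 0.99}  # likelihood P(X=1 | z): high only at z = 1

# Marginal likelihood P(X=1) = sum_z P(X=1|z) P(z) = 0.0198
P_X = sum(P_X_given_z[z] * P_z[z] for z in P_z)
log_P_X = math.log(P_X)  # about -3.92

# Choose Q(z) concentrated on z = 1, where the likelihood is high
Q = {0: 0.0, 1: 1.0}

# E_{z~Q}[log P(X|z)] = log 0.99, about -0.01: NOT below log P(X)
E_Q_logP = sum(Q[z] * math.log(P_X_given_z[z]) for z in Q if Q[z] > 0)

assert E_Q_logP > log_P_X  # the asker's inequality fails here
```

The expectation under $Q$ ignores how little prior mass sits on $z=1$, which is exactly why it can exceed $\log P(X)$.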
A good way to think of the objective in Eq. 4, $\log P(X) - \mathcal{D}[Q(z)\Vert P(z|X)]$, is that maximizing it both increases the likelihood of the data (the $\log P(X)$ term) and forces the approximate posterior $Q(z)$ to be close to the true posterior $P(z|X)$ (the KL-divergence term).