Two questions about minimizing KL-divergence


I have two questions on the following lemma.

[Image of the lemma and its proof (not reproduced here).]

1. How did we get the last inequality? The author seems to be using $\int p_{\theta_0}\,d\mu = 1$, but I don't see why that is true.

2. Is this lemma overcomplicating things? I think it is saying that the KL divergence is uniquely minimized at the true parameter $\theta_0$ when the true model is identifiable. I think we can prove this more easily using Jensen's inequality, as in Section 5.2 of these notes.
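For comparison, here is a sketch of the Jensen-style argument I have in mind (my own summary, not the author's proof; it assumes $p_\theta$, $p_{\theta_0}$ are densities with respect to $\mu$ and $\int p_\theta\,d\mu \leq 1$):
$$D(p_{\theta_0}\parallel p_\theta) = -\int p_{\theta_0}\log\frac{p_\theta}{p_{\theta_0}}\,d\mu \;\geq\; -\log\int p_{\theta_0}\,\frac{p_\theta}{p_{\theta_0}}\,d\mu \;=\; -\log\int p_\theta\,d\mu \;\geq\; 0,$$
where the first inequality is Jensen's applied to the concave function $\log$. By strict concavity, equality forces $p_\theta = p_{\theta_0}$ $\mu$-a.e., and identifiability then gives $\theta = \theta_0$.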

Accepted answer:
  1. The $p_{\theta}$ are "subprobability densities", meaning that $\int p_{\theta}\,\mathrm{d}\mu\leq 1$ for all $\theta\in\Theta$. Then $$\begin{split} 2\int\sqrt{p_{\theta}p_{\theta_0}}\,\mathrm{d}\mu -2 &\leq 2\int\sqrt{p_{\theta}p_{\theta_0}}\,\mathrm{d}\mu - \int p_{\theta}\,\mathrm{d}\mu -\int p_{\theta_0}\,\mathrm{d}\mu \\ &=-\int\left(p_{\theta}-2\sqrt{p_{\theta}p_{\theta_0}}+p_{\theta_0}\right)\mathrm{d}\mu\\ &=-\int\left(\sqrt{p_{\theta}}-\sqrt{p_{\theta_0}}\right)^2\mathrm{d}\mu \end{split}$$
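As a quick numerical sanity check (my own sketch, not part of the answer), the inequality chain above can be verified on random discrete subprobability vectors, where integrals become sums:

```python
import numpy as np

rng = np.random.default_rng(0)

for _ in range(1000):
    # Random discrete "subprobability densities": nonnegative with total mass <= 1.
    p = rng.random(6); p *= rng.random() / p.sum()
    q = rng.random(6); q *= rng.random() / q.sum()
    # Left side: 2 * integral of sqrt(p * q) minus 2.
    lhs = 2 * np.sum(np.sqrt(p * q)) - 2
    # Right side: minus the integral of (sqrt(p) - sqrt(q))^2.
    rhs = -np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
    # The gap rhs - lhs equals 2 - sum(p) - sum(q) >= 0 for subprobabilities.
    assert lhs <= rhs + 1e-12
print("inequality held in all 1000 trials")
```

Equality holds exactly when both vectors have total mass $1$, which is why the bound is tight for genuine probability densities.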

  2. I agree; the "right" way to prove this result is to exploit the theory of convex functions. Perhaps the author was trying to borrow intuition from the theory of inner product spaces?

It might be worth mentioning that the proof of this lemma gives the scholium $$H(P,Q)^2\leq D(Q\parallel P)$$ where $P$ and $Q$ are probability distributions, $H$ is the Hellinger distance, and $D$ is the KL divergence.
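The scholium can also be checked numerically (my own sketch; it assumes the convention $H(P,Q)^2 = \int(\sqrt{p}-\sqrt{q})^2\,\mathrm{d}\mu$, without a factor of $1/2$):

```python
import numpy as np

rng = np.random.default_rng(0)

def hellinger_sq(p, q):
    # Squared Hellinger distance with the convention
    # H(P,Q)^2 = integral of (sqrt(p) - sqrt(q))^2  (no 1/2 factor).
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def kl(q, p):
    # KL divergence D(Q || P) for discrete distributions with full support.
    return np.sum(q * np.log(q / p))

for _ in range(1000):
    p = rng.random(5); p /= p.sum()
    q = rng.random(5); q /= q.sum()
    assert hellinger_sq(p, q) <= kl(q, p) + 1e-12
print("H(P,Q)^2 <= D(Q||P) held in all trials")
```

With the more common convention that includes the factor $1/2$, the bound sharpens to $H(P,Q)^2 \leq \frac{1}{2} D(Q\parallel P)$.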