From Boyd & Vandenberghe's Convex Optimization:
Why is the $-u_i + v_i$ term there? Is this the original definition of the KL-divergence, and do these terms cancel out for probability distributions, which is why they are usually absent?

Suppose $\mu$ and $\nu$ are finite measures on $(X,\mathscr{F})$ and that $\nu\ll \mu$, and assume $\nu(X)>0$. Define
\begin{align} H(\nu|\mu)&:=\int_X\log\big(\frac{d\nu}{d\mu}\big)\,d\nu-\big(\mu(X)-\nu(X)\big)\\ &=\int_X\Big(\log\big(\frac{d\nu}{d\mu}\big)\frac{d\nu}{d\mu}-1+\frac{d\nu}{d\mu}\Big)\,d\mu \end{align} Notice that if $\mu(X)=\nu(X)$, this definition reduces to the usual relative entropy (KL-divergence) for probability measures.
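For discrete measures this definition is easy to check numerically. The following sketch (the function name `gen_kl` is mine) computes $H(\nu|\mu)$ for nonnegative weight vectors and verifies that when both measures have equal total mass the extra term $-(\mu(X)-\nu(X))$ vanishes, recovering the standard KL divergence:

```python
import numpy as np

def gen_kl(nu, mu):
    """Generalized KL divergence for discrete measures (nu << mu assumed):
    H(nu|mu) = sum_i nu_i * log(nu_i / mu_i) - (mu(X) - nu(X))."""
    nu, mu = np.asarray(nu, float), np.asarray(mu, float)
    mask = nu > 0  # points with nu_i = 0 contribute 0 to the integral
    return np.sum(nu[mask] * np.log(nu[mask] / mu[mask])) - (mu.sum() - nu.sum())

# Equal total mass (here both are probability vectors): the correction
# term is zero and gen_kl agrees with the usual KL divergence.
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
std_kl = np.sum(p * np.log(p / q))
print(abs(gen_kl(p, q) - std_kl))  # ~0 (floating-point error only)
```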
Here is a relation between this more general notion of KL-divergence and the one for probability measures.
Consider the normalized measures $\overline{\mu}=\frac{1}{\mu(X)}\mu$ and $\overline{\nu}=\frac{1}{\nu(X)}\nu$. Then \begin{align} 0\leq H(\overline{\nu}|\overline{\mu})&=\frac{1}{\mu(X)}\left(\int_X\log\big(\frac{\mu(X)}{\nu(X)}\big) \frac{\mu(X)}{\nu(X)}\frac{d\nu}{d\mu}\,d\mu+\int_X\log\big(\frac{d\nu}{d\mu}\big)\frac{\mu(X)}{\nu(X)}\frac{d\nu}{d\mu}\,d\mu\right)\\ &=\log\big(\frac{\mu(X)}{\nu(X)}\big)+\frac{1}{\nu(X)}\int_X\log\big(\frac{d\nu}{d\mu}\big)\frac{d\nu}{d\mu}\,d\mu\\ &=\log\big(\frac{\mu(X)}{\nu(X)}\big)+\frac{1}{\nu(X)}\Big(H(\nu|\mu)+\mu(X)-\nu(X)\Big) \end{align} Hence \begin{align} H(\nu|\mu)&=\nu(X)\Big(H(\overline{\nu}|\overline{\mu})-\log(\mu(X)/\nu(X))\Big)-(\mu(X)-\nu(X))\\ &\geq-\nu(X)\Big(\log(\mu(X)/\nu(X))+\big(\frac{\mu(X)}{\nu(X)}-1\big)\Big) \end{align}
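The identity above can be checked numerically for discrete measures; this sketch (with an assumed helper `gen_kl` implementing the generalized divergence) draws two random finite measures and compares both sides:

```python
import numpy as np

def gen_kl(nu, mu):
    # Generalized KL: sum nu_i*log(nu_i/mu_i) - (mu(X) - nu(X)),
    # assuming all weights are strictly positive here.
    return np.sum(nu * np.log(nu / mu)) - (mu.sum() - nu.sum())

rng = np.random.default_rng(0)
nu = rng.uniform(0.1, 2.0, size=5)  # arbitrary finite measures, nu << mu
mu = rng.uniform(0.1, 2.0, size=5)

# Normalize to probability measures nubar, mubar.
nubar, mubar = nu / nu.sum(), mu / mu.sum()

# H(nu|mu) = nu(X)*(H(nubar|mubar) - log(mu(X)/nu(X))) - (mu(X) - nu(X))
lhs = gen_kl(nu, mu)
rhs = nu.sum() * (gen_kl(nubar, mubar) - np.log(mu.sum() / nu.sum())) \
      - (mu.sum() - nu.sum())
print(abs(lhs - rhs))  # agrees up to floating-point error
```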
If $\mu(X)\leq\nu(X)$, the lower bound is nonnegative (since $\log x + x - 1\leq 0$ for $x=\mu(X)/\nu(X)\leq1$), so $H(\nu|\mu)\geq0$. If $\mu(X)>\nu(X)$, the lower bound is negative, and $H(\nu|\mu)$ can indeed be negative: taking $\nu=c\,\mu$ with $c\in(0,1)$ gives $H(\nu|\mu)=\mu(X)\big(c\log c+c-1\big)<0$.
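The negativity of $H(\nu|\mu)$ when $\mu(X)>\nu(X)$ can be illustrated with the scaled measure $\nu=c\,\mu$, $c\in(0,1)$ (again using an assumed helper `gen_kl` for the generalized divergence):

```python
import numpy as np

def gen_kl(nu, mu):
    # Generalized KL for strictly positive discrete weights.
    return np.sum(nu * np.log(nu / mu)) - (mu.sum() - nu.sum())

mu = np.array([1.0, 2.0, 3.0])
c = 0.5
nu = c * mu  # then mu(X) > nu(X)

# Closed form for this case: H(c*mu | mu) = mu(X)*(c*log(c) + c - 1),
# which is negative for every c in (0, 1).
print(gen_kl(nu, mu))                        # negative
print(mu.sum() * (c * np.log(c) + c - 1))    # same value
```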