Find closest distribution to minimize KL divergence


On page 364 of Elements of information theory, 2nd edition, the set $E$ is defined as \begin{equation} E=\Bigl\{P:\sum_{a}P(a)g_j(a)\geq\alpha_j,j=1,2,\ldots,k\Bigr\}. \end{equation} To find the closest distribution in $E$ to $Q$, we minimize $D(P||Q)$ subject to the constraints. Using Lagrange multipliers, we construct the functional \begin{equation} J(P)=\sum_{x}P(x)\log \frac{P(x)}{Q(x)}+\sum_{i}\lambda_i\sum_{x}P(x)g_i(x)+v\sum_{x}P(x). \end{equation} Then we differentiate and find that the closest distribution to $Q$ has the form \begin{equation} P^*(x)=\frac{Q(x)e^{\sum_{i}\lambda_i g_i(x)}}{\sum_{a \in \mathcal{X}}Q(a)e^{\sum_{i}\lambda_i g_i(a)}}, \end{equation} where the constants $\lambda_i$ are chosen to satisfy the constraints.
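To make the formula for $P^*$ concrete, here is a small sketch (all names are illustrative, not from the book) that computes the tilted distribution $P^*(x)\propto Q(x)e^{\sum_i\lambda_i g_i(x)}$ over a finite alphabet for given multipliers $\lambda_i$:

```python
import math

def tilted_distribution(Q, g_list, lambdas):
    """Return P*(x) proportional to Q(x) * exp(sum_i lambda_i * g_i(x)).

    Q: dict mapping alphabet symbols to probabilities.
    g_list: list of functions g_i on the alphabet.
    lambdas: list of multipliers lambda_i, same length as g_list.
    (Illustrative helper, assuming a finite alphabet.)
    """
    weights = {x: q * math.exp(sum(l * g(x) for l, g in zip(lambdas, g_list)))
               for x, q in Q.items()}
    Z = sum(weights.values())  # normalizing constant (the denominator in P*)
    return {x: w / Z for x, w in weights.items()}
```

With all $\lambda_i=0$ this returns $Q$ itself; a positive $\lambda_i$ tilts the mass toward symbols where $g_i$ is large.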

There are two points in this procedure that I do not understand. First, why does the Lagrangian take the form $J(P)$ above? It differs from the form given in most of the materials I have read, namely \begin{equation} L(P)=\sum_{x}P(x)\log \frac{P(x)}{Q(x)}+\sum_{i}\lambda_i\left(\sum_{x}P(x)g_i(x)-\alpha_i\right)+v\left(\sum_{x}P(x)-1\right). \end{equation} Second, how is $P^*(x)$ derived from this $J(P)$? Can anyone provide the details?

Thanks a lot~




The terms omitted from the Lagrangian, $-\sum_i\lambda_i\alpha_i$ and $-v$, are constants with respect to $P(x)$, so they drop out after differentiation. Likewise, enforcing that the $P(x)$ sum to $1$ is unaffected by those extra terms, since the expression for $P^{\ast}(x)$ is normalized by its denominator anyway.

In fact, even the $v\sum_{x}P(x)$ term could be omitted, since the stationary-point solution obtained via the Lagrangian can simply be rescaled to sum to $1$.

As for computing $P^\ast$: the multipliers $\lambda_i$ are typically found by numerical methods, since the constraint equations rarely admit a closed-form solution.


I have completed the derivation. As the previous answerer stated, the terms omitted from the Lagrangian drop out after differentiation, which gives $$ \frac{\partial J}{\partial P(x)}=\log \frac{P(x)}{Q(x)}+1+\sum_{i} \lambda_ig_i(x)+v=0. $$ Solving this set of equations, we obtain $$ \frac{P(x)}{Q(x)e^{-\sum_{i} \lambda_ig_i(x)}}=e^{-(1+v)}. $$ Hence $P(x)$ is proportional to $Q(x)e^{-\sum_{i} \lambda_ig_i(x)}$. Imposing the constraint $\sum_{x} P(x)=1$, we obtain $$ P(x)=\frac{Q(x)e^{-\sum_{i} \lambda_ig_i(x)}}{\sum_{a}Q(a)e^{-\sum_{i} \lambda_ig_i(a)}}, $$ where $-\lambda_i$ can be relabeled as $\lambda_i$ without affecting the result.
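The stationarity condition can be checked numerically: at $P(x)\propto Q(x)e^{-\lambda g(x)}$, the quantity $\log\frac{P(x)}{Q(x)}+\lambda g(x)$ equals the same constant $-\log Z$ for every $x$, so the gradient is constant across the alphabet, as the equation $\partial J/\partial P(x)=0$ requires. A minimal check with made-up numbers (one constraint, a three-symbol alphabet):

```python
import math

# Illustrative example: Q and g are arbitrary choices, not from the book.
Q = {0: 0.5, 1: 0.3, 2: 0.2}
g = lambda x: float(x)
lam = 0.7

# Closed-form solution P(x) proportional to Q(x) e^{-lam g(x)}.
w = {x: q * math.exp(-lam * g(x)) for x, q in Q.items()}
Z = sum(w.values())
P = {x: v / Z for x, v in w.items()}

# log(P/Q) + lam*g(x) should equal -log Z for every x.
grads = [math.log(P[x] / Q[x]) + lam * g(x) for x in Q]
```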