On page 364 of Elements of Information Theory, 2nd edition, the set $E$ is defined as \begin{equation} E=\Bigl\{P:\sum_{a}P(a)g_j(a)\geq\alpha_j,j=1,2,\ldots,k\Bigr\}. \end{equation} To find the closest distribution in $E$ to $Q$, we minimize $D(P||Q)$ subject to the constraints. Using Lagrange multipliers, we construct the functional \begin{equation} J(P)=\sum_{x}P(x)\log \frac{P(x)}{Q(x)}+\sum_{i}\lambda_i\sum_{x}P(x)g_i(x)+v\sum_{x}P(x). \end{equation} Differentiating, we find the closest distribution to $Q$ to be of the form \begin{equation} P^*(x)=\frac{Q(x)e^{\sum_{i}\lambda_i g_i(x)}}{\sum_{a \in \mathcal{X}}Q(a)e^{\sum_{i}\lambda_i g_i(a)}}, \end{equation} where the constants $\lambda_i$ are chosen to satisfy the constraints.
There are two points in this procedure that I do not understand. First, why does the Lagrangian take the form $J(P)$ above, which differs from the form used in most of the material I have read, namely \begin{equation} L(P)=\sum_{x}P(x)\log \frac{P(x)}{Q(x)}+\sum_{i}\lambda_i\left(\sum_{x}P(x)g_i(x)-\alpha_i\right)+v\left(\sum_{x}P(x)-1\right)? \end{equation} Second, how is $P^*(x)$ actually computed from this $J(P)$? Can anyone offer the details?
Thanks a lot~
The terms omitted from the Lagrangian, $\sum_i\lambda_i\alpha_i$ and $v$, are constants with respect to $P$, so they drop out after differentiation. Moreover, the normalization that makes the $P(x)$ sum to 1 is unaffected by such constant factors, since the expression for $P^{\ast}(x)$ is homogeneous in the unnormalized weights.
In fact, even the $v\sum_{x}P(x)$ term could be omitted: it only changes the overall scale of the stationary point obtained via Lagrange, which can be fixed afterwards by normalizing the solution to sum to 1.
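To address the second point explicitly, write $\log$ for the natural logarithm (a change of base only rescales the multipliers), differentiate $J(P)$ with respect to each $P(x)$, and set the result to zero:
\begin{equation}
\frac{\partial J}{\partial P(x)}=\log \frac{P(x)}{Q(x)}+1+\sum_{i}\lambda_i g_i(x)+v=0,
\end{equation}
which gives
\begin{equation}
P(x)=Q(x)\,e^{-\sum_{i}\lambda_i g_i(x)}\,e^{-1-v}.
\end{equation}
Choosing $v$ so that $\sum_{x}P(x)=1$ forces $e^{-1-v}=1/\sum_{a\in\mathcal{X}}Q(a)e^{-\sum_{i}\lambda_i g_i(a)}$, and relabeling $\lambda_i\to-\lambda_i$ (harmless, since the $\lambda_i$ are determined only afterwards by the constraints) yields exactly the stated form of $P^*(x)$.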
As for computing $P^\ast$: the multipliers $\lambda_i$ rarely have a closed form, so they are typically found numerically by solving the constraint equations.
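To illustrate the numerical step, here is a minimal sketch with one constraint; the alphabet, $Q$, $g$, and $\alpha$ are made-up values. It uses the book's exponential form $P^*_\lambda(x)\propto Q(x)e^{\lambda g(x)}$ and root-finds the single multiplier $\lambda$ so that the constraint holds with equality:

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical setup: Q uniform on {0,1,2,3}, g(x) = x, constraint E_P[g] >= 2.
# Under Q, E[g] = 1.5 < 2, so the constraint is active and lambda > 0.
xs = np.arange(4)
Q = np.full(4, 0.25)
g = xs.astype(float)
alpha = 2.0

def tilted(lmbda):
    # Exponentially tilted distribution P*_lambda(x) = Q(x) e^{lambda g(x)} / Z
    w = Q * np.exp(lmbda * g)
    return w / w.sum()

def constraint_gap(lmbda):
    # E_{P*_lambda}[g] - alpha; its root is the lambda meeting the constraint
    return tilted(lmbda) @ g - alpha

lam = brentq(constraint_gap, -10.0, 10.0)  # 1-D root find for the multiplier
P_star = tilted(lam)                        # E_{P*}[g] equals alpha by construction
```

With several constraints, the same idea becomes a multi-dimensional root find (or, equivalently, maximizing the concave dual in the $\lambda_i$).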