Maximize (but not minimize?) entropy for probability distribution


Let $p$ be an arbitrary discrete probability distribution on $\{1, \dots, 10\}$. We can represent such a distribution by a vector indexed by $i$, subject to the constraints $p_i \ge 0$ and $\sum^{10}_{i=1}p_i = 1$. I want to find analytically the $p$ with maximum entropy, where the entropy is given by

$$ H(p) = - \sum_{i=1}^{10} p_i \log(p_i) $$

My question consists of two parts:

  1. How do I show, using the Lagrange method, that the optimal $p$ is given by the uniform distribution?

  2. Why can't the Lagrange method be used to find the $p$ with minimal entropy (which I suppose is given by a Dirac distribution)?

For 1. I know that the Lagrangian function associated with this constrained problem is given by

$$ \mathcal{L}(p,\lambda) = - \sum_{i=1}^{10} p_i \log(p_i) + \lambda(1 - \sum^{10}_{i=1}p_i) $$

and the respective gradient by

$$ \frac{\partial \mathcal{L}}{\partial p_i} = - \log(p_i) - 1 - \lambda $$

and

$$ \frac{\partial \mathcal{L}}{\partial \lambda} = 1 - \sum^{10}_{i=1}p_i $$

but I don't know how to continue from there.
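
As a sanity check (separate from the analytic derivation I am after), numerical optimization does point to the uniform distribution. A minimal sketch, assuming NumPy and SciPy are available:

```python
# Maximize H(p) over the simplex numerically and see where the
# optimizer lands (SLSQP is chosen since an equality constraint is given).
import numpy as np
from scipy.optimize import minimize

n = 10

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)    # avoid log(0); 0*log(0) = 0 by convention
    return np.sum(p * np.log(p))  # minimizing -H(p) maximizes H(p)

constraint = {"type": "eq", "fun": lambda p: np.sum(p) - 1.0}
bounds = [(0.0, 1.0)] * n
p0 = np.random.dirichlet(np.ones(n))  # random feasible starting point

res = minimize(neg_entropy, p0, bounds=bounds, constraints=[constraint])
print(np.round(res.x, 4))     # ~ [0.1, 0.1, ..., 0.1]
print(-res.fun, np.log(n))    # both ~ 2.3026 = log(10)
```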



Best answer:
  2. Why can't the Lagrange method be used to find the $p$ with minimal entropy (which I suppose is given by a Dirac distribution)?

The KKT conditions, which can incorporate the nonnegativity constraints, are not necessary for optimality when the objective function is not continuously differentiable. Here the objective is not differentiable at points where some $p_i = 0$, so those points can never be found by this method (the stationarity condition is $\log(p_i) = \mu_i - \lambda - 1$, which no solution with $p_i = 0$ for some $i$ can satisfy).
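
Spelling this out: with a multiplier $\mu_i \ge 0$ for each constraint $p_i \ge 0$, the KKT conditions read

$$ -\log(p_i) - 1 - \lambda + \mu_i = 0, \qquad \mu_i \ge 0, \qquad \mu_i p_i = 0, \qquad p_i \ge 0, \qquad \sum^{10}_{i=1}p_i = 1, $$

and stationarity forces $p_i = e^{\mu_i - \lambda - 1} > 0$ for every $i$, so no point with a zero coordinate can satisfy the system.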

You can use the fact that the minimum of a concave function over a compact convex set is attained at an extreme point of the set (this is known as Bauer's maximum principle). The extreme points of the simplex are the vectors with $p_i = 1$ for some $i$ and $p_j = 0$ for $j \ne i$. Each of these points has entropy $0$, so by comparing the function values you find that they are all globally optimal.
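
A minimal sketch of that comparison (assuming NumPy), checking all 10 extreme points:

```python
# Evaluate H at the extreme points of the simplex (the Dirac
# distributions) and at an interior point for contrast.
import numpy as np

def entropy(p):
    nz = p > 0                        # convention: 0 * log(0) = 0
    return -np.sum(p[nz] * np.log(p[nz]))

n = 10
for i in range(n):
    dirac = np.zeros(n)
    dirac[i] = 1.0                    # extreme point e_i
    assert entropy(dirac) == 0.0      # every vertex attains H = 0

print(entropy(np.full(n, 1 / n)))     # log(10) ≈ 2.3026 in the interior
```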

  1. How do I show, using the Lagrange method, that the optimal $p$ is given by the uniform distribution?

You should do what Nown did to find candidate local optima. Then you should compare the resulting function value with the function values at all points where the objective is not differentiable (or where its derivative is not continuous), because the Lagrange method may not find those points. That is a lot of work here, because the derivative does not exist whenever $p_i = 0$ for at least one $i$. I would argue that the Lagrange method is simply not suitable for this problem.

Maximizing a symmetric concave function over the standard simplex is an easy task if you follow this answer.
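
For completeness, one standard version of that argument (which may or may not be the one in the linked answer): since $H$ is concave and invariant under permutations $\sigma$ of the coordinates, Jensen's inequality gives

$$ H(p) = \frac{1}{10!} \sum_{\sigma} H(p_\sigma) \le H\!\left(\frac{1}{10!} \sum_{\sigma} p_\sigma\right) = H\!\left(\tfrac{1}{10}, \dots, \tfrac{1}{10}\right), $$

because averaging all coordinate permutations of any point on the simplex yields the uniform distribution.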

Another answer:
  1. When applying the Lagrange method, you set the gradient to zero, which yields $-\log(p_i) - 1 - \lambda = 0$ and $1 - \sum^{10}_{i=1}p_i = 0$. The first equation gives $p_i = e^{-1-\lambda}$, the same value for every $i$, and the second is the normalization $\sum^{10}_{i=1}p_i = 1$. So we need $\sum^{10}_{i=1} e^{-1-\lambda} = 1$, that is, $10\,e^{-1-\lambda} = 1$. Substituting $e^{-1-\lambda} = \frac{1}{10}$ back into $p_i = e^{-1-\lambda}$ gives $p_i = \frac{1}{10}$, as expected (see the symbolic check after this list).

  2. You cannot use your Lagrangian to find the minimum, because the minima (which are indeed the 10 Dirac distributions) lie on the boundary of the constraints $p_i \geq 0$ and $p_i \leq 1$, and your Lagrangian knows nothing about these constraints.
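
As promised in point 1, here is a small symbolic check (a sketch assuming SymPy is available); by symmetry all coordinates share a single value $p$, so the stationarity system reduces to two equations:

```python
# Solve the stationarity condition -log(p) - 1 - lambda = 0 together
# with the normalization 10*p = 1 (all p_i are equal by symmetry).
import sympy as sp

p, lam = sp.symbols("p lambda", positive=True)
sol = sp.solve([-sp.log(p) - 1 - lam, 10 * p - 1], [p, lam], dict=True)
print(sol)  # [{p: 1/10, lambda: -1 + log(10)}]
```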

Update 1: the Lagrangian actually knows about the constraints $p_i \geq 0$, since it is not defined otherwise, but it knows nothing about the constraints $p_i \leq 1$. In fact, the Lagrangian dualizes only the equality constraint, so it is minimized over the set $\{(p_1, \dots, p_{10}) \in \mathbb{R}_+^{10}\}$, which is larger than the simplex $\{(p_1, \dots, p_{10}) \in [0,1]^{10} : \sum^{10}_{i=1}p_i = 1\}$, and the Dirac measures are not minimizers over this larger set.

Update 2 (following LinAlg's comment and answer): you could try to change your Lagrangian to take the inequalities into account, see for example these notes, but then you would run into trouble due to the fact that $x\mapsto x\log x$ is not differentiable at 0.
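
A tiny numerical illustration of that last point (assuming NumPy): the derivative of $x \log x$ is $\log(x) + 1$, which diverges as $x \to 0^+$, so no stationarity condition can hold at a boundary point.

```python
# d/dx [x*log(x)] = log(x) + 1 diverges to -infinity as x -> 0+.
import numpy as np

for x in [1e-1, 1e-4, 1e-8, 1e-16]:
    print(f"x = {x:.0e}, derivative = {np.log(x) + 1:.2f}")
```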