Lagrange multipliers for discrete random variables.


In "Elements of Information Theory" (Cover and Thomas), problem 2.30 reads:

Find the probability mass function $p(x)$ which maximizes the entropy $H(X)$ of a non-negative integer valued random variable $X$ subject to the constraint $$F(p)=E[X]=\sum_{n=0}^\infty np(n) = A$$

My approach (confirmed after some internet sleuthing for written solutions online) involves applying Lagrange multipliers to the function $$H(p)=-\sum_{n=0}^\infty p(n)\log p(n)$$ subject to the constraint above as well as $$G(p)=\sum_{n=0}^\infty p(n)=1.$$

That is, assuming there exist scalars $\lambda_1,\lambda_2$ such that $\nabla H(p)=\lambda_1\nabla F(p)+\lambda_2\nabla G(p)$, and using that (together with the constraints) to solve for $p$.

I'm worried about the formal validity of applying Lagrange multipliers in this case, since these functions actually live on the infinite-dimensional space $\mathbb{R}^{\mathbb{N}}$, and I don't think the usual arguments justifying the method apply there. I'm also confused as to why the solutions I've seen think it's sufficient to appeal to the concavity of $H(p)$ to argue that a single critical point must yield a global maximum. There are choices of $p$ for which $H(p)$ is infinite, so surely it isn't clear a priori that these constraints imply a single critical point is a global max.

My question is: is there a formally valid way to use Lagrange multipliers to solve this problem? In what generality can we apply such techniques on real-valued functions with domain $\mathbb{R}^{\mathbb{N}}$?


I was thinking about the same problem this week, and I think I found a way to justify the use of Lagrange multipliers, although there are steps I couldn't formally justify (and I hope someone can complete my answer).

Let $S$ be a Hilbert space over $\mathbb{R}$ and $J: S \to \mathbb{R}$ a functional to be maximized. Let $y \in S$ be a local maximum of $J$ and $\eta \in S$, $\epsilon \in \mathbb{R}$. We have:

Define $g(\epsilon) = J(y + \epsilon\eta)$. Then

\begin{equation} \label{eq:eq1} \tag{1} g(\epsilon) = g(0) + g'(0)\epsilon + o(\epsilon) \end{equation}

For $y$ to be a local maximum, we must have:

\begin{equation} \label{eq:eq2} \tag{2} g'(0) = \frac{\mathrm{d}}{\mathrm{d}\epsilon}J(y + \epsilon\eta) \bigg\rvert_{\epsilon=0} = \delta J_y(\eta) = 0 \end{equation}

Suppose we include a constraint $F: S \to \mathbb{R}$:

\begin{equation} \label{eq:eq3} \tag{3} F(y) = C \end{equation}

Equation \ref{eq:eq2} must be satisfied for every $\eta$ such that $F(y + \epsilon\eta) = C$. We have:

\begin{equation} \label{eq:eq4} \tag{4} \frac{\mathrm{d}}{\mathrm{d}\epsilon}F(y + \epsilon\eta) \bigg\rvert_{\epsilon=0} = \delta F_y(\eta) = 0 \end{equation}

Then, to find local maximum candidates we have to find values of $y$ such that:

\begin{equation} \label{eq:eq5} \tag{5} \delta J_y(\eta) = 0 \quad \forall \eta \mbox{ such that } \delta F_y(\eta) = 0 \end{equation}
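As a finite-dimensional sanity check of condition \eqref{eq:eq5} (a toy example I add for intuition, not part of the original problem): take $S = \mathbb{R}^2$, $J(x,y) = -(x^2 + y^2)$ and the constraint $F(x,y) = x + y = 1$. Then $\delta J_{(x,y)}(\eta) = -2(x\eta_1 + y\eta_2)$ and $\delta F_{(x,y)}(\eta) = \eta_1 + \eta_2$. Requiring $\delta J_{(x,y)}(\eta) = 0$ for every $\eta$ with $\eta_1 + \eta_2 = 0$ forces $x = y$, and the constraint then gives $x = y = 1/2$, exactly the point produced by the classical multiplier rule $\nabla J = \lambda \nabla F$.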

Returning to the entropy problem, let us regard the probability mass function $p = (p_0, p_1, \dots)$ as an element of the Hilbert space $\ell^2$ of square-summable real sequences, with inner product $\langle x,y\rangle= \sum_{i=0}^\infty x_i y_i$. (Every pmf lies in $\ell^2$, since $\sum_i p_i^2 \leq \sum_i p_i = 1$.)

We wish to maximize the entropy $H: \ell^2 \to \mathbb{R}$ (I take the absolute value of $p_i$ because $\ell^2$ contains sequences with negative entries):

\begin{equation} \label{eq:entropy} H(p) = -\sum_{i=0}^\infty |p_i| \log |p_i| \end{equation} subject to the constraints

\begin{equation} \label{eq:constraints} \begin{split} & F(p) = \sum_{i=0}^\infty p_i = 1 \\ & G(p) = \sum_{i=0}^\infty ip_i = A \end{split} \end{equation}

We take $q \in \ell^2$ and $\epsilon \in \mathbb{R}$ and we suppose that the optimum $p$ is such that $p_i > 0$ for all $i$ (and that $\epsilon$ is small enough to ensure that $p_i + \epsilon q_i \geq 0$ for all $i$). According to \eqref{eq:eq5}, we have:

\begin{equation} \begin{split} \delta H_p(q) &= \frac{\mathrm{d}}{\mathrm{d}\epsilon}H(p + \epsilon q) \bigg\rvert_{\epsilon=0} \\ &= - \frac{\mathrm{d}}{\mathrm{d}\epsilon} \left( \sum_{i=0}^\infty (p_i+\epsilon q_i) \log (p_i+\epsilon q_i) \right)\bigg\rvert_{\epsilon=0} \\ & = -\sum_{i=0}^\infty (1 + \log p_i)q_i = 0 \end{split} \end{equation} for every $q$ such that

\begin{equation} \begin{split} &\delta F_p(q) = \sum_{i=0}^\infty q_i = 0 \\ &\delta G_p(q) = \sum_{i=0}^\infty iq_i = 0 \end{split} \end{equation}

We conclude that the sequence $(1 + \log p_i)_{i \geq 0}$ must lie in the span of $(1, 1, 1, \dots)$ and $(0, 1, 2, \dots)$, i.e. $1 + \log p_i = \lambda_1 + i\lambda_2$ for real constants $\lambda_1$ and $\lambda_2$ (the Lagrange multipliers): $\delta H_p(q)$ must vanish for every $q$ orthogonal to those two sequences, so the "gradient" of $H$ at $p$ must be a linear combination of them. (Strictly speaking, neither $(1,1,\dots)$ nor $(0,1,2,\dots)$ belongs to $\ell^2$; this is one of the steps that still needs rigorous justification.) This way, we have:

\begin{equation} \delta H_p(q) = -\sum_{i=0}^\infty (1 + \log p_i)q_i = -\sum_{i=0}^\infty (\lambda_1 + i\lambda_2)q_i = -\lambda_1\sum_{i=0}^\infty q_i - \lambda_2\sum_{i=0}^\infty iq_i = 0 \end{equation}

Summing it up, if the probability mass function $p$ is a local maximum of $H(p)$, it must satisfy:

\begin{equation} \label{eq:conclusion} \begin{cases} &1 + \log p_i = \lambda_1 + i\lambda_2, \quad \forall i \in \mathbb{N} \\ &\sum_{i=0}^\infty p_i = 1\\ &\sum_{i=0}^\infty ip_i = A \end{cases} \end{equation}

This is the same as setting to zero the "gradient" of the Lagrangian:

\begin{equation} \label{eq:lagrange} L(p, \lambda_1, \lambda_2) = -\sum_{i=0}^\infty p_i \log p_i + \lambda_1\left( \sum_{i=0}^\infty p_i - 1\right) + \lambda_2\left( \sum_{i=0}^\infty ip_i - A\right) \end{equation}
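For completeness (this step is not spelled out above, but it is the standard conclusion), the system \eqref{eq:conclusion} can be solved explicitly. From $1 + \log p_i = \lambda_1 + i\lambda_2$ we get $p_i = e^{\lambda_1 - 1}(e^{\lambda_2})^i = c\,r^i$ with $c = e^{\lambda_1 - 1}$ and $r = e^{\lambda_2}$; summability forces $0 < r < 1$. The constraint $\sum_i p_i = 1$ gives $c = 1 - r$, and $\sum_i i p_i = r/(1-r) = A$ gives $r = A/(A+1)$. Hence

\begin{equation} p_i = \frac{1}{A+1}\left(\frac{A}{A+1}\right)^i, \end{equation}

a geometric distribution with mean $A$, whose entropy evaluates to $H = (A+1)\log(A+1) - A\log A$.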

As I said before, some steps still need to be justified more rigorously, and I am also not sure how to use the concavity of $H(p)$ to prove that the probability mass function found by this method is indeed the global optimum; I hope someone will be able to help me with that!
