Maximise entropy of random sample of random distribution


Let $X_1, \ldots, X_n$ be positive discrete r.v.s such that $p^j_k = \mathbb{P}(X_j=k)$. Let $N \in \{ 1, \ldots, n\}$ be a discrete r.v. with pmf $a_j=\mathbb{P}(N=j)$, independent of the $X_j$. Then $X_N$ is the random variable obtained by first sampling an index $j$ according to $a$ and then drawing from the distribution $p^j$.

Calculate $H(N, X_N)$ and, for $p^j_k$ fixed, find its maximum value and the value of $a_1, ..., a_n$ which achieves it.

I believe $$ H(N,X_N)=-\sum_{j=1}^n\sum_{k=1}^\infty a_jp^j_k\log(a_jp^j_k)$$
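This formula can be sanity-checked numerically: by the chain rule for entropy it should equal $H(N)+\sum_j a_j H(X_j)$. A minimal sketch with made-up pmfs (finite support chosen for simplicity):

```python
import numpy as np

# Hypothetical small example: n = 2 conditional pmfs with finite support.
p = np.array([[0.5, 0.5, 0.0],   # pmf of X_1
              [0.2, 0.3, 0.5]])  # pmf of X_2
a = np.array([0.4, 0.6])         # pmf of N

def entropy(q):
    q = q[q > 0]                 # convention: 0 log 0 = 0
    return -np.sum(q * np.log(q))

# Joint pmf of (N, X_N): entry (j, k) is a_j * p[j, k].
joint = a[:, None] * p
H_joint = entropy(joint.ravel())

# Chain rule: H(N, X_N) = H(N) + sum_j a_j H(X_j).
H_chain = entropy(a) + np.sum(a * np.array([entropy(row) for row in p]))

print(abs(H_joint - H_chain) < 1e-12)  # → True
```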

I would guess that this is maximised by $$a_j=\frac{H(X_j)}{\sum_{j=1}^nH(X_j)}$$ simply because it seems like giving a higher probability to the r.v.s with more entropy would maximise the entropy. How can I prove (or disprove) this, preferably using Lagrange multipliers?

Best answer:

Your problem can be stated as:

$$\max_{a\in K} \; -\sum_{j=1}^n \sum_{k=1}^{\infty} a_j p^j_k\log(a_j p^j_k)$$

where $K=\{a\in\mathbb{R}^n \mid a_1,\ldots,a_n\ge 0,\ a_1+\ldots+a_n=1\}$. This is a particularly well-behaved convex programme (the objective is concave and the constraints are affine), so the KKT conditions are both necessary and sufficient for optimality. Consider:

$$\begin{align} f(a)&=-\sum_{j=1}^n \sum_{k=1}^{\infty} a_j p^j_k\log(a_j p^j_k)\\ g_i(a)&=-a_i\\ h(a)&=a_1+\ldots+a_n-1\end{align}$$

Notice that (using $\sum_k p^j_k = 1$ and $H(X_j)=-\sum_k p^j_k\log p^j_k$):

$$\begin{align}\nabla f(a)&=\Bigl(-\sum_{k=1}^{\infty} p^j_k\bigl(1+\log(a_j p^j_k)\bigr)\Bigr)^T_{1\le j\le n}=(-1-\log a_j+H(X_j))^T_{1\le j\le n}\\ \nabla g_i(a)&=-e_i\\ \nabla h(a)&=(1,\ldots,1)^T \end{align}$$
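As a quick numerical sanity check (the pmfs below are made up for illustration), the components of $\nabla f$ can be compared against central finite differences:

```python
import numpy as np

# Made-up example: two conditional pmfs and an interior point a.
p = np.array([[0.5, 0.5],
              [0.1, 0.9]])
a = np.array([0.3, 0.7])

def f(a):
    # f(a) = -sum_{j,k} a_j p^j_k log(a_j p^j_k), with 0 log 0 = 0
    x = (a[:, None] * p).ravel()
    x = x[x > 0]
    return -np.sum(x * np.log(x))

# Entropies H(X_j) and the claimed gradient -1 - log a_j + H(X_j).
H = np.array([-np.sum(r[r > 0] * np.log(r[r > 0])) for r in p])
grad_analytic = -1 - np.log(a) + H

# Central finite differences along each coordinate.
eps = 1e-6
grad_fd = np.array([(f(a + eps * e) - f(a - eps * e)) / (2 * eps)
                    for e in np.eye(2)])

print(np.allclose(grad_analytic, grad_fd, atol=1e-6))  # → True
```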

Then at optimal $a^*$:

  • Stationarity: $-1-\log a^*_j+H(X_j)-\mu_j-\lambda=0$ for all $j$
  • Primal feasibility: $a^*_j\ge 0$ for all $j$ and $a^*_1+\ldots +a^*_n=1$
  • Dual feasibility: $\mu_j\ge 0$ for all $j$
  • Complementary slackness: $\mu_j a^*_j=0$ for all $j$

Stationarity gives $a^*_j=e^{H(X_j)-1-\lambda-\mu_j}>0$, so complementary slackness forces $\mu_j=0$ for every $j$, and hence $a^*_j=Ce^{H(X_j)}$ for some constant $C>0$. Normalising so the $a^*_j$ sum to $1$:

$$a^*_j=\frac{e^{H(X_j)}}{\sum_{i=1}^n e^{H(X_i)}}$$

is the optimal solution. Substituting back, the maximum value is $$H(N,X_N)=\log\sum_{j=1}^n e^{H(X_j)}.$$ In particular your guess $a_j\propto H(X_j)$ is not optimal: the optimal weights are proportional to $e^{H(X_j)}$, not to $H(X_j)$. You can reach the same result with plain Lagrange multipliers, but then you need to handle the boundary cases $a_j=0$ by hand.
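A quick numerical check of the conclusion, with made-up pmfs: the weights $a^*_j\propto e^{H(X_j)}$ achieve the value $\log\sum_j e^{H(X_j)}$ and beat both uniform weights and the $a_j\propto H(X_j)$ guess from the question:

```python
import numpy as np

# Made-up pmfs with entropies 0, log 2, log 3.
p = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0],
              [1/3, 1/3, 1/3]])

H = np.array([-np.sum(r[r > 0] * np.log(r[r > 0])) for r in p])

def joint_entropy(a):
    # H(N, X_N) for mixing weights a, with 0 log 0 = 0.
    x = (a[:, None] * p).ravel()
    x = x[x > 0]
    return -np.sum(x * np.log(x))

a_opt = np.exp(H) / np.sum(np.exp(H))   # claimed optimum (softmax of entropies)
a_guess = H / H.sum()                   # the a_j ∝ H(X_j) guess
a_unif = np.ones(3) / 3                 # uniform weights

v_opt = joint_entropy(a_opt)
print(np.isclose(v_opt, np.log(np.sum(np.exp(H)))))  # → True
print(v_opt >= joint_entropy(a_guess))               # → True
print(v_opt >= joint_entropy(a_unif))                # → True
```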