Let $\mathcal P$ be the set of probability mass functions (pmfs) on $\mathbb Z_{>0}$, i.e. for $p=(p(x))_{x\in\mathbb Z_{>0}}\in\mathcal P$ we have $p\ge 0$ and $\sum_{x=1}^\infty p(x)=1$. Let $H(p)=-\sum_{x=1}^\infty p(x)\ln(p(x))$ be the entropy and $E(p)=\sum_{x=1}^\infty p(x)x$ the expectation. Further, let $s(p)=|\{x\in\mathbb Z_{>0}:p(x)>0\}|$ be the size of the support of $p$.
For $s\in\mathbb Z_{>0}$ let $\mathcal P_s=\{p\in\mathcal P:s(p)=s\}$, and further let $\mathcal P_\infty=\{p\in\mathcal P:s(p)=\infty,E(p)<\infty\}$.
Question: What are the best bounds for $r(s)=\sup_{p\in\mathcal P_s}H(p)/E(p)$?
Motivation: I want good upper bounds for the entropy in terms of the support size and the expectation.
Background: The question only makes sense if we consider strictly positive random variables, otherwise $r(s)$ would be infinite, which can be seen by taking limits towards the one-point mass on $0$.
For this question we can assume that $p$ is supported on $\{1,\dots,s\}$, respectively on $\mathbb Z_{>0}$ for $s=\infty$, and non-increasing, since arranging the weights in non-increasing order minimizes the expectation while preserving the entropy.
For $s<\infty$ we have $r(s)\le\ln(s)$, since $H(p)\le\ln(s)$ with equality exactly for the uniform distribution, and $E(p)\ge 1$. Of course, with the uniform distribution we also get a lower bound, namely $r(s)\ge 2\ln(s)/(s+1)$, which is not tight because the entropy is stationary at the uniform distribution while the expectation is not. Also, the supremum is attained, by continuity (and compactness of the set of pmfs on $\{1,\dots,s\}$). Of course, identifying the maximizers would be highly desirable.
For $s=\infty$ it is known that $H(p)=E(p)=\infty$ is possible, as discussed here. As can be seen here, there are also quite a few follow-up questions. Unfortunately, I am not convinced by the given answer, and am still not aware of an answer to the question if $H(p)<\infty$ for all $p\in\mathcal P_\infty$. Should this be true, we may of course still have $r(\infty)=\infty$, and in any case explicit maximizing sequences would be highly desirable.
Finally, a similar question regarding a lower bound can be found here.
Update: A limiting argument directly yields that $r(s+1)\ge r(s)$. As discussed here, we have $H(p)\le\ln(E(p)+0.5)+1$ given by Theorem 8 in this preprint. The map $f(x)=(\ln(x+0.5)+1)/x$ is decreasing on $[1,\infty)$, with $f(x)=1$ for $x\approx 1.858$. Since we can assume that $p$ is non-increasing, we have $E(p)\le\frac{1+s}{2}$.
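For what it's worth, the crossing point can be pinned down numerically; here is a quick Python bisection (an illustrative sketch, the names are mine), using that $f$ is decreasing on $[1,\infty)$:

```python
import math

# f(x) = (ln(x + 0.5) + 1)/x, from the bound H(p) <= ln(E(p) + 0.5) + 1
def f(x):
    return (math.log(x + 0.5) + 1.0) / x

# bisection for the crossing f(x) = 1 on [1, 3]; f(1) > 1 > f(3)
lo, hi = 1.0, 3.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if f(mid) > 1.0:
        lo = mid  # f is decreasing, so the root lies above mid
    else:
        hi = mid
root = 0.5 * (lo + hi)
print(round(root, 3))  # ≈ 1.858
```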
We clearly have $r(1)=0$. For $s=2$, maximizing $f(p_1)=\frac{H(p_1)}{p_1+2(1-p_1)}=\frac{H(p_1)}{2-p_1}$, where $H(p_1)=-p_1\ln(p_1)-(1-p_1)\ln(1-p_1)$ is the binary entropy, gives the maximizer $p_1=\frac{1}{2}(\sqrt 5-1)\approx 0.618$, the expectation $E(p)\approx 1.382$ and $r(2)\approx 0.481$. For $s=3$ we fix $\mu\in(1,2]$ and consider $p_1\ge p_2\ge p_3\ge 0$ with $p_1+p_2+p_3=1$ and $E(p)=\mu$. Set $p_3=x$ and observe that $p_1=2-\mu+x$, $p_2=\mu-1-2x$, and that $\max(0,\frac{2}{3}\mu-1)\le x\le\frac{1}{3}(\mu-1)$. The derivative $\ln(\frac{p_2^2}{p_1p_3})$ of the entropy on this restriction is decreasing with exactly one root $x=\frac{1}{2}\mu-\frac{1}{3}-\frac{1}{6}\sqrt{4-3(2-\mu)^2}$. Numerical evaluation gives \begin{align*} p_1&\approx 0.544\\ p_2&\approx 0.296\\ p_3&\approx 0.161\\ E(p)&\approx 1.617\\ r(3)&\approx 0.609. \end{align*}
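These small cases are easy to double-check by brute force; the following Python grid search (illustrative only, names are mine) reproduces $r(2)\approx 0.481$ and $r(3)\approx 0.609$:

```python
import math

def ratio(p):
    """H(p)/E(p) for a pmf p on {1, ..., len(p)}."""
    h = -sum(q * math.log(q) for q in p if q > 0)
    mu = sum((i + 1) * q for i, q in enumerate(p))
    return h / mu

# s = 2: one free parameter p1
r2 = max(ratio((k / 10000, 1 - k / 10000)) for k in range(1, 10000))

# s = 3: grid over (p1, p2), with p3 = 1 - p1 - p2 > 0
n = 400
r3 = max(ratio((i / n, j / n, 1 - i / n - j / n))
         for i in range(1, n) for j in range(1, n - i))
print(round(r2, 3), round(r3, 3))  # ≈ 0.481 and ≈ 0.609
```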


If we weaken the condition to $s(p) \le t$, i.e., study $\tilde{r}(t) := \sup\{H(p)/E(p) : s(p) \le t\}$, then observe that to minimise $E$ without changing $H$, the support may as well be taken to be $[1:t]$. Now, if we fix a value of $E(p) = \mu \in [1,t]$ and consider the maximiser of $H(p)$ over such $p$, then it's just Lagrange multipliers to argue that this is optimised by a geometric law supported on $[1:t]$. Since this is true for each $\mu$, this is also the form of the maximiser of $\tilde{r}(t)$, and since this law has full support, it is also the form of the maximiser of $r(t)$ for any $t \ge 1$ under the condition that $E(p) = \mu \in [1,t]$; in particular $\tilde r(t) = r(t)$. Let $h(\mu,t)$ denote the maximal entropy for a given mean $\mu$ and support $[1:t]$. Then it follows that $$r(t) = \max_p H(p)/E(p) = \max_{\mu \in [1,t]} \max_{p : E(p) = \mu} H(p)/\mu = \max_{\mu \in [1,t]} h(\mu,t)/\mu.$$ In general, then, the optimal law is a geometric law on $[1:t]$.
Going beyond this turns into a bit of a mess, because the geometric law on a finite set is unwieldy, but it nevertheless leads to interesting conclusions.
Setup. Consider a geometric law $$ p(n; \eta, t) := \frac{1-\eta}{1- \eta^t}\eta^{n-1} \mathbf{1}\{n \in [1:t]\}.$$ Note that each $\eta > 0$, $\eta \neq 1$, gives a valid law; the limit $\eta \to 1$ gives the uniform law, $\eta > 1$ gives laws skewed towards $t$, and $\eta < 1$ gives laws skewed towards $1$, with the limiting laws supported entirely on $1$ (as $\eta \to 0$) and on $t$ (as $\eta \to \infty$). The mean can be computed as $$ \mu(\eta,t) := \frac{1}{1-\eta} -\frac{t\eta^t}{1 - \eta^t},$$ while the entropy of $p(n;\eta,t)$ is $$ H(\eta,t) := - \log \frac{1-\eta}{1-\eta^t} - (\mu(\eta,t)- 1)\log\eta.$$
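These closed forms are easy to verify against direct summation; here's a throwaway Python check (the helper names `pmf`, `mu_closed`, `H_closed` are mine):

```python
import math

def pmf(eta, t):
    """Truncated geometric law p(n; eta, t) on {1, ..., t}, eta != 1."""
    z = (1 - eta) / (1 - eta**t)
    return [z * eta**(n - 1) for n in range(1, t + 1)]

def mu_closed(eta, t):
    return 1 / (1 - eta) - t * eta**t / (1 - eta**t)

def H_closed(eta, t):
    return -math.log((1 - eta) / (1 - eta**t)) - math.log(eta) * (mu_closed(eta, t) - 1)

# compare with direct summation at a few parameter values (eta != 1)
for eta in (0.3, 0.7, 1.5):
    for t in (2, 5, 12):
        p = pmf(eta, t)
        mu_direct = sum(n * q for n, q in enumerate(p, 1))
        H_direct = -sum(q * math.log(q) for q in p)
        assert abs(mu_direct - mu_closed(eta, t)) < 1e-9
        assert abs(H_direct - H_closed(eta, t)) < 1e-9
print("closed forms agree with direct summation")
```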
Since all of the optimal laws are parametrised by $\eta$, maximising the ratio $h(\mu,t)/\mu$ over all $\mu$ is equivalent to maximising $H(\eta,t)/\mu(\eta,t)$ over all $\eta$, i.e., $$ r(t) = \max_\eta \frac{H(\eta,t)}{\mu(\eta,t)}.$$ The first-order condition for this program is $$ H'(\eta,t) \mu(\eta,t) - H(\eta,t) \mu'(\eta,t) = 0,$$ where $H'(\eta,t) = \partial_\eta H(\eta,t)$ and $\mu'(\eta,t) = \partial_\eta \mu(\eta,t).$
Developing the first-order condition. Computing the derivative, we find that $$H'(\eta,t) = \frac{1}{1-\eta} - \frac{t\eta^{t-1}}{1-\eta^t} - \frac{1}{\eta}(\mu(\eta, t) - 1) - \log \eta \cdot \mu'(\eta, t) = - \mu'(\eta,t) \cdot \log \eta,$$ where the first three terms cancel because of the identity $$\frac{1}{1-\eta} - \frac{t\eta^{t-1}}{1-\eta^t} = \frac{\mu(\eta,t)-1}{\eta},$$ obtained by dividing $\mu(\eta,t) - 1 = \frac{\eta}{1-\eta} - \frac{t\eta^t}{1-\eta^t}$ by $\eta$.
Plugging this in, the first order condition is $$ \mu'(\eta,t)( \mu(\eta, t) \log \eta + H(\eta, t)) = 0. $$
Now, observe that $$ \mu'(\eta, t) = \frac{1}{(1 - \eta)^2} - \frac{t^2 \eta^{t-1}}{(1-\eta^t)^2}.$$ I claim that this has no roots for $t \ge 2$ except at $\eta = 1$. This should be intuitive, but more formally, observe that for any fixed $\eta > 0$ with $\eta \neq 1$, the function $$ g(t) := \frac{(1-\eta^t)^2}{t^2 \eta^{t-1}} = 4\eta \left(\frac{\sinh(t\log(\eta)/2)}{t}\right)^2 $$ is strictly monotone in $t$ for $t \ge 1$. Indeed, $g(t) = 4\eta \,(\log(\eta)/2)^2\, h(t \log(\eta)/2)^2$ for $h(z) := \sinh(z)/z$, and a simple derivative argument along with the fact that $\tanh(z) \le z$ shows that $h$ is positive, even, and increasing in $|z|$, so $g$ is strictly increasing in $t$. Since $\mu'(\eta,t) = 0$ is equivalent to $g(t) = (1-\eta)^2 = g(1)$, the only way that $\mu'(\eta, t) = 0$ for $t \ge 2$ is if $\eta = 1$.
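A quick finite-difference check in Python (illustrative; the grid points are chosen arbitrarily) confirms both the formula for $\mu'$ and its positivity away from $\eta = 1$:

```python
import math

def mu(eta, t):
    return 1 / (1 - eta) - t * eta**t / (1 - eta**t)

def mu_prime(eta, t):
    return 1 / (1 - eta)**2 - t**2 * eta**(t - 1) / (1 - eta**t)**2

# check mu' against a central difference, and confirm it never vanishes
h = 1e-6
for t in (2, 5, 20):
    for eta in (0.1, 0.5, 0.9, 1.1, 2.0):  # eta = 1 excluded
        fd = (mu(eta + h, t) - mu(eta - h, t)) / (2 * h)
        assert abs(fd - mu_prime(eta, t)) < 1e-4
        assert mu_prime(eta, t) > 0
print("mu' matches the formula and is positive at all sampled points")
```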
Optimal $\eta$. This leaves the condition $H(\eta,t) + \log(\eta)\, \mu(\eta,t) = 0.$ This translates to the equation $$ \log \frac{1-\eta}{1-\eta^t} = \log \eta \iff 1-2\eta + \eta^{t+1} = 0.$$ For $t \ge 2$ this equation has exactly two positive solutions, one at $1$ and one in $(1/2, 1)$. Let $\eta_*(t)$ denote the solution in $(1/2,1)$. I claim that this is the optimal choice of $\eta$, at least for $t \ge 8$. Indeed, $\eta = 1$ yields the uniform law, which does not maximise the ratio for $t \ge 8$: its ratio $2\log(t)/(t+1)$ is then at most $2\log(8)/9 \approx 0.462$, while already the two-point maximiser from the $r(2)$ computation achieves $\approx 0.481$. Further, the limits $\eta \to 0$ and $\eta \to \infty$ both yield laws that concentrate on a single point, and so have zero entropy (and thus zero ratio), while the above solution has a non-zero ratio. Thus, no other point can be optimal for $t \ge 8$ (in general, of course, we can just check the curvature, but I don't want to deal with that :P).
The behaviour of $r$. Notice, interestingly, that since at the optimal $\eta_*(t)$ we have $H(\eta_*(t),t) + \log\eta_*(t) \cdot \mu(\eta_*(t),t) = 0$, we can immediately conclude that $$ r(t) = -\log \eta_*(t).$$
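To illustrate, here is a rough Python sketch (the name `eta_star` is mine) that solves $1-2\eta+\eta^{t+1}=0$ by bisection and checks $r(t) = -\log\eta_*(t)$ against a direct maximisation over $\eta$:

```python
import math

def mu(eta, t):
    return 1 / (1 - eta) - t * eta**t / (1 - eta**t)

def H(eta, t):
    return -math.log((1 - eta) / (1 - eta**t)) - math.log(eta) * (mu(eta, t) - 1)

def eta_star(t):
    """Root of 1 - 2*eta + eta**(t+1) in (1/2, 1), by bisection."""
    lo, hi = 0.5, 0.999  # the polynomial is positive at 0.5, negative at 0.999
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 1 - 2 * mid + mid**(t + 1) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t = 10
r_root = -math.log(eta_star(t))
# direct grid maximisation of H/mu over eta in (1/2, 1)
r_grid = max(H(e, t) / mu(e, t) for e in (0.5 + k / 2000 for k in range(1, 998)))
print(r_root, r_grid)
```

As a sanity check, `eta_star(2)` recovers $(\sqrt 5 - 1)/2 \approx 0.618$, so $-\log\eta_*(2) \approx 0.481$ matches the value of $r(2)$ computed in the question.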
While it's hard in general to say anything more, since $\eta_*(t)$ is difficult to nail down, we can study asymptotics quite cleanly. Indeed, let $g_t(\eta) := 1 - 2\eta + \eta^{t+1}$. Then notice that for large $t$, taking a Taylor expansion near $1/2$, $$ g_t(1/2 + \varepsilon) = 2^{-(t+1)} + (-2 + (t+1) 2^{-t}) \varepsilon + t(t+1) 2^{-t} \varepsilon^2 + O(\varepsilon^3), $$ and setting the linear part to zero gives $\varepsilon \approx 2^{-(t+1)}/2 = 2^{-(t+2)}$. So for $t \gg 1$ we get $\eta_*(t) \approx 2^{-1} + 2^{-(t+2)}$, which yields $r(t) = -\log\eta_*(t) = \log(2) - \log(2\eta_*(t)) \approx \log(2) - 2^{-(t+1)}.$
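This expansion is easy to confirm numerically (a quick hypothetical script, reusing the bisection root-finder): the normalised quantities $(\eta_*(t)-\tfrac12)\,2^{t+2}$ and $(\log 2 + \log\eta_*(t))\,2^{t+1}$ should both approach $1$.

```python
import math

def eta_star(t):
    """Root of 1 - 2*eta + eta**(t+1) in (1/2, 1), by bisection."""
    lo, hi = 0.5, 0.999
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 1 - 2 * mid + mid**(t + 1) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# both columns should tend to 1 as t grows
for t in (10, 20, 30):
    es = eta_star(t)
    print(t, (es - 0.5) * 2**(t + 2), (math.log(2) + math.log(es)) * 2**(t + 1))
```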
Further, we can show that $r(t) < \log(2)$ for all $t$ (if the entropy is measured in bits, then this is $r(t) < 1$, leonbloy's conjecture from the comments).
To show this, it suffices to argue that $\eta_*(t) > 1/2,$ i.e., that $g_t(\eta) = 0$ does not have any roots $\le 1/2.$ Indeed, $$ g_t'(\eta) = -2 + (t+1) \eta^{t}$$ is strictly increasing with $\eta$, and since $2^x \ge x$ for every $x$, it follows that $g_t'(1/2) < 0,$ which in turn implies that $ \forall \eta \in [0,1/2], g_t'(\eta) < 0.$ Therefore, $$\min_{\eta \in [0,1/2]} g_t(\eta) = g_t(1/2) = 2^{-(t+1)} > 0,$$ and thus there is no root in $[0,1/2]$, i.e., $\eta_*(t) > 1/2.$