For a random variable $X$ with potential outcomes $\{x_1, x_2, \ldots, x_n\}$ and probabilities $p(x_1), p(x_2), \ldots, p(x_n)$, the formula for the Shannon entropy is $H(X) = -\sum_{x \in X}{p(x) \log_{2} p(x)}$.
If I denote $b(x) = -\log_2 p(x)$, then the formula becomes:
$$H(X) = \sum_{x \in X}{p(x)\,(-\log_{2} p(x))} = \sum_{x \in X}{p(x)\, b(x)}$$
Now, Cover & Thomas write: "The entropy is a measure of the average uncertainty in the random variable. It is the number of bits on average required to describe the random variable." We also know that the expected value of a function $f$ of a random variable $X$ is $E_X[f(x)] = \sum_{x \in X}{(p(x)f(x))}$.
Thus, the formula for entropy can be rewritten as the expected value of $b(X)$, that is:
$$H(X) = E_X[b(X)] = \sum_{x \in X}{p(x)\, b(x)}$$
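This identity is easy to check numerically. Below is a minimal sketch (the helper names `surprisal` and `entropy` are my own, not standard library functions) that computes $H(X)$ as the probability-weighted average of $b(x)$:

```python
import math

def surprisal(p):
    """b(x) = -log2(p(x)): the surprisal, in bits, of an outcome with probability p."""
    return -math.log2(p)

def entropy(probs):
    """H(X) = E[b(X)]: the probability-weighted average surprisal.
    Outcomes with p = 0 contribute nothing, by the convention lim p*log(p) -> 0."""
    return sum(p * surprisal(p) for p in probs if p > 0)

# A fair coin: two outcomes, each with p = 0.5
print(entropy([0.5, 0.5]))   # 1.0 bit

# A biased coin is less uncertain, so its entropy is lower (about 0.469 bits)
print(entropy([0.9, 0.1]))
```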
Questions:
1. Is it right to think of $H(X)$ as $E_X[b(X)]$, or is this just an algebraic coincidence?
2. If 1 is correct, is it also correct that $b(X=x)$ is the number of bits required to describe the potential outcome $x$ of $X$?
3. If 2 is correct, why do potential outcomes with $p(x)=1$ require 0 bits, those with $p(x)=0.5$ require 1 bit, etc.? Essentially, why use $b(x)$ and not another function?
4. Is it a correct interpretation to say that potential outcomes with $p(x)=0$ require an infinite number of bits?
5. Is there a specific name for the function $b(x) = -\log_2 p(x)$?
Note: I've already researched previous questions on Shannon entropy, but none goes directly to the issue of interpreting $H(X)$ as $E_X[b(X)]$.
5. The name of the function $b(x)$ is surprisal. Intuitively, smaller-probability events carry a larger surprise, so you can say $H(X)$ is the expected value of the surprisal.
1. Yes, it's a valid way to interpret entropy, via the algebraic formulation of expectation.
2. It's correct. $b(x)$ can also be interpreted as the amount of surprise you get from the outcome $x$; there are quite a few equivalent interpretations of entropy.
3. Perspective 5 in the CMU notes is the closest to your version: "5. Number of yes/no questions needed on average to guess a draw from X". You can then interpret $b(x)$ as the number of questions needed, with an optimized sequence of questions, to get to the value $x$.
When $p(X=A)=1$ you need 0 questions to know the value of $X$: it's certain. When $p(X=A)=0.5$, your best strategy is to ask "is the value of $X$ equal to $A$ or not?", so you need 1 question to know its value. (Why is this the best strategy? See this.)
4. Similarly, an optimal strategy never spends questions on values with $p(X=A)=0$, so in that sense you would need infinitely many questions to get to those values.