For a random variable $X$ with potential outcomes $\{x_1, x_2, \ldots, x_n\}$ and probabilities $p(x_1), p(x_2), \ldots, p(x_n)$, the formula for the Shannon entropy is $H(X) = -\sum_{x \in X}{p(x) \log_{2} p(x)}$.
If I denote $b(x) = -\log_2 p(x)$, then the formula becomes:
$$H(X) = \sum_{x \in X}{p(x)\,(-\log_{2} p(x))} = \sum_{x \in X}{p(x)\, b(x)}$$
Now, Cover & Thomas write: "The entropy is a measure of the average uncertainty in the random variable. It is the number of bits on average required to describe the random variable." We also know that the expected value of a function $f$ of a random variable $X$ is $E_X[f(x)] = \sum_{x \in X}{(p(x)f(x))}$.
Thus, the formula for entropy can be rewritten as the expected value of $b(X)$, that is:
$$H(X) = E_X[b(X)] = \sum_{x \in X}{p(x)\, b(x)}$$
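This identity is easy to check numerically. Below is a minimal sketch (the helper names `surprisal` and `entropy` are my own, not standard library functions) that computes $H(X)$ as the probability-weighted average of $b(x)$:

```python
import math

def surprisal(p):
    """b(x) = -log2(p(x)): the surprisal, in bits, of an outcome with probability p."""
    return -math.log2(p)

def entropy(probs):
    """H(X) = E[b(X)]: the probability-weighted average surprisal.
    Outcomes with p = 0 contribute nothing, by the convention lim p*log(p) -> 0."""
    return sum(p * surprisal(p) for p in probs if p > 0)

# A fair coin: two outcomes, each with p = 0.5
print(entropy([0.5, 0.5]))   # 1.0 bit

# A biased coin is less uncertain, so its entropy is lower (about 0.469 bits)
print(entropy([0.9, 0.1]))
```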
Questions:
1. Is it right to think of $H(X)$ as $E_X[b(X)]$, or is this just an algebraic coincidence?
2. If 1 is correct, is it also correct that $b(X=x)$ is the number of bits required to describe the potential outcome $x$ of $X$?
3. If 2 is correct, why do potential outcomes with $p(x)=1$ require 0 bits, those with $p(x)=0.5$ require 1 bit, etc.? Essentially, why use $b(x)$ and not another function?
4. Is it a correct interpretation to say that potential outcomes with $p(x)=0$ require an infinite number of bits?
5. Is there a specific name for the function $b(x) = -\log_2 p(x)$?
Note: I've already researched previous questions on Shannon entropy, but none goes directly to the issue of interpreting $H(X)$ as $E_X[b(X)]$.
5. The name of the function $b(x)$ is surprisal. Intuitively, smaller-probability events carry a larger surprise, so you can say $H(X)$ is the expected value of the surprisal.
1. Yes, it's a valid way to interpret entropy, via the algebraic formulation of expectation.
2. It's correct. $b(x)$ can also be interpreted as the amount of surprise you get from the outcome $x$; there are quite a few equivalent interpretations of entropy.
3. Perspective 5 in the CMU notes is the closest to your version: "5. Number of yes/no questions needed on average to guess a draw from X". You can then interpret $b(x)$ as the number of questions needed, with an optimized sequence of questions, to get to the value $x$.
When $p(X=A)=1$ you need 0 questions to know the value of $X$: it's certain. When $p(X=A)=0.5$, your best strategy is to ask "is the value of $X$ equal to $A$ or not?", so you need 1 question to know its value. (Why is this the best strategy? See this.)
4. Similarly, an optimal strategy never spends questions on values with $p(X=A)=0$, so in that sense you would need infinitely many questions to get to those values.