Measure of information in a particular example


Consider one experiment with two possible results described by a random variable $X$ attaining values $x_1 = 0$ or $x_2 = 1$ to capture the two possible outcomes.

Every such random variable is parametrized by a number $p\in [0,1]$ giving rise to probabilities

$$P(X = x_1)=p,\quad P(X=x_2)=1-p.$$

One such experiment would be a coin being tossed and the results analyzed, or a particle of spin $1/2$ having its intrinsic angular momentum observed.

Now, define the entropy $S_X$ by

$$S_X=-\sum_{i}p_i \log p_i=-p\log p -(1-p)\log (1-p).$$

It has the following properties

  1. It is minimal when $p = 0$ or $p = 1$, where it has the value $0$;
  2. It is maximal when $p = 1/2$, where it attains the value $\log 2$;
  3. It is monotonically increasing on the interval $[0,1/2]$ and monotonically decreasing on the interval $[1/2,1]$.
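These three properties can be checked numerically. Below is a minimal sketch in Python (not part of the original question), using base-2 logarithms so the maximum comes out as $\log_2 2 = 1$ bit:

```python
import math

def binary_entropy(p):
    """S_X = -p*log2(p) - (1-p)*log2(1-p), with 0*log(0) taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Property 1: minimum value 0 at the endpoints.
print(binary_entropy(0.0), binary_entropy(1.0))  # 0.0 0.0

# Property 2: maximum value log2(2) = 1 bit at p = 1/2.
print(binary_entropy(0.5))                       # 1.0

# Property 3: increasing on [0, 1/2], decreasing on [1/2, 1].
vals = [binary_entropy(i / 100) for i in range(101)]
print(all(vals[i] < vals[i + 1] for i in range(50)))       # True
print(all(vals[i] > vals[i + 1] for i in range(50, 100)))  # True
```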

Now, as is usually discussed, this is clearly a measure of uncertainty about the result of the experiment. This follows from the above properties:

If $p = 0$ or $p=1$ we are assured of the result already, so there is no uncertainty. If $p =1/2$ both results are equally likely and we are maximally uncertain of what result may come out. Any other $p$ in between shifts the balance of likelihood towards one result, which reduces our uncertainty without eliminating it.

That is fine. But one usually says that $S_X$ is a measure of information.

But I fail to get the intuition here. Why is $S_X$ a measure of information? What is the intuition for $p = 0$ or $p = 1$ giving no information, $0 < p < 1$ giving some information, and $p = 1/2$ giving maximal information?

How does one interpret here $S_X$ as a measure of information?

Best answer:

It's not $p$ that gives information but the experiment. In an experiment described by a discrete probability distribution $p_i$, each time you perform it you gain $-\sum_ip_i\log_2p_i$ bits of information. For instance, in your setup with $p=\frac12$, this is $1$, so you gain exactly one bit of information when you flip a coin; you need one bit to describe each result, and you can't compress that any further. On the other hand, for $p=0$ and $p=1$ you don't gain any information, and you need $0$ bits to describe the results of the experiments, since they're already fully known without performing the experiments.

To consider a more interesting example, say you know that $0$ is more likely, so you encode triples of bits by different lengths of bit strings:

$$\begin{array}{c|c}
\text{triple}&\text{code}\\\hline
000&0\\
001&100\\
010&101\\
100&110\\
110&11100\\
101&11101\\
011&11110\\
111&11111
\end{array}$$

Note that this is a prefix code, so the results can be unambiguously decoded from the encoded string. (See also Huffman coding.)
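Both the prefix property and unambiguous decoding can be demonstrated with a short Python sketch (the code table is the one from the answer; the `encode`/`decode` helpers are illustrative):

```python
# The code table above, mapping each 3-bit triple to its codeword.
CODE = {
    "000": "0",
    "001": "100", "010": "101", "100": "110",
    "110": "11100", "101": "11101", "011": "11110",
    "111": "11111",
}

# Prefix property: no codeword is a proper prefix of another.
words = list(CODE.values())
assert not any(a != b and b.startswith(a) for a in words for b in words)

def encode(triples):
    return "".join(CODE[t] for t in triples)

def decode(bits):
    """Greedy decoding: because the code is prefix-free, the first time the
    buffer matches a codeword is the only possible match."""
    inverse = {v: k for k, v in CODE.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return out

msg = ["000", "101", "111", "000"]
print(encode(msg))                  # → '011101111110'
assert decode(encode(msg)) == msg   # round trip is lossless
```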

Now the expected number of bits you need per experiment is

$$ \frac13\left(p^3+3\cdot3p^2(1-p)+5\cdot3p(1-p)^2+5(1-p)^3\right)=\frac13\left(2p^3-6p^2+5\right)\;. $$
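The left-hand side just weights each codeword length by the probability of its triple; one can verify numerically that it matches the simplified closed form (a quick sketch, not part of the original answer):

```python
def expected_bits_per_experiment(p):
    """Expected codeword length per triple, divided by 3 since each
    triple encodes three experiments."""
    q = 1 - p
    triple_len = (1 * p**3          # 000, length 1
                  + 3 * 3 * p**2 * q  # three triples of length 3
                  + 5 * 3 * p * q**2  # three triples of length 5
                  + 5 * q**3)         # 111, also length 5
    return triple_len / 3

# Agrees with the closed form (2p^3 - 6p^2 + 5)/3:
for p in (0.1, 0.5, 0.8, 0.95):
    closed = (2 * p**3 - 6 * p**2 + 5) / 3
    assert abs(expected_bits_per_experiment(p) - closed) < 1e-12
```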

Here's a plot of this function compared to the information $-p\log_2p-(1-p)\log_2(1-p)$. Note how for small $p$, the code is very inefficient (since it's designed for $p\gt\frac12$), and for $p$ near $1$ it's also far from optimal (since it still requires $1$ bit per $3$ bits even when there's actually no information) – but there's a point in between, near $p=0.8$, where you might actually think from that plot that the code achieves the optimal compression corresponding to the upper bound given by the information. This plot shows, however, that that's not quite the case; the code is very good around $p=0.8$, but it doesn't quite achieve the bound. If you did the same thing with larger tuples, you could get even closer to the bound; and arithmetic coding theoretically achieves the bound – but you can never compress the results into fewer bits than given by their information content.
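The comparison made in the plot can also be checked directly: the code's rate never drops below the entropy, and near $p=0.8$ the gap is small but strictly positive (an illustrative sketch, assuming the closed form above):

```python
import math

def entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate(p):
    """Expected bits per experiment for the triple code above."""
    return (2 * p**3 - 6 * p**2 + 5) / 3

# The code never beats the entropy bound, anywhere on (0, 1):
for i in range(1, 100):
    p = i / 100
    assert rate(p) >= entropy(p)

# Near p = 0.8 the code is very good, but the gap is nonzero:
print(rate(0.8) - entropy(0.8))  # small positive number (~0.006)
```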

P.S.:

One would expect the value of $p$ at which the above code is near optimal to be near $p=\sqrt[3]{\frac12}\approx0.7937$, since that would make the probability for the triple $000$ that's encoded by a single bit come out as $\frac12$. Here's a table with the probabilities $p_i$ of the triples and their negative logarithms $-\log_2p_i$ (for $p=\sqrt[3]{\frac12}$), compared with the numbers of bits the code actually uses:

$$\begin{array}{c|c|c|c|c}
\text{triple}&\text{code}&p_i&-\log_2p_i&\text{bits}\\\hline
000&0&\frac12&1&1\\
001&100&&&\\
010&101&\left(\frac12\right)^{\frac23}\left(1-\left(\frac12\right)^{\frac13}\right)\approx0.1300&2.944&3\\
100&110&&&\\
110&11100&&&\\
101&11101&\left(\frac12\right)^{\frac13}\left(1-\left(\frac12\right)^{\frac13}\right)^2\approx0.0338&4.888&5\\
011&11110&&&\\
111&11111&\left(1-\left(\frac12\right)^{\frac13}\right)^3\approx0.0088&6.832&5
\end{array}$$
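The table's numbers can be reproduced with a few lines of Python (illustrative; the grouping of triples follows the table above):

```python
import math

p = 0.5 ** (1 / 3)   # ≈ 0.7937, chosen so that P(000) = p^3 = 1/2
q = 1 - p

# Probability p_i and ideal code length -log2(p_i) for each class of
# triples, next to the number of bits the code actually uses.
for name, prob, bits in [("000", p**3, 1),
                         ("one 1", p**2 * q, 3),
                         ("two 1s", p * q**2, 5),
                         ("111", q**3, 5)]:
    print(f"{name:6s}  p_i={prob:.4f}  -log2(p_i)={-math.log2(prob):.3f}  code bits={bits}")
```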