Understanding Shannon entropy better


I'm not really sure what the entropy represents. On Wikipedia, it says that entropy is the amount of information available and is defined as $-\mathbb E[\log_b P(X)]$, where $P$ is the probability mass function of $X$. But I don't really see where this formula comes from... Can someone explain?


For example, let's take a box with $4$ red balls, $2$ yellow balls, $1$ blue ball and $1$ green ball. The entropy is $\frac{7}{4}$. What does this mean? And in my course, it's written that we encode red balls with $0$, yellow balls with $10$, blue balls with $110$ and green balls with $111$.

Honestly, these numbers come from nowhere for me. Why does red get only one digit, yellow two digits, and blue and green three digits?


There are 3 best solutions below


Have a look at the book I recommended in the comments. It contains historical references, examples and many other useful results (including how the formula is/was derived).

On another note, if it helps, you can think of information entropy as a measurement of randomness (disorder, or "missing" information). Let's look at the example in your question: $$P(\text{red})=\frac{1}{2}, P(\text{yellow})=\frac{1}{4}, P(\text{blue})=P(\text{green})=\frac{1}{8}$$ and, as you calculated, the entropy of the random variable $X$ with possible values $V=\{\text{red},\text{yellow},\text{blue},\text{green}\}$ is

$$H(X)=-\sum\limits_{x\in V}P(X=x)\log_2{P(X=x)}=\\ -\frac{1}{2}\log_2{\frac{1}{2}}-\frac{1}{4}\log_2{\frac{1}{4}}-\frac{1}{8}\log_2{\frac{1}{8}}-\frac{1}{8}\log_2{\frac{1}{8}}=\frac{7}{4}$$

Indeed, $\frac{7}{4}$ may not say a lot at this point. But, imagine (another random variable $Y$) having $8$ balls, $4$ pairs of the same color each. Then
$$P(\text{red})=P(\text{yellow})=P(\text{blue})=P(\text{green})=\frac{1}{4}$$ and $$H(Y)=2$$ In this case, $2>\frac{7}{4}$ or $Y$ has "more randomness" (or more disorder) than $X$. This makes sense because, for example, you can't predict "red" with $P(\text{red})=\frac{1}{2}$ anymore. Or you don't have enough (or miss) information to predict the state of $Y$.

And finally imagine (a random variable $Z$) having $8$ balls of different colors. Then $$P(Z=\text{color}_i)=\frac{1}{8}$$ and $$H(Z)=3$$ In this case $3>2>\frac{7}{4}$ or $Z$ has higher entropy because it has more states ($8$, compared to $4$ states of $Y$).
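The three entropies above are easy to check numerically. A minimal Python sketch (the `entropy` helper is just the defining sum written out, not a library function):

```python
from math import log2

def entropy(probs):
    """Shannon entropy H = -sum p * log2(p), in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# X: the urn with 4 red, 2 yellow, 1 blue, 1 green ball
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75
# Y: four colors, equally likely
print(entropy([1/4] * 4))             # 2.0
# Z: eight colors, equally likely
print(entropy([1/8] * 8))             # 3.0
```

Note how the uniform distributions ($Y$ and $Z$) hit the maximum $\log_2 n$ for their number of states.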

Three important properties of entropy to remember are:

- $H(X)\ge 0$, with equality exactly when the outcome is certain.
- For $n$ possible outcomes, $H(X)\le\log_2 n$, with equality when all outcomes are equally likely (as with $Y$ and $Z$ above).
- Entropy is additive for independent variables: $H(X,Y)=H(X)+H(Y)$.


Entropy in this context is a measure of how much information we learn from knowing the outcome of a random event - or, if you prefer, how much information it takes to describe that outcome, relative to our pre-event ignorance.

Shannon began by noting you can describe $2^n$ options with $n$ bits each, so an event of probability $2^{-n}$ needs $n$ bits. We can then make this notion continuous: a probability-$p$ event needs $-\log_2p$ bits, or $-\log_ep$ nats, or... well, if you decide to use something that can have $b$ values instead of $2$, we'll have to use base-$b$ logarithms throughout, but since $\log_bx=\frac{\log_2x}{\log_2b}$ we just scale all the results.

So let's not worry about the value of $b$ right now: for some definition of logarithms, an event of probability $p$ has "entropy" $-\log p$. (This term is borrowed from a branch of thermodynamics called statistical mechanics. Log-probabilities are additive for independent subsystems when we count possible states, so entropy is an "extensive" property.) Similarly, a scenario whose different outcomes have probabilities $p_i$ has mean entropy $-\sum_ip_i\log p_i$.
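This "extensive" (additive) behavior can be checked directly. A small sketch, where the `surprisal` name is mine and not a standard library function:

```python
from math import log2

def surprisal(p):
    """Information content of an event with probability p, in bits."""
    return -log2(p)

# An event of probability 2^-n carries exactly n bits.
print(surprisal(1/8))        # 3.0
# Surprisals add for independent events, since P(A and B) = P(A) * P(B):
print(surprisal(1/2 * 1/4))  # 3.0, i.e. surprisal(1/2) + surprisal(1/4)
```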


Shannon developed information theory because he was interested in sending communications as efficiently as possible. The idea is to structure the signals so that shorter signals are used to report more probable events, and longer signals to report rare events. If an event is certain to occur, there is no need to report it at all, because the receiver already knows it. (Why would we waste power transmitting "The sun rose in the east today"?)

Now let's suppose that we repeatedly draw balls from the urn, with replacement, and signal the results with the codes your teacher gave. How many bits do we use on average to report a single draw? Clearly $$1\cdot\frac12+2\cdot\frac14+2\cdot3\cdot\frac18=\frac74$$ Information theory shows that this is the best we can do: the average code length equals the entropy.
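This arithmetic can be verified in a few lines of Python. The code table below writes yellow as $10$ so that no codeword is a prefix of another (matching the computation above, where yellow costs $2$ bits):

```python
from math import log2

# Prefix-free code for the urn's colors, and the draw probabilities
code = {"red": "0", "yellow": "10", "blue": "110", "green": "111"}
probs = {"red": 1/2, "yellow": 1/4, "blue": 1/8, "green": 1/8}

# Expected codeword length per draw, versus the entropy of the source
avg_len = sum(probs[c] * len(code[c]) for c in code)
H = -sum(p * log2(p) for p in probs.values())
print(avg_len, H)  # 1.75 1.75 -- this code meets Shannon's bound
```

The average length matches the entropy exactly here because every probability is a power of $\frac12$; in general the entropy is a lower bound that a code can only approach.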