What does $-p \ln p$ mean if p is probability?


In statistical mechanics entropy is defined with the following relation:

$$S=-k_B\sum_{i=1}^N p_i\ln p_i,$$

where $p_i$ is the probability of occupying the $i$th state, and $N$ is the number of accessible states. I understand easily what probability is: for a frequentist it's just the long-run frequency of getting a given result. But I have a hard time trying to intuitively understand what $-p_i \ln p_i$ means. In the case of equal probabilities, $p_i=1/N$ for all $i$, the sum reduces to $\ln N$, i.e. the logarithm of the number of accessible states.

But in the general case of unequal probabilities, what does $-p_i\ln p_i$ really represent? Is it some sort of "(logarithm of the) average number of accessible states"? Or maybe it's more useful to try to understand what $p_i^{p_i}$ is (but that seems even harder)?
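The equal-probability reduction mentioned above is easy to check numerically. A minimal Python sketch (the helper name `entropy` is my own, not from the question):

```python
import math

def entropy(probs):
    """Dimensionless entropy in nats: -sum p_i ln p_i (terms with p=0 contribute 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Uniform case: p_i = 1/N for all i reduces the sum to ln N.
N = 8
uniform = [1.0 / N] * N
print(entropy(uniform), math.log(N))  # equal

# Unequal probabilities give a smaller entropy than ln N.
skewed = [0.7, 0.1, 0.1, 0.05, 0.05]
print(entropy(skewed), math.log(5))
```

The uniform distribution is the one that maximizes the sum, which is why $\ln N$ appears as the special case.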

BEST ANSWER

Let's say you wanted to compress the results of a sequence of independent trials into a sequence of bits.

Then the "ideal" encoding of the results of the trials would use $-\log_2 p_i$ bits for event $i$. This is optimal in the limit, as the number of trials approaches infinity.

Now, what is the expected number of bits per trial? Since a trial yields event $i$ with probability $p_i$ and costs $-\log_2 p_i$ bits, the expectation is $-\sum p_i\log_2 p_i$. That is, if you want to encode the results of $N$ trials, you are going to require, on average, $-N\sum p_i\log_2 p_i$ bits with your absolute best encoding.

You can see this most clearly when the $p_i$ are all of the form $\frac{1}{2^{k_i}}$.

For example, if $p_1=1/2, p_2=1/4, p_3=1/4$, then an "ideal" encoding uses $0$ for event $1$, $10$ for event $2$, and $11$ for event $3$. The expected number of bits per trial is $\frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2+\frac{1}{4}\cdot 2 = -\sum p_i\log_2 p_i=\frac{3}{2}$. This means that with $N$ trials of this sort, the expected number of bits needed to store the results is $\frac{3}{2}N$.
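The arithmetic in this example can be checked directly. A short Python sketch (not part of the original answer):

```python
import math

probs = [0.5, 0.25, 0.25]
code_lengths = [1, 2, 2]  # lengths of the prefix code 0, 10, 11

# Expected bits per trial under this code:
expected_bits = sum(p * n for p, n in zip(probs, code_lengths))

# Shannon entropy in bits, -sum p_i log2 p_i:
entropy_bits = -sum(p * math.log2(p) for p in probs)

print(expected_bits, entropy_bits)  # both equal 1.5
```

The two quantities coincide exactly here because every $p_i$ is a power of $\frac{1}{2}$, so each $-\log_2 p_i$ is a whole number of bits.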

So entropy is also part of what mathematicians call "information theory." That is, the entropy of a system tells you how much (expected) information is needed to describe the results.

Now, if your probabilities are not so nice, then you have to encode more cleverly. For example, if $p_1=p_2=p_3=\frac{1}{3}$, then you wouldn't get "ideal" storage by encoding the trials one at a time. But if you took five bits at a time, you could store three trials, since $5$ bits give $32$ values, enough to cover all $27$ possible outcomes of three trials. In $8$ bits, you can store the results of $5$ trials. In general, in $m$ bits you can store $\log_3(2^m)$ trials. So to store $n$ trials, you need $m$ bits with $\log_3(2^m)\geq n$, which is $$m\geq \frac{n}{\log_3 2} = n\log_2 3 = -n\sum p_i\log_2 p_i$$
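The block-coding counts above can be reproduced numerically. A small Python sketch (the helper `bits_needed` is my own name for "smallest $m$ with $2^m \geq 3^n$"):

```python
import math

def bits_needed(n_trials):
    """Smallest m such that m bits (2^m values) can index all 3^n_trials outcomes."""
    m = 0
    while 2 ** m < 3 ** n_trials:
        m += 1
    return m

print(bits_needed(3))  # 5 bits cover 3 trials (2^5 = 32 >= 27)
print(bits_needed(5))  # 8 bits cover 5 trials (2^8 = 256 >= 243)

# Per-trial cost approaches the entropy log2(3) bits as blocks grow:
print(bits_needed(1000) / 1000, math.log2(3))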

So $-p_i\log_2 p_i$ by itself is not really the significant thing. The significant thing is storing the result $i$ in $-\log_2 p_i$ bits. In general, if you stored event $i$ as (an average of) $b_i$ bits, then the expected number of bits in a single trial would be:

$$\sum p_ib_i$$

It's just that the ideal storage, which minimizes the expected number of bits over a huge number of trials, is $b_i=-\log_2 p_i$.
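One can check numerically that deviating from $b_i=-\log_2 p_i$ only increases the expected cost. A minimal Python sketch (the alternative code lengths are my own illustration):

```python
import math

probs = [0.5, 0.25, 0.25]

def expected_bits(lengths):
    """Expected bits per trial: sum p_i * b_i."""
    return sum(p * b for p, b in zip(probs, lengths))

optimal = [-math.log2(p) for p in probs]  # b_i = -log2 p_i, i.e. [1, 2, 2]
flat = [math.log2(3)] * 3                 # pretend all three events were equally likely

print(expected_bits(optimal))  # 1.5, the entropy
print(expected_bits(flat))     # ~1.585, strictly worse
```

Mismatched lengths pay a penalty (the Kullback–Leibler divergence between the true distribution and the one the code was designed for), which is why the entropy is the floor.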

ANSWER

You can think of $S=S(p)$ as the complexity of the system, and of $-p_i\log p_i$ as the "potential complexity" contributed by a particular state, although it is unusual to consider one term in isolation from the others.

A more conceptual answer is that everything is devised so that the entropies of independent configurations add: this is due to the fact that $f(x)=-x\log x$ is, up to a scalar multiple, the only nonnegative function satisfying $$ S(p*q)=S(p)+S(q),\quad \text{taking}\quad S(p)=\sum_{i=1}^Nf(p_i), $$ for all $p$ and $q$, where $p*q$ is the probability vector with the $N^2$ components $p_iq_j$.
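The additivity property is easy to verify numerically. A minimal Python sketch (the helper names `S` and `product` are my own):

```python
import math

def S(probs):
    """Entropy S(p) = sum_i f(p_i) with f(x) = -x ln x (terms with p=0 contribute 0)."""
    return -sum(x * math.log(x) for x in probs if x > 0)

def product(p, q):
    """Joint distribution of two independent systems: the N*M components p_i * q_j."""
    return [pi * qj for pi in p for qj in q]

p = [0.5, 0.3, 0.2]
q = [0.6, 0.4]

# Entropy of the combined system equals the sum of the parts:
print(S(product(p, q)), S(p) + S(q))
```

This is exactly why the logarithm appears: probabilities of independent configurations multiply, and the logarithm turns that product into a sum.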