Wallis' derivation of information entropy


In Wallis' combinatorial derivation of the information entropy (see here or the Wikipedia article; the text is copied below), I don't understand what the information $I$ is in

If it happens to conform to the information $I$, we accept it;

It seems very vague to me: what exactly does it mean for a probability assignment to "conform" to $I$?

Full text:

A second and perhaps more intuitive approach to deriving entropy was suggested by G. Wallis. We are given information $I$, which is to be used in assigning probabilities $\{p_1,\dots,p_m\}$ to $m$ different possibilities. We have a total amount of probability

$$\sum_{i=1}^m p_i=1$$

to allocate among them.

The problem can be stated as follows. Choose some integer $n \gg m$, and imagine that we have $n$ little quanta of probability, each of magnitude $\delta=\frac1n,$

to distribute in a way we see fit.

Suppose we were to scatter these quanta at random among the $m$ choices (penny-pitch game into $m$

equal boxes). If we simply toss these quanta of probability at random, so that each box has an equal probability of getting them, nobody can claim that any box is being unfairly favored over any other.

If we do this and the first box receives exactly $n_1$ quanta, the second $n_2$ quanta etc., we will say the random experiment has generated the probability assignment:

$$p_i=n_iδ=\frac{n_i}{n},\text{ with }i=1,2,...,m.$$
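As a concrete illustration of the scattering step (the function name and parameters here are my own, not from the text), one round of the penny-pitch game can be sketched in Python:

```python
import random
from collections import Counter

def scatter_quanta(n, m, rng=random):
    """Toss n probability quanta uniformly at random into m boxes and
    return the resulting probability assignment p_i = n_i / n."""
    counts = Counter(rng.randrange(m) for _ in range(n))
    return [counts[i] / n for i in range(m)]

# One random probability assignment over m = 4 boxes with n = 1000 quanta.
p = scatter_quanta(n=1000, m=4)
```

Each call produces a different assignment, but every assignment sums to 1 by construction.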

The probability that this will happen is the multinomial distribution:

$$m^{-n}\frac{n!}{n_1!\cdots n_m!}$$

Now imagine that we repeatedly scatter the $n$ quanta at random among the m boxes. Each time we do this we examine the resulting probability assignment. If it happens to conform to the information $I$, we accept it; otherwise, we reject it and try again. We continue until some probability assignment $\{p_1,...,p_m\}$

is accepted.
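To make the role of $I$ concrete: $I$ is whatever testable constraint the assignment must satisfy, for example a prescribed mean value. A minimal sketch of the accept/reject loop, with a hypothetical mean constraint standing in for $I$ (the names, constraint, and tolerance are my own, not from the text):

```python
import random
from collections import Counter

def wallis_game(n, m, conforms, max_tries=100_000, rng=random):
    """Repeatedly scatter n quanta into m boxes at random; return the
    first probability assignment that conforms to the information I,
    encoded here as the predicate `conforms`."""
    for _ in range(max_tries):
        counts = Counter(rng.randrange(m) for _ in range(n))
        p = [counts[i] / n for i in range(m)]
        if conforms(p):
            return p
    raise RuntimeError("no conforming assignment found")

# Hypothetical information I: a variable taking values 1..3 with
# probabilities p_1..p_3 must have mean within 0.05 of 2.0.
def mean_is_2(p):
    return abs(sum((i + 1) * pi for i, pi in enumerate(p)) - 2.0) < 0.05

accepted = wallis_game(n=200, m=3, conforms=mean_is_2)
```

Anything that rejects some assignments and accepts others can play the part of $I$; the derivation only requires that conformity be checkable.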

What is the most likely probability distribution to result from this game? It is the one which maximizes

$$W=\frac{n!}{n_1!\cdots n_m!}$$

subject to whatever constraints are imposed by the information $I$.

We can refine this procedure by using smaller quanta, i.e., larger $n$. Using Stirling’s approximation

$$n!∼\sqrt{2πn}\left(\frac{n}{e}\right)^n,$$

and taking its logarithm,

$$\log(n!)\sim\frac12\log(2\pi n)+n\log\left(\frac{n}{e}\right),$$

we have

$$\log(n!)\sim\frac12\log(2\pi n)+n\log(n)-n.$$
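A quick numerical sanity check of this approximation (using Python's `math.lgamma`, which computes $\log\Gamma(n+1)=\log(n!)$ up to floating-point accuracy):

```python
import math

def stirling_log_factorial(n):
    """Stirling's approximation: log(n!) ~ (1/2) log(2 pi n) + n log(n) - n."""
    return 0.5 * math.log(2 * math.pi * n) + n * math.log(n) - n

exact = math.lgamma(100 + 1)           # log(100!) via the log-gamma function
approx = stirling_log_factorial(100)
error = exact - approx                 # small and positive; roughly 1/(12 n)
```

Already at $n=100$ the absolute error in $\log(n!)$ is below $10^{-3}$, so dividing by $n$ later makes the approximation error negligible.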

Taking the logarithm of $W$ and substituting Stirling's approximation for each factorial (the $\frac12\log(2\pi n)$ terms are of order $\frac{\log n}{n}$ after dividing by $n$, so they vanish in the limit) finally gives the information entropy as it appears in Shannon's theorem:

$$\frac1n \log(W) = \frac1n\left(\log n! - \sum_{i=1}^m \log n_i!\right) \approx \log n - \sum_{i=1}^m \frac{n_i}{n}\log n_i \to -\sum_{i=1}^m p_i \log(p_i)=H(p_1,\dots,p_m).$$
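This limit can be checked numerically: holding $\{p_i\}$ fixed and letting $n$ grow, $\frac1n\log W$ approaches the entropy $-\sum_i p_i\log p_i$. A sketch (the helper names are my own; `lgamma` stands in for the factorials):

```python
import math

def log_W(counts):
    """log W = log(n!) - sum_i log(n_i!), via the log-gamma function."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts)

def entropy(p):
    """Shannon entropy -sum_i p_i log(p_i), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.5, 0.3, 0.2]
errors = []
for n in (10**2, 10**4, 10**6):
    counts = [round(pi * n) for pi in p]          # n_i = p_i * n quanta
    errors.append(abs(log_W(counts) / n - entropy(p)))
# the gap shrinks monotonically as the quanta get smaller
```

The discrepancy at each $n$ is dominated by the discarded $\frac{\log n}{n}$ terms, consistent with the remark above.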

To sum up: entropy is a measure of uncertainty. The higher the entropy of a random variable $X$, the more uncertainty it carries. When the goal is to find a maximally ignorant distribution, this goal translates into a maximization problem: find the distribution with maximal entropy subject to the existing constraints. This will be the topic of the next section.