maximum information rate probability distribution


I recently finished J.R. Pierce's "An Introduction to Information Theory", and in chapter XII he states the following (p. 239):

"It is an easy mathematical task to show that if a speaking time $t_r$ is associated with the $r$th word in order of commonness, then for a message composed of randomly chosen words the information rate will be greatest if the $r$th word is chosen with a probability $p(r)$ given by $$p(r) = 2^{-ct_r}$$ Here $c$ is a constant chosen to make the sum of the probabilities for all words add up to unity."

Basically, you have a list of words (not necessarily finite), listed in the order of commonness or rank, so that the first word ($r=1$) might be "the", the most common word, and a word further down in the list (e.g. $r=100$) may be "chronology" or some other less common word. Associated with each word is the speaking time, $t_r$, which is the time required to speak the $r$th word.

I would like to know how he arrived at this pmf. I can see that it is a valid pmf with $c$ chosen correctly: since each $t_r > 0$, the sum $\sum_{r=1}^{n} 2^{-ct_r}$ is continuous and strictly decreasing in $c$, with $$\lim_{c\to-\infty}\sum_{r=1}^{n} 2^{-ct_r} = \infty$$ and $$\lim_{c\to\infty}\sum_{r=1}^{n} 2^{-ct_r} = 0,$$ so by the intermediate value theorem there is a unique $c \in \mathbb{R}$ for which the probabilities sum to unity. Is there a closed-form solution for $c$? Here I've assumed a finite list of words.
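I don't see a closed form in general (the normalization condition $\sum_r 2^{-ct_r} = 1$ is a sum of exponentials with arbitrary exponents $t_r$), but since the sum is strictly decreasing in $c$, bisection finds $c$ easily. A minimal sketch, using made-up speaking times for a five-word list:

```python
import math

# Hypothetical speaking times (seconds) for a finite 5-word list -- illustrative only.
t = [0.2, 0.3, 0.3, 0.5, 0.8]

def total_prob(c):
    """Sum of 2^(-c * t_r) over all words; we want this to equal 1."""
    return sum(2.0 ** (-c * tr) for tr in t)

# total_prob is continuous and strictly decreasing in c (since each t_r > 0),
# so bisection on a bracketing interval finds the unique root of total_prob(c) = 1.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if total_prob(mid) > 1.0:
        lo = mid   # sum still too big: need larger c
    else:
        hi = mid
c = 0.5 * (lo + hi)

p = [2.0 ** (-c * tr) for tr in t]
print(c, sum(p))  # sum(p) should be ~1
```

In special cases there is a closed form: e.g. if all $t_r$ are equal to $t$, then $2^{-ct} = 1/n$ gives $c = (\log_2 n)/t$.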

More importantly, how does he arrive at this pmf? What is this so-called "easy mathematical task"? This seems reminiscent of the zeta or Zipf distribution, but I'm not sure.
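For concreteness, here is a numerical sanity check (not the derivation itself). I'm assuming "information rate" means entropy per unit speaking time, $R(p) = H(p)/\sum_r p(r)\,t_r$ in bits per second; under that reading, random perturbations of $p(r) = 2^{-ct_r}$ should never increase the rate, and the maximized rate should come out equal to $c$ itself (since $H(p) = -\sum_r p(r)\log_2 2^{-ct_r} = c\sum_r p(r)t_r$). The speaking times below are made up:

```python
import math
import random

t = [0.2, 0.3, 0.3, 0.5, 0.8]  # hypothetical speaking times (seconds)

def find_c():
    """Bisection for the c with sum_r 2^(-c*t_r) = 1 (sum is decreasing in c)."""
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(2.0 ** (-mid * tr) for tr in t) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c = find_c()
p_opt = [2.0 ** (-c * tr) for tr in t]

def rate(p):
    """Entropy per unit speaking time: H(p) / E[t], in bits per second."""
    H = -sum(pi * math.log2(pi) for pi in p if pi > 0)
    T = sum(pi * ti for pi, ti in zip(p, t))
    return H / T

random.seed(0)
best = rate(p_opt)
for _ in range(1000):
    # Randomly perturb the optimal pmf and renormalize; rate should not improve.
    q = [pi * random.uniform(0.5, 1.5) for pi in p_opt]
    s = sum(q)
    q = [qi / s for qi in q]
    assert rate(q) <= best + 1e-12
```

This checks the claim numerically but doesn't explain it; the check also confirms $R(p_{\text{opt}}) = c$, which suggests the derivation is a constrained maximization of $H(p)/\bar{t}$, perhaps via Lagrange multipliers.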