In Shannon's paper "A Mathematical Theory of Communication," section I.7 "The Entropy of an Information Source," he discusses some results regarding the entropy $H$.
In Theorem 3 he states the following:
Theorem 3: Given any $\epsilon > 0$ and $\delta > 0$, we can find an $N_0$ such that the sequences of any length $N \ge N_0$ fall into two classes:
- A set whose total probability is less than $\epsilon$.
- The remainder, all of whose members have probabilities satisfying the inequality $$\big| \frac{\log p^{-1}}{N} - H \big| < \delta$$
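To see what Theorem 3 is saying numerically, here is a quick simulation of my own (not from the paper) for a Bernoulli($p$) source with $p = 0.3$: for a long random sequence, $\frac{1}{N}\log p^{-1}$ should land within $\delta$ of $H$ with high probability.

```python
import math
import random

# Empirical check of Theorem 3 for a Bernoulli(p) source (my own sketch,
# not from the paper): for a long random sequence, (1/N) * log2(1/Pr[sequence])
# should be close to the source entropy H.
p = 0.3
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # source entropy in bits

random.seed(0)
N = 10_000
ones = sum(random.random() < p for _ in range(N))  # draw one length-N sequence

# log2(1 / Pr[sequence]) depends only on the number of ones in the sequence
log_inv_prob = -ones * math.log2(p) - (N - ones) * math.log2(1 - p)
print(log_inv_prob / N, H)  # the two numbers are close for large N
```

Running this, the per-symbol value $\frac{\log p^{-1}}{N}$ comes out very close to $H \approx 0.8813$ bits, which matches the "remainder" class in the theorem.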
He then introduces Theorem 4 as follows:
A closely related result deals with the number of sequences of various probabilities. Consider again the sequences of length $N$ and let them be arranged in order of decreasing probability. We define $n(q)$ to be the number we must take from this set starting with the most probable one in order to accumulate a total probability $q$ for those taken.
Theorem 4: $$\lim_{N \to \infty} \frac{\log n(q)}{N} = H$$ when $q$ does not equal $0$ or $1$.
I don't understand the definition of $n(q)$ here, or what he means by accumulating a total probability $q$. Did he mean adding up the probabilities of the sequences, from the most probable to the least, until the running sum reaches $q$? Or did he mean multiplying them to get the probability of the intersection of these sequences (assuming independence)?
Also, what does the function $n(q)$ tell us about the information source? And what does this theorem imply?
I'd also appreciate a proof of theorem 4 if possible.
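To make my question concrete, here is a small experiment I ran under the first (summing) interpretation, again for a Bernoulli($p$) source. All length-$N$ sequences with the same number of ones $k$ share the probability $p^k(1-p)^{N-k}$, and there are $\binom{N}{k}$ of them, so the sequences can be sorted in groups rather than enumerating all $2^N$ of them:

```python
import math

# My attempt at the "summing" interpretation of n(q) for a Bernoulli(p)
# source: sort sequences by probability (descending, in groups of equal
# probability), add probabilities until the running total reaches q, and
# count how many sequences were needed.
def n_of_q(p, N, q):
    # (per-sequence probability, number of such sequences), most probable first
    groups = sorted(
        ((p ** k * (1 - p) ** (N - k), math.comb(N, k)) for k in range(N + 1)),
        reverse=True,
    )
    total, count = 0.0, 0
    for prob, size in groups:
        if total + prob * size >= q:
            # the threshold q is crossed inside this group
            return count + math.ceil((q - total) / prob)
        total += prob * size
        count += size
    return count

p, q = 0.3, 0.5
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # ~0.8813 bits
for N in (50, 200, 800):
    print(N, math.log2(n_of_q(p, N, q)) / N)  # slowly approaches H
```

Under this reading, $\frac{\log n(q)}{N}$ does creep toward $H$ as $N$ grows, which is consistent with Theorem 4 — but I'd like confirmation that this is the intended definition.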