What's the intuition behind typical sequences?


Given a probability distribution $x\mapsto p(x)$, some integer $n>0$, and some $\epsilon>0$, one defines the $\epsilon$-typical sequences as those sequences $\boldsymbol{x}\equiv(x_1,...,x_n)$ satisfying $$\left|-\frac1 n \log[p(\boldsymbol{x})] - H(X) \right| < \epsilon,$$ where $p(\boldsymbol{x})\equiv \prod_{k=1}^n p_X(x_k)$.
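For concreteness, here is a minimal sketch in Python of a check that follows this definition directly (the function name and the dictionary representation of $p$ are my own choices, not standard notation):

```python
import math

def is_weakly_typical(x, p, eps):
    """Return True if |-(1/n) log2 p(x) - H(X)| < eps for the sequence x.

    x   : sequence of symbols, e.g. [0, 1, 1, 0]
    p   : dict mapping each symbol to its probability p(x)
    eps : typicality tolerance
    """
    n = len(x)
    # Entropy H(X) = -sum_x p(x) log2 p(x)
    H = -sum(q * math.log2(q) for q in p.values() if q > 0)
    # Sample entropy -(1/n) log2 [p(x_1) * ... * p(x_n)]
    sample_entropy = -sum(math.log2(p[s]) for s in x) / n
    return abs(sample_entropy - H) < eps
```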

My rough intuition for typical sequences was that they are the ones containing $np_i$ copies of the $i$-th symbol, for each $i$, and this intuition was backed up by the facts that $$\binom{n}{np}\,p^{np} (1-p)^{n(1-p)} \approx 1 \ \text{(to first order in the exponent)}, \qquad p^{np} (1-p)^{n(1-p)} = 2^{-nH(X)}.$$ However, I recently realised that, for binary sequences with $p_0=p_1=1/2$, going by the definition above, all binary sequences are typical, regardless of $\epsilon$ and $n$, since every sequence has probability $p(\boldsymbol{x})=2^{-n}=2^{-n H(X)}$.
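Reusing the `is_weakly_typical` sketch from above (again, my own helper, not library code), the balanced case can be confirmed numerically: every binary sequence passes the test, no matter how small $\epsilon$ is.

```python
from itertools import product

p = {0: 0.5, 1: 0.5}   # balanced binary source, H(X) = 1 bit
n, eps = 8, 1e-9
# All 2^8 binary sequences of length 8 are eps-typical, even for tiny eps,
# because every sequence has probability exactly 2^{-n} = 2^{-n H(X)}.
assert all(is_weakly_typical(x, p, eps) for x in product([0, 1], repeat=n))
```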

So what gives? Is the intuition that typical sequences are those with $np_i$ copies of each symbol tenable at all? Does it fail only for balanced probability distributions, or are there other "edge cases" to be mindful of?


There are 2 answers below.


In the maximal-entropy case ($p=1/2$ in the binary setting), all sequences are typical; this is a feature of the definition.

If $p$ is far from $1/2$, say $p\in(0,1/4)$, then there are far fewer typical sequences and you get compressibility: with high probability you only have to encode about $2^{nH(p)}$ sequences, while the rest occur with some tiny total probability $\epsilon$, say, so you can just flag them with a header symbol and not encode them. This makes little difference to the expected length of your source code.
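As a rough numerical illustration of this compression argument (the source parameter $p=0.2$, the block length, and the window width below are arbitrary choices of mine):

```python
import math

p, n = 0.2, 100
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # about 0.72 bits/symbol

# Probability mass carried by sequences whose count of ones lies within
# +/- delta*n of the mean n*p.
delta = 0.05
lo, hi = math.floor(n * (p - delta)), math.ceil(n * (p + delta))
mass = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, hi + 1))

print(f"about 2^{n*H:.0f} typical sequences vs 2^{n} total")
print(f"P(# of ones within {delta:.0%} of n*p) ~ {mass:.2f}")
```

For these numbers, roughly $nH(p)\approx 72$ bits per block suffice to cover almost all of the probability mass, versus the $n=100$ bits needed for an uncompressed sequence.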

And yes, your intuition is correct, in the following sense.

  1. You can define what is called "strong typicality" directly via the empirical symbol counts being very close to $n p_i$ (equivalently, the empirical proportions being close to $p_i$).
  2. The definition given in your question is sometimes called "weak typicality"; it is easier to handle mathematically, since it lets us work with the entropy function directly, without having to manipulate the messy multinomial distribution each time.

However, you can prove stronger results using strong typicality.
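As a small sketch of the difference (one common formulation of strong typicality; the function name and tolerance are mine), note that your balanced-case counterexample already separates the two notions: with $p_0=p_1=1/2$, the all-zeros sequence is weakly typical but not strongly typical.

```python
from collections import Counter

def is_strongly_typical(x, p, eps):
    """Strong typicality (one common variant): every empirical symbol
    frequency is within eps of the true probability, |n_i/n - p_i| <= eps."""
    n = len(x)
    counts = Counter(x)
    return all(abs(counts.get(s, 0) / n - q) <= eps for s, q in p.items())

p = {0: 0.5, 1: 0.5}
x = [0] * 20
# Weakly typical (its sample entropy is exactly H = 1 bit) ...
# ... but not strongly typical: the empirical frequency of 1 is 0, not ~0.5.
print(is_strongly_typical(x, p, eps=0.1))   # False
```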

Edit: See, for example, Section 10.6 of Cover and Thomas' book Elements of Information Theory (2nd ed.), entitled "Strongly Typical Sequences and Rate Distortion". Essentially, tight two-sided (both lower and upper) convergence bounds can be proved for the quantities of interest.



> My rough intuition for typical sequences was that they are the ones containing $n p_i$ copies of the $i$-th symbol

That would be a different kind of typicality, where the empirical frequencies (sample proportions) are close to the true frequencies (probabilities). But we don't require that; we require that the "empirical" or sample entropy $\frac1 n \sum_k \log(1/p(x_k))$ is close to the true entropy $H(X)= E[\log(1/p(X))]$.

Related: Exercise 3.4 in the Cover and Thomas textbook studies the relationship between two typical sets: the sequences that are typical with respect to the entropy and those that are typical with respect to the mean.



Edit: Some more elaboration

Let $X$ take values in $\{1,2,3,\dots\}$ and let $p_i=P(X=i)$. Let $\hat p_i=n_i/n$ be the empirical frequency of the value $i$ in an $n$-sized sample $\mathcal X=(x_1, x_2, \dots, x_n)$.

Let $\hat H= \frac{1}{n}\sum_{k=1}^n \log(1/p(x_k))$ be the "sample entropy" and $\Delta_H=\hat H- H$.

Hence the condition for the realization to be typical is $\mathcal X \in A^n_{\epsilon} \iff |\Delta_H| \le \epsilon$.

Now, letting $\ell_i = \log(1/p_i)$,

$$ \Delta_H = \frac{1}{n} \sum_i n_i \ell_i- \sum_i p_i \ell_i = \sum_i \delta_i \ell_i \tag 1$$

where $\delta_i = \hat p_i - p_i$ is the discrepancy between the empirical probability (histogram) and the true probability mass function. Then, by the Cauchy–Schwarz inequality:

$$ |\Delta_H|^2 \le (\sum \delta_i^2) (\sum \ell_i^2) \tag 2$$

This implies that, if the realization is "typical with respect to the empirical frequencies" (so that $\sum_i \delta_i^2 \le \epsilon'$), then it is also typical with respect to the entropy (with a different tolerance, $\epsilon = \sqrt{\epsilon' \sum_i \ell_i^2}$).

But, as your counterexample already showed, the converse is not true. In particular, if $\ell_i$ is constant in $i$, then automatically $\Delta_H=0$.
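A quick numerical sanity check of identity $(1)$ and bound $(2)$, with an arbitrarily chosen three-symbol source (all parameter values here are my own illustration):

```python
import math
import random
from collections import Counter

p = {1: 0.5, 2: 0.3, 3: 0.2}
ell = {i: math.log2(1 / q) for i, q in p.items()}            # l_i = log(1/p_i)
H = sum(q * ell[i] for i, q in p.items())                    # true entropy

random.seed(0)
n = 1000
x = random.choices(list(p), weights=list(p.values()), k=n)   # n-sized sample

counts = Counter(x)
delta = {i: counts.get(i, 0) / n - q for i, q in p.items()}  # empirical - true
sample_H = sum(ell[s] for s in x) / n                        # sample entropy
Delta_H = sample_H - H

# Identity (1): Delta_H = sum_i delta_i * l_i
assert math.isclose(Delta_H, sum(delta[i] * ell[i] for i in p), abs_tol=1e-9)
# Bound (2): Cauchy-Schwarz
assert Delta_H**2 <= sum(d**2 for d in delta.values()) * sum(l**2 for l in ell.values()) + 1e-12
```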