Suppose that sequence $Z_1,Z_2,\dots,Z_n$ is drawn i.i.d. from some distribution (pmf) $Q_Z$. Prove that the probability that it's typical with respect to some other distribution $P_Z$ is roughly $2^{-nD(P_Z||Q_Z)}$. Note that $D(.||.)$ denotes Kullback–Leibler divergence and the typical set with respect to $P_Z$ is defined as (I'm not sure that the definition is correct): $$T_\epsilon^{(n)} =\{z_1^{n} : H(P_Z)-\epsilon\le-\frac{1}{n}\log_2P_{Z_1,Z_2,\dots , Z_n}(z_1,z_2,\dots,z_n)\le H(P_Z)+\epsilon\}$$ So we want to show that $$P[z_1^n\in T_\epsilon^{(n)}]\approx 2^{-nD(P_Z||Q_Z)}$$In the case that $Q_Z$ is the same as $P_Z$, we can invoke WLLN to show that the probability of typical set is nearly $1$. In this case, I don't know how to apply WLLN. Also I don't know whether $$P_{Z_1,Z_2,\dots , Z_n}(z_1,z_2,\dots,z_n) = P_{Z_1}(z_1)P_{Z_2}(z_2)\dots P_{Z_n}(z_n)$$ holds because $P_Z$ is not the true distribution.
Probability that a sequence is typical with respect to some distribution
111 Views Asked by Bumbble Comm https://math.techqa.club/user/bumbble-comm/detail AtThere are 2 best solutions below
On
I'm not 100% certain whether this helps, but I'm pretty sure that result in your problem belongs to a field called "Large Deviation Theory." There is a a statement called Sanov's Theorem that characterizes this probability.
Roughly speaking, Sanov's Theorem states that the probability that your empirical distribution lies in some set is exactly proportional to exponential of the negative projection of the KL Divergence, so something like $e^{-n D(P || Q)}.$
Maybe look into that, and see if it helps? Check out this source: https://www.cs.cmu.edu/~aarti/Class/10704_Fall16/lec23.pdf
You can see that the statement of Sanov's Theorem on the first page has a form very similar to the one you want to obtain, so I suspect that a simple application of Sanov's Theorem should get you what you need.
This turns out to be untrue for the set of weakly typical (or entorpy-typical) sequences - the upper bound on $Q^n(\mathcal{T})$ need not hold. The argument below uses some ideas from the method of types.
Let $R$ be a law such that $$ \sum (R(a) - P(a)) \log P(a) = 0; D(R\|Q) < D(P\|Q).$$
Then observe that sequences of type approximately $R$ are part of the $\varepsilon$-weakly-typical set of $P$ (see the argument below). But the chance of finding a sequence which is roughly of $R$ type is roughly $\exp( - nD( R\|Q)) \gg \exp(- n D(P\|Q)).$ Thus, the chance of drawing a $P$-weakly-typical set under $Q$ is considerably larger than $\exp( - nD(P\|Q)).$
Such laws indeed exist (see [1] below), violating the upper bound on the $Q^n$ mass of the $P$-weakly typical set.
We can show a lower bound of this required form, however: it holds that $$ Q(\mathcal{T}^{n, P}_\varepsilon) \ge \exp( - n ( D(P\|Q) - O(\varepsilon)).$$ To show this, write $$ Q(x^n) = \frac{Q(x^n)}{P(x^n)} \cdot P(x^n) = \exp( n \log P(x^n) + \sum n \hat{P}_{x^n}(a) \log Q(a)/P(a)),$$ where $\hat{P}_{x^n}$ is the empirical law induced by $x^n$. For weakly typical sequences, the first term is $- n(H \pm \varepsilon)$. Further define the set $\mathcal{G}^{n,P}_\varepsilon := \{x^n: \sum \hat{P}_{x^n}(a) \log Q(a)/P(a) \ge - D(P\|Q) - \varepsilon\}.$ You can show using the WLLN and the fact that $P^n(\mathcal{T}^{n,P}_\varepsilon) \ge 1-\varepsilon$ that $|\mathcal{T}^{n,P}_\varepsilon \cap \mathcal{G}^{n,P}_\varepsilon| \ge 2^{-(n+2)\varepsilon}|\mathcal{T}^{n,P}|,$ which suffices for the argument (do this!).
Note also that that there are alternate notions of typicality (strong typicality, robust typicality, etc.) that do satisfy the statement of the question (i.e., both in terms of upper and lower bounds). Part of the reason for defining such notions of typicality is precisely the convenience yielded by such results. See, e.g., Csiszar and Korner, Ch. 2. for a brief discussion (this book heavily uses the method of types, and mostly works with strong-typicality, sometimes called P-typicality).
Let me finish with a less hand-wavy argument.
For a natural $n$, a type is a probability distribution of the form $(k_1/n, \dots, k_{|\mathcal{A}|}/n)$, where the $k_i$ are all natural numbers. A $n$-length sequence $x^n$ is of type $T$ if its empirical law is $T$. Note that weak typicality can be defined in terms of type: since $P^n(x_1^n) = \prod P(x_i) = \exp( \sum_{a} \hat{P}(a)\log P(a)),$ where $\hat{P}(a)$ is the empirical law induced by $x^n,$ a sequence $x^n$ is weakly typical iff its type satisfies $ | \sum (\hat{P}(a) - P(a))\log P(a)| \le \varepsilon.$
For given $n$ and type $T$, define the type-T set as $\mathcal{T}^{T,n} := \{x^n : x^n \textrm{ is of type } T\}.$ It is a standard calculation of the method of types (see, e.g., Cover and Thomas Ch. 11), that for any law $Q, Q^n(\mathcal{T}^{T,n}) = \exp( -n (D(T\|Q) - \kappa_n) ),$ where $0 \le \kappa_n \le |\mathcal{A}|\log n/n$. Similar calculations also show up in the notes that Alan Chung linked in their answer.
Things will depend a little on exactly what you define $\approx$ to be, but let me take it as $Q^n(E_n) \approx 2^{-n\rho}$ if for every $\varepsilon > 0,$ for all large enough $n$, we have $ -\rho + \varepsilon \ge \frac{1}{n} \log Q^n(E_n) \ge -\rho - \varepsilon.$
Now take the $P,Q,R$ defined below. Fix any positive $\varepsilon < 50[D(P\|Q) - D(R\|Q)]/52.$ For any $n$, define the type $T_n$ such that $T_n(1) = \lfloor n R(1)\rfloor/n, T_n(2) = \lfloor nR(2)\rfloor/n, T_n(3) = 1-T_n(2) - T_n(1).$
Then for all $n$ large enough, by continuity, we have $|D(T_n\|Q) - D(R\|Q)| < \varepsilon/100.$ Further, choose $n$ so large that $\kappa_n < \varepsilon/100.$ By the above calculation, we find that $Q^n(\mathcal{T}^{T_n,n}) \ge \exp( - n (D (R\|Q) + \varepsilon/50)).$ But, since $|T_n(i) - R(i)| \le 1/n,$ we have that $|\sum T_n(i) - P(i) \log P(i)| \le 3\log(6)/n.$ For large $n$, this is much smaller than $\varepsilon,$ which means that any sequence of type $T_n$ is $P$-weakly typical.
We thus conclude that \begin{align} Q^n(\mathcal{T}^{n,P}_\varepsilon) &\ge Q^n(\mathcal{T}^{T_n,n}) \ge \exp( - n(D(R\|Q) + \varepsilon/50)) \\ &\ge \exp( - n(D(P\|Q) - (D(P\|Q) - D(R\|Q)) + \varepsilon/50)) \\ &\ge \exp( - n(D(P\|Q) - \varepsilon (51/50)).\end{align}
In other words, we find that for all small enough $\varepsilon,$ we can find an $n$ such that $\frac{1}{n} \log Q^n(\mathcal{T}^{n,P}) > -D(P\|Q) + 51/50 \varepsilon,$ violating the above definition. All reasonable such definitions are similarly violated upon considering $\varepsilon$ small enough, basically exploiting that $D(P\|Q) - D(R\|Q)$ is a constant.
[1]: An explicit example of $P,Q,R$ of the required type is as follows. pick $P = (1/2,1/3, 1/6), Q = (1/2, 1/4, 1/4).$ Denote $R$ as $(x,y, 1-x-y)$. The first condition becomes $$ x \log(2) + y \log(3) + (1-x-y) \log 6 = \log(2)/2 + \log(3)/3 + \log(6)/6.$$ Using $\log(6) = \log(2) + \log(3),$ this simplifies to $$ (1/3-y) \log(2) + (1/2-x) \log(3) = 0 \iff x = \frac12 + (1/3 - y) \cdot\frac{\log(2)}{\log(3)}$$
The second condition is $$ x \log 2x + y \log 4y + (1-x - y)\log(4(1-x-y)) < \log(4/3)/3 + \log(4/6)/6.$$ Solutions exist by considering $$f(y) = x \log 2x + y \log 4y + (1-x - y)\log(4(1-x-y)) \\ - \log(4/3)/3-\log(4/6)/6,$$ under the above constraint on $x$. It can be shown that $f(1/3) = 0$ and $f'(1/3) = \log 2 \log(4/3)/\log(3) > 0,$ meaning that slightly decreasing $y$ yields an example. Of course, one can just explicitly solve via a CAS to find that, for instance, $(0.6, 0.17483..., 0.4-0.17483...)$ is a valid example.