Why does relative entropy decrease under pushforward?


I am reading the paper at https://arxiv.org/abs/1006.3028 (J. Lehec, "Representation formula for the entropy and functional inequalities"). The main concept here is the relative entropy of the probability measures $\mu$ and $\gamma$, defined as $$H(\mu | \gamma)=\int \log\left( \frac{d\mu}{d\gamma}\right) d\mu, $$ or $+\infty$ if $\mu$ is not absolutely continuous with respect to $\gamma$ (that is, the density $\frac{d\mu}{d\gamma}$ does not exist). This is also known as the Kullback-Leibler divergence.
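For concreteness, here is a minimal numerical sketch (my own, not from the paper) of this definition in the discrete case, where the density $\frac{d\mu}{d\gamma}$ is just the entrywise ratio of probability vectors:

```python
import numpy as np

def relative_entropy(mu, gamma):
    """H(mu | gamma) = sum_i mu_i * log(mu_i / gamma_i) for discrete
    probability vectors; returns +inf if mu is not absolutely
    continuous w.r.t. gamma (i.e. gamma_i = 0 < mu_i for some i)."""
    mu, gamma = np.asarray(mu, float), np.asarray(gamma, float)
    if np.any((gamma == 0) & (mu > 0)):
        return np.inf
    mask = mu > 0  # the convention 0 * log(0) = 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / gamma[mask])))

mu = [0.5, 0.3, 0.2]
gamma = [0.25, 0.25, 0.5]
print(relative_entropy(mu, gamma))   # positive
print(relative_entropy(mu, mu))      # 0.0
```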


Remark on sign conventions. This definition seems to be the more common one in information theory. With this definition, $H(\mu| \gamma)$ is a nonnegative convex function of $\mu$. The common physicist's definition, on the other hand, has the opposite sign; it is thus a nonpositive concave function of $\mu$.


The first inequality in the second section reads $$\tag{1} H(\mu\circ T^{-1} | \gamma\circ T^{-1})\le H(\mu | \gamma)$$ for all measurable maps $T$.

Main question. What is the fastest proof of (1)?

Following the references in the paper I actually found a proof. In the book "Large deviations and applications" by Varadhan (reference [24], Section 10) I see that the relative entropy can be characterized as $$ H(\mu|\gamma)=\inf\left\{ c\,:\, \int F\, d\mu \le c + \log \int e^F\, d\gamma,\ \forall F \text{ bounded and measurable}\right\}.$$ Using this characterization, (1) follows. I wonder if there is a way to avoid the characterization, though.
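To spell out how (1) follows: for any bounded measurable $F$ on the target space, $F\circ T$ is bounded and measurable on the source space, so for every admissible constant $c$ in the characterization of $H(\mu|\gamma)$,
$$\int F\, d(\mu\circ T^{-1})=\int F\circ T\, d\mu\le c+\log\int e^{F\circ T}\, d\gamma=c+\log\int e^{F}\, d(\gamma\circ T^{-1}),$$
so $c$ is also admissible for $H(\mu\circ T^{-1}|\gamma\circ T^{-1})$, and taking the infimum gives (1).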


NOTE. The characterization is an immediate consequence of the convex duality described in this question, which is an application of the Jensen inequality.
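(Explicitly, the duality in question is the Donsker-Varadhan variational formula $$H(\mu|\gamma)=\sup\left\{\int F\,d\mu-\log\int e^F\,d\gamma\,:\,F\text{ bounded and measurable}\right\},$$ of which the characterization above is a rephrasing.)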


Secondary question. The word "entropy" makes me think of the second law of thermodynamics, and it suggests some quantity that is monotonic in time. Now, the map $\mu\mapsto \mu\circ T^{-1}$ can be interpreted as a step in time for the discrete dynamical system $x\mapsto T(x)$. Can (1) be seen as a version of the second law of thermodynamics for such discrete systems?


BEST ANSWER

Suppose $\mu$ and $\gamma$ are probability measures on $(X,\mathscr{F})$, $\mu\ll\gamma$, and $T:(X,\mathscr{F})\rightarrow(Y,\mathscr{G})$ measurable.

Then of course $\mu\circ T^{-1}\ll\gamma\circ T^{-1}$, for $\gamma\circ T^{-1}(A)=\gamma(T^{-1}(A))=0$ implies $\mu(T^{-1}(A))=\mu\circ T^{-1}(A)=0$.

Claim:
$$ \mathbb{E}_\gamma\Big[\frac{d\mu}{d\gamma}\big|\sigma(T)\Big]=\frac{d(\mu\circ T^{-1})}{d(\gamma\circ T^{-1})}\circ T$$

Let $h:(Y,\mathscr{G})\rightarrow(\mathbb{R},\mathscr{B}(\mathbb{R}))$ be a measurable function such that $\mathbb{E}_\gamma\Big[\frac{d\mu}{d\gamma}\big|\sigma(T)\Big]=h\circ T$ (any function $\phi$ that is measurable with respect to $\sigma(T)$ admits a representation of the form $\phi=h_\phi\circ T$ for some measurable function $h_\phi$ on $Y$). Then, for any $B\in\mathscr{G}$, $$\begin{align} \int_Y \mathbb{1}_B\,\frac{d(\mu\circ T^{-1})}{d(\gamma\circ T^{-1})}\, d(\gamma\circ T^{-1})&=\int_Y \mathbb{1}_B\,d(\mu\circ T^{-1})=\int_X \mathbb{1}_B\circ T\,d\mu\\ &=\int_X\mathbb{1}_{T^{-1}(B)}\frac{d\mu}{d\gamma}\,d\gamma=\int_X\big(\mathbb{1}_{B}\circ T \big)\,\mathbb{E}_\gamma\Big[\frac{d\mu}{d\gamma}\big|\sigma(T)\Big]\,d\gamma\\ &=\int_X \big(\mathbb{1}_B\circ T\big)\, h\circ T\,d\gamma =\int_Y\mathbb{1}_B\,h\,d(\gamma\circ T^{-1}) \end{align} $$ This proves that (i) $\frac{d(\mu\circ T^{-1})}{d(\gamma\circ T^{-1})}=h$ $(\gamma\circ T^{-1})$-almost surely, and so (ii) $\frac{d(\mu\circ T^{-1})}{d(\gamma\circ T^{-1})}\circ T=\mathbb{E}_\gamma\big[\frac{d\mu}{d\gamma}\big|\sigma(T)\big]$ $\gamma$-almost surely.
$\Box$
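As a sanity check of the claim (my own, not part of the answer): on a finite space the conditional expectation given $\sigma(T)$ is just the $\gamma$-weighted average of $\frac{d\mu}{d\gamma}$ over each fiber $T^{-1}(\{y\})$, and the two sides can be compared directly:

```python
import numpy as np

# Finite space X = {0,...,5}; T coarse-grains to Y = {0,1,2}.
mu    = np.array([0.10, 0.25, 0.05, 0.20, 0.30, 0.10])
gamma = np.array([0.15, 0.15, 0.20, 0.20, 0.10, 0.20])
T = np.array([0, 0, 1, 1, 2, 2])  # T(x) = x // 2

f = mu / gamma  # Radon-Nikodym density d(mu)/d(gamma) on X

# E_gamma[f | sigma(T)]: gamma-weighted average of f on each fiber
cond = np.array([np.sum(gamma[T == T[x]] * f[T == T[x]])
                 / np.sum(gamma[T == T[x]]) for x in range(6)])

# Pushforward measures on Y, then d(mu T^-1)/d(gamma T^-1) composed with T
mu_push    = np.array([mu[T == y].sum() for y in range(3)])
gamma_push = np.array([gamma[T == y].sum() for y in range(3)])
pullback = (mu_push / gamma_push)[T]

print(np.allclose(cond, pullback))  # True
```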

Let $\eta(x)=x \log(x)\mathbb{1}_{(0,\infty)}(x)$ on $[0,\infty)$. It is easy to check that $\eta$ is convex on $[0,\infty)$, and that for any pair of measures $\mu$, $\gamma$ with $\mu\ll\gamma$ $$H(\mu|\gamma):=\int_X\log\big(\frac{d\mu}{d\gamma}\big)\,d\mu=\int_X\log\big(\frac{d\mu}{d\gamma}\big)\,\frac{d\mu}{d\gamma}\,d\gamma=\int_X\eta\big(\frac{d\mu}{d\gamma}\big)\,d\gamma$$ Finally, applying Jensen's inequality to conditional expectations yields

$$\begin{align} H(\mu\circ T^{-1}|\gamma\circ T^{-1})&=\int_Y\eta\left(\frac{d\mu\circ T^{-1}}{d\gamma\circ T^{-1}}\right)\,d(\gamma\circ T^{-1})\\ &=\int_X\eta\Big(\frac{d\mu\circ T^{-1}}{d\gamma\circ T^{-1}}\circ T\Big)\,d\gamma\\ &=\int_X\eta\Big(\mathbb{E}_\gamma\big[ \frac{d\mu}{d\gamma}\big|\sigma(T)\big]\Big)\,d\gamma\\ &\leq\int_X\mathbb{E}_\gamma\big[ \eta\big(\frac{d\mu}{d\gamma}\big)\big|\sigma(T)\big]\,d\gamma\\ &=\int_X\eta\big(\frac{d\mu}{d\gamma}\big)\,d\gamma=H(\mu|\gamma) \end{align}$$ which is the desired inequality.
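Numerically, the whole chain can be checked on a finite example (a minimal sketch of my own, with arbitrary test distributions and $T$ a coarse-graining map):

```python
import numpy as np

mu    = np.array([0.10, 0.25, 0.05, 0.20, 0.30, 0.10])
gamma = np.array([0.15, 0.15, 0.20, 0.20, 0.10, 0.20])
T = np.array([0, 0, 1, 1, 2, 2])  # coarse-graining map X -> Y

def H(p, q):  # discrete relative entropy, assuming p << q and p > 0
    return np.sum(p * np.log(p / q))

mu_push    = np.array([mu[T == y].sum() for y in range(3)])
gamma_push = np.array([gamma[T == y].sum() for y in range(3)])

print(H(mu_push, gamma_push) <= H(mu, gamma))  # True
print(H(mu_push, gamma_push), H(mu, gamma))
```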

ANSWER

This is more a comment than an answer, intended to address question 2 through a rather rudimentary presentation of the thermodynamic formalism. Others are welcome to contribute.

Postulate A. A thermodynamic system is equivalent to a measure space $(X,\mathscr{B},\mu)$. $X$ is called the phase space; $\mu$ is a $\sigma$-finite measure.

A dynamical law (or rather, an autonomous-in-time dynamical law) in a thermodynamic system is described by a collection of measurable transformations $S=\{S_t:t\in\mathbb{T}\}$. The index set $\mathbb{T}$, denoting time, may be either discrete ($\mathbb{Z}$ or $\mathbb{N}\cup\{0\}$, for example) or continuous ($\mathbb{R}$ or $[0,\infty)$). The dynamical law satisfies the following semigroup properties:

  1. $S_0(x)=x$ for all $x$.
  2. $S_{t+t'}(x)= S_t(S_{t'}(x))$ for all $t,t'\in\mathbb{T}$ and $x\in X$.
  3. When $\mathbb{T}=\mathbb{Z}$ or $\mathbb{T}=\mathbb{R}$, the system $S$ is invertible and describes a time-reversible system: $S_{t}\circ S_{-t}=S_0=S_{-t}\circ S_t$. If $S$ is such that not all $S_t$ are invertible ($\mathbb{T}=\mathbb{N}\cup\{0\}$ or $\mathbb{T}=[0,\infty)$), we say that $S$ is a noninvertible system (one cannot go back in time).

For any $x\in X$, $\{S_t(x):t\in\mathbb{T}\}$ is called the trajectory of $x$. To study the way in which the dynamics change over time one may consider the individual trajectory of each point $x$ in the phase space; or, as in ergodic theory, one can study the way in which the dynamics affect an infinite number of points. This is done, in probabilistic terms, by studying how the system alters densities. A density $f$ is a measurable function $f\geq0$ such that $\int_Xf\,d\mu=1$.

Postulate B. A thermodynamic system has, at any given time $t$, a state characterized by a density $f_t$.

At any given time, for any $A\in\mathscr{B}$ $$\mu_t(A)=\int_A f_t(x)\,\mu(dx)$$ denotes the probability that at time $t$ the state of the system is in $A$. Typically $$\int_X(\mathbb{1}_{A}\circ S_t)\, f_0\,d\mu=\int_X\mathbb{1}_A\, f_t\,d\mu$$

An observable $\mathcal{O}$ is a measurable function $\mathcal{O}:X\rightarrow\mathbb{R}$. $\mathcal{O}(x)$ characterizes some aspect of the thermodynamic system. The average value of the observable at time $t$ is $$\langle \mathcal{O}\rangle_{f_t} = \int_X\mathcal{O}(x)f_t(x)\,\mu(dx).$$ If for some density $f$ the dynamical law is $f\cdot\mu$-invariant, i.e. $\int_X(\mathbb{1}_A\circ S_t)\, f\,d\mu=\int_X\mathbb{1}_A\,f\,d\mu$ for all $t\in\mathbb{T}$ and $A\in\mathscr{B}$, then one expects some ergodicity properties: $$\lim_{t\rightarrow\infty}\frac{1}{t}\int^t_0 g\circ S_u\,du =\int_X g\,f\, d\mu\qquad f\cdot\mu-\text{a.s.}$$ Such an $f$ describes a state of thermodynamic equilibrium.
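To illustrate (my own illustration, under this ergodicity heuristic): for the logistic map $S(x)=4x(1-x)$ on $[0,1]$, the density $f(x)=1/(\pi\sqrt{x(1-x)})$ w.r.t. Lebesgue measure is invariant, and a time average along a trajectory should match the $f\,d\mu$-space average. A quick numerical check, up to floating-point effects:

```python
import numpy as np

# Logistic map S(x) = 4x(1-x); invariant density f(x) = 1/(pi*sqrt(x(1-x)))
S = lambda x: 4.0 * x * (1.0 - x)
g = lambda x: x  # observable

x, total, n = 0.3141592, 0.0, 200_000
for _ in range(n):
    total += g(x)
    x = S(x)

# Space average: integral of x/(pi*sqrt(x(1-x))) over (0,1) is 1/2
# (the invariant density is symmetric about x = 1/2)
print(total / n)  # approximately 0.5
```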

In his celebrated work Gibbs introduced the concept of index of probability for a system in state $\{f_t:t\in\mathbb{T}\}$ as $\log(f_t)$. Now the quantity \begin{align} H(f_t):=-\int_X\log(f_t(x))f_t(x)\,\mu(dx)\tag{0}\label{BG} \end{align} is called the Boltzmann-Gibbs entropy of the density $f_t$. To illustrate the intuition behind this quantity, suppose there are two thermodynamic systems $(X_j,\mathscr{F}_j,\mu_j)$, $j=1,2$, each having state (density) $f^j$. We combine these two systems to form the system $(X_1\times X_2,\mathscr{F}_1\otimes\mathscr{F}_2,\mu_1\otimes\mu_2)$ with density $f(x_1,x_2)=f^1(x_1)f^2(x_2)$ (all this means that systems 1 and 2 do not interact with each other). Then it is expected that the entropy of the combined system equals the sum of the entropies of systems 1 and 2. It is easy to check that definition \eqref{BG} satisfies this property.
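Writing this check out (a one-line computation using $\int_{X_1} f^1\,d\mu_1=\int_{X_2} f^2\,d\mu_2=1$):
$$H(f^1\otimes f^2)=-\iint \big[\log f^1(x_1)+\log f^2(x_2)\big]\,f^1(x_1)f^2(x_2)\,\mu_1(dx_1)\,\mu_2(dx_2)=H(f^1)+H(f^2).$$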

Formally, the Boltzmann-Gibbs entropy of the density $f$ w.r.t. $\mu$ is defined as $$H(f)=\int_X\eta(f(x))\,\mu(dx),\qquad \eta(w)=-w\log(w)\mathbb{1}_{(0,\infty)}(w)$$ The function $\eta$ is concave ($-\eta$ is convex) over $[0,\infty)$ and so, $$\eta(w)\leq(w-v)\eta'(v)+\eta(v)=-w\log v-(w-v),\qquad w,v>0$$ Then, for any pair of densities $f$ and $g$ such that $\eta\circ f$ and $\eta\circ g$ are $\mu$-integrable (i.e., in $L_1(\mu)$) we have that $$\begin{align} -\int_Xf(x)\log(f(x))\,\mu(dx)\leq -\int_Xf(x)\log(g(x))\,\mu(dx)\tag{1}\label{gibbs-ineq} \end{align}$$
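To get from the pointwise bound to \eqref{gibbs-ineq}, take $w=f(x)$, $v=g(x)$ and integrate:
$$-\int_X f\log f\,d\mu\;\le\;-\int_X f\log g\,d\mu-\int_X (f-g)\,d\mu\;=\;-\int_X f\log g\,d\mu,$$
since $f$ and $g$ are both densities, so $\int_X(f-g)\,d\mu=0$.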

  • It follows from \eqref{gibbs-ineq} that if $\mu(X)<\infty$, then the density $f_*(x)=\frac{1}{\mu(X)}$ maximizes the entropy amongst all densities. The density $f_*$ is a generalization of what Gibbs called the microcanonical ensemble.
  • If $\nu$ and $\gamma$ are probability measures and $\nu\ll\gamma$, the relative entropy of $\nu$ relative to $\gamma$ is defined as $$H(\nu|\gamma):=\int_X\log\big(\frac{d\nu}{d\gamma}\big)\,d\nu= \int_X\log\big(\frac{d\nu}{d\gamma}\big)\frac{d\nu}{d\gamma}\,d\gamma=-\int_X\eta\big(\frac{d\nu}{d\gamma}\big)\,d\gamma$$ Since $-\eta$ is convex, $$H(\nu|\gamma)=-\int_X\eta\big(\frac{d\nu}{d\gamma}\big)\,d\gamma\geq -\eta\Big(\int_X\frac{d\nu}{d\gamma}\,d\gamma\Big)=-\eta(1)=0$$ If in addition, $\gamma\ll\mu$ and $d\nu=f\,d\mu$, $d\gamma=g\,d\mu$ $$H(f|g):=H(f\,d\mu| g\,d\mu):=\int_X\log\big(\frac{f(x)}{g(x)}\big)\,f(x)\,\mu(dx)=-\int_X\eta\big(\frac{f}{g}\big)\,g\,d\mu $$ In statistics, $H(\nu|\gamma)$ is known as the Kullback-Leibler divergence and is denoted as $K(\nu|\gamma)$.

When $\mu$ is not finite, there are no entropy-maximizing densities. However, under some additional constraints we can find densities that maximize entropy. More concretely, suppose that for real constants $c_1,\ldots,c_k$ and observables $\mathcal{O}_1,\ldots,\mathcal{O}_k$, there are constants $\nu_1,\ldots, \nu_k$ such that

  1. $\exp\big(-\sum^k_{i=1}\nu_i\mathcal{O}_i\big)\in L_1(\mu)$,
  2. $c_j=Z^{-1}\int_X\mathcal{O}_j\exp\big(-\sum^k_{i=1}\nu_i\mathcal{O}_i\big)\,d\mu$ for each $j=1,\ldots,k$, where $Z=\int_X\exp\big(-\sum^k_{i=1}\nu_i\mathcal{O}_i\big)\,d\mu$.
  • Then, it follows from another application of \eqref{gibbs-ineq} that the density $$\begin{align} f_*=\frac{1}{Z}\exp\big(-\sum^k_{i=1}\nu_i\mathcal{O}_i\big)\tag{2}\label{gibbs-2} \end{align}$$ maximizes the entropy $f\mapsto H(f)$ among all densities $f$ such that $c_j=\langle \mathcal{O}_j\rangle_f$ for $j=1,\ldots,k$.

The normalizing factor $Z$ is known as the partition function, and the density $f_*$ generalizes the canonical ensemble of Gibbs. Notice that time does not appear in the definition of the canonical ensemble.
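A minimal numerical sketch of \eqref{gibbs-2} (my own toy example, on a finite phase space with counting measure and a single observable; all names and values are illustrative): solve for the multiplier $\nu$ so that $\langle\mathcal{O}\rangle_{f_*}=c$, then check that perturbations of $f_*$ within the constraint set only lower the entropy.

```python
import numpy as np
from scipy.optimize import brentq

# Finite phase space {0,1,2,3} with counting measure; one observable O
# and a prescribed mean c = <O>_f.
O = np.array([0.0, 1.0, 2.0, 3.0])
c = 1.2

def gibbs(nu):
    w = np.exp(-nu * O)
    return w / w.sum()          # f* = exp(-nu * O) / Z

# Solve <O>_{f*} = c for the multiplier nu
nu = brentq(lambda n: gibbs(n) @ O - c, -50.0, 50.0)
f_star = gibbs(nu)

def entropy(f):
    f = f[f > 0]
    return -np.sum(f * np.log(f))

# Perturb f* inside the constraint set: directions d with sum(d) = 0
# and d @ O = 0 preserve both normalization and the constrained mean.
A = np.vstack([np.ones_like(O), O])
_, _, Vt = np.linalg.svd(A)
for d in Vt[2:]:                 # null-space basis of the constraints
    for eps in (0.05, -0.05):
        f = f_star + eps * d
        if np.all(f >= 0):
            assert entropy(f) <= entropy(f_star) + 1e-12

print("nu =", nu, " f* =", f_star, " H(f*) =", entropy(f_star))
```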

Postulate C: There exists a one-to-one correspondence between states of thermodynamic equilibrium and the states of maximal entropy.

Postulate D: Given a (nonnegative) observable $\mathcal{O}$ and a constant $c>0$, the entropy maximizing density given by \eqref{gibbs-2} satisfying $c=\langle \mathcal{O}\rangle_{f_*}$ corresponds to a state of thermodynamic equilibrium attained physically.

If there is only one state of thermodynamic equilibrium that is attained regardless of the way in which the system starts, then it is called a globally stable equilibrium (this is related to the second law of thermodynamics).