I am interested in an upper bound on mutual information that I have been encountering frequently in the statistics and probability literature. I have yet to see the "purest" form of the inequality, so I will attempt to state it first, and then I will provide some literature that leverages it. Ideally someone will be able to help me understand where this bound comes from and how it is derived.
Here is the idea: let $M = \left\{\theta_1,\theta_2,\ldots,\theta_m\right\}$ be a set of "models", and let $\phi : \mathbb{R}^{n\times m} \to \left\{1,2,\ldots,m\right\}$, so that $\phi\left(X^n\right)\in \left\{1,2,\ldots,m\right\}$, where $X^n$ may be thought of as a data (or design) matrix of $n$ observations. In other words, through the estimator $\phi$ we are trying to recover the one of the $m$ models that best describes our data.
I am interested in the quantity $I\left(\phi; X^n\right)$, where $I$ denotes mutual information. The following bounds it from above: $$ I\left(\phi; X^n\right) \leq \frac{n}{m^2} \sum_{i=1}^m \sum_{j = 1}^m S\left(\mathbb{P}_{\theta_i} || \mathbb{P}_{\theta_j}\right) $$ Here, $S\left(\cdot||\cdot\right)$ denotes the symmetric KL divergence, and $\mathbb{P}_{\theta_i}$ is the probability distribution underlying the model $\theta_i$.
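To fix notation, here is a minimal sketch of the two divergences involved, for discrete distributions; the example distributions `p` and `q` are made up for illustration:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) between discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def symmetric_kl(p, q):
    """Symmetric KL divergence S(p || q) = D(p || q) + D(q || p)."""
    return kl(p, q) + kl(q, p)

p = [0.6, 0.4]
q = [0.3, 0.7]
assert abs(symmetric_kl(p, q) - symmetric_kl(q, p)) < 1e-12  # S is symmetric
assert kl(p, p) == 0.0                                       # D vanishes on equal arguments
```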
This bound appears in Exploiting the Limits of Structure Learning via Inherent Symmetry and Information-theoretic bounds on model selection for Gaussian Markov random fields. In the former, see equation (2); in the latter, see equation (18). In the latter paper, though, I believe there is a typo, as a factor of $2n$ seems to be missing. (Please correct me if I am wrong.) It may therefore be best to consult the first paper.
Equation (2) in the He and Zhang paper describes the problem somewhat differently: for each $m\in\{1, \ldots, M\}$ we have a joint distribution $p_m(x)$ for a random vector $X$. Now, $\phi$ is a uniform random variable on the set $\{1, \ldots, M\}$. Given $\phi=m$, form a matrix $X^n = [X_1; X_2; \ldots; X_n]$, where each of the $n$ rows is an i.i.d. vector $X_i$ drawn from the distribution $p_m(x)$. So: \begin{eqnarray*} I(\phi; X^n) &=& H(X^n) - H(X^n|\phi) \\ &=& H(X_1, X_2, \ldots, X_n) - H(X_1, X_2, \ldots, X_n | \phi) \\ &\leq& H(X_1) + \ldots + H(X_n) - H(X_1, X_2, \ldots, X_n | \phi) \\ &=& H(X_1) + \ldots + H(X_n) - [H(X_1|\phi) + H(X_2|\phi) + \ldots + H(X_n|\phi)]\\ &=& \sum_{i=1}^n I(\phi; X_i) \\ &=& nI(\phi; X_1) \end{eqnarray*} where the inequality is the standard subadditivity bound on the joint entropy of a collection of random variables, the equality below it holds because $X_1, X_2, \ldots, X_n$ are conditionally independent given $\phi$, and the last equality holds because the $X_i$ are identically distributed given $\phi$.
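This tensorization step can be sanity-checked numerically on a toy example; the two binary models and the choice $n=3$ below are made up, not taken from the papers:

```python
import itertools
import numpy as np

def mutual_info(joint):
    """I(U; V) in nats from a joint pmf given as a 2-D array."""
    pu = joint.sum(axis=1, keepdims=True)   # marginal of U (rows)
    pv = joint.sum(axis=0, keepdims=True)   # marginal of V (columns)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pu @ pv)[mask])))

# Two hypothetical models over a binary alphabet; phi is uniform on {0, 1}.
models = np.array([[0.8, 0.2],
                   [0.3, 0.7]])
M, n = 2, 3

# Joint pmf of (phi, X^n): rows index phi, columns index sequences in {0,1}^n.
seqs = list(itertools.product(range(2), repeat=n))
joint_n = np.array([[np.prod([models[m][x] for x in s]) / M for s in seqs]
                    for m in range(M)])

# Joint pmf of (phi, X_1).
joint_1 = models / M

I_n = mutual_info(joint_n)
I_1 = mutual_info(joint_1)
assert I_n <= n * I_1 + 1e-12   # I(phi; X^n) <= n I(phi; X_1)
```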
Now it suffices to prove a bound on $I(\phi; X_1)$. We have $I(\phi; X_1) = H(X_1) - H(X_1|\phi)$. Then: \begin{eqnarray*} H(X_1) &=& -\sum_x p(x)\log(p(x))\\ &=&-\sum_{\phi=1}^M\sum_x\frac{1}{M}p(x|\phi)\log\left(\frac{1}{M}\sum_{\phi'=1}^Mp(x|\phi')\right)\\ &\leq&-\sum_{\phi=1}^M\sum_x\frac{1}{M}p(x|\phi)\left[\frac{1}{M}\sum_{\phi'=1}^M\log(p(x|\phi')) \right]\\ &=&\frac{1}{M^2}\sum_{\phi'=1}^M\sum_{\phi=1}^M\left[\sum_x p(x|\phi)\log\left(\frac{1}{p(x|\phi')}\right)\right] \end{eqnarray*} where the inequality uses Jensen's inequality on the concave function $\log(\cdot)$. Next:
\begin{eqnarray*} H(X_1|\phi) &=& -\sum_{\phi=1}^M\sum_x \frac{p(x|\phi)}{M}\log(p(x|\phi))\\ &=&-\frac{1}{M^2}\sum_{\phi'=1}^M\sum_{\phi=1}^M \sum_x p(x|\phi)\log(p(x|\phi)) \end{eqnarray*}
Putting these two together gives: \begin{eqnarray*} I(\phi; X_1) &=& H(X_1) - H(X_1|\phi) \\ &\leq& \frac{1}{M^2}\sum_{\phi'=1}^M\sum_{\phi=1}^M\sum_x p(x|\phi)\log\left(\frac{p(x|\phi)}{p(x|\phi')}\right)\\ &=& \frac{1}{M^2}\sum_{\phi'=1}^M\sum_{\phi=1}^MD_{KL}\left(p(x|\phi) || p(x|\phi')\right) \end{eqnarray*} where $D_{KL}(\cdot||\cdot)$ is the Kullback-Leibler divergence. Combining this with the chain of (in)equalities above yields $I(\phi; X^n) \leq \frac{n}{M^2}\sum_{\phi'=1}^M\sum_{\phi=1}^M D_{KL}\left(p(x|\phi) || p(x|\phi')\right)$, which is the desired bound.
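The single-sample bound can likewise be checked numerically; the three models over a 3-letter alphabet below are hypothetical:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) between discrete distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical conditional distributions p(x | phi); phi is uniform on them.
models = np.array([[0.7, 0.2, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.3, 0.6]])
M = len(models)

# Left side: I(phi; X_1) = H(X_1) - H(X_1 | phi).
mix = models.mean(axis=0)                                        # marginal of X_1
H_X = -np.sum(mix * np.log(mix))
H_X_given_phi = -np.mean(np.sum(models * np.log(models), axis=1))
I = H_X - H_X_given_phi

# Right side: (1/M^2) sum over all ordered model pairs of D_KL.
bound = np.mean([[kl(p, q) for q in models] for p in models])

assert I <= bound + 1e-12
```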
PS: If you want, you can express this in the way given in the He and Zhang paper:
\begin{eqnarray*} &&\frac{1}{M^2}\sum_{\phi'=1}^M\sum_{\phi=1}^MD_{KL}\left(p(x|\phi) || p(x|\phi')\right)\\ &=&\frac{1}{M^2}\sum_{\phi'=1}^M\sum_{\phi=1}^{\phi'-1} \left[D_{KL}\left(p(x|\phi) || p(x|\phi')\right) + D_{KL}\left(p(x|\phi')||p(x|\phi)\right)\right] \end{eqnarray*} with the understanding that $D_{KL}\left(p(x|\phi)||p(x|\phi)\right)=0$.
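This rewriting can be confirmed numerically, again with made-up models: the sum over all ordered pairs equals the sum of both KL directions over pairs $\phi < \phi'$, since the diagonal terms vanish.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) between discrete distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical conditional distributions p(x | phi) for M = 3 models.
models = np.array([[0.7, 0.2, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.3, 0.6]])
M = len(models)

# Sum of D_KL over all M^2 ordered pairs (phi, phi').
full = sum(kl(models[i], models[j]) for i in range(M) for j in range(M))

# Sum of the symmetric combination over pairs with phi < phi' only.
pairs = sum(kl(models[i], models[j]) + kl(models[j], models[i])
            for j in range(M) for i in range(j))

assert abs(full - pairs) < 1e-12   # diagonal terms D(p || p) = 0 drop out
```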