Is the entropy (or relative entropy) function smooth in the sense that its gradient is Lipschitz continuous?


I wonder whether the entropy (or relative entropy) function satisfies the uniform smoothness condition. Let $x \in \mathbb{R}^n_+$ be a probability distribution and $f: \mathcal{X}\rightarrow \mathbb{R}$ be given by $f(x)=\sum_{i=1}^n x_i \log(x_i)$. The function $f$ is called $L_f$-smooth if:

\begin{equation} f(y) \leq f(x) + \langle \nabla f(x), y-x \rangle + \frac{L_f}{2} \|y-x\|^2, \qquad \forall x,y \in \mathcal{X}, \end{equation} where $\mathcal{X} = \{x \in \mathbb{R}^n_+: \sum_i x_i = 1 \}$. I'm unable to show whether or not the (negative) entropy function satisfies this smoothness condition for some $L_f$. I have the same question for the relative entropy function, i.e., the KL-divergence, $KL(x,y) = \sum_{i=1}^n x_i \log\left(\frac{x_i}{y_i}\right)$.
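For concreteness, here is a minimal numerical sketch of the check I have in mind, assuming NumPy (the function names are just placeholders of mine): it evaluates the gap between the right- and left-hand sides of the inequality for random points on the simplex and a candidate constant $L$.

```python
import numpy as np

def f(x):
    # negative entropy on the simplex: f(x) = sum_i x_i * log(x_i)
    return float(np.sum(x * np.log(x)))

def grad_f(x):
    return np.log(x) + 1.0

def smoothness_gap(x, y, L):
    # RHS - LHS of  f(y) <= f(x) + <grad f(x), y - x> + (L/2) * ||y - x||^2
    rhs = f(x) + float(np.dot(grad_f(x), y - x)) + 0.5 * L * float(np.sum((y - x) ** 2))
    return rhs - f(y)

# a negative gap for some pair (x, y) would show that this particular L does not work
rng = np.random.default_rng(0)
L = 10.0
for _ in range(5):
    x, y = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
    print(smoothness_gap(x, y, L))
```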

To your first question, as to whether the negative entropy $f$ satisfies the condition over $\mathcal{X}$: it doesn't.

Let's expand the RHS of the inequality. We have $\frac{\partial f}{\partial x_{i}}=\log x_{i}+1$ for all $i$, so that $$\langle \nabla f(x), y-x \rangle=\sum_{i=1}^{n}(y_{i}-x_{i})(\log x_{i}+1)=-f(x)+\sum_{i=1}^{n}y_{i}\log x_{i},$$ where we used that $\sum_{i=1}^{n}(y_{i}-x_{i})=0$, since $y$ and $x$ are both probability distributions. The RHS is then $\sum_{i=1}^{n}y_{i}\log x_{i}+\frac{L_{f}}{2}\|y-x\|^{2}$. Subtracting $\sum_{i=1}^{n}y_{i}\log x_{i}$ from both sides shows that the inequality you're hoping for is equivalent to $$KL(y, x) \leq \frac{L_{f}}{2}\|y-x\|^{2},$$ where $KL(y,x)=\sum_{i=1}^{n}y_{i}\log \frac{y_{i}}{x_{i}}$.

Since $x$ and $y$ are probability distributions, $\|x\|_{2} \leq \|x\|_{1} = 1$ and likewise for $y$, so $\|y-x\|^{2} \leq (\|x\|+\|y\|)^{2} \leq 4$. We would therefore need $KL(y,x) \leq 2L_{f}$ for some fixed $L_{f}$ if this were to hold. However, $KL(y,x)$ can be made arbitrarily large. For instance, with $n=2$, take $y=\left(\frac{1}{2},\frac{1}{2}\right)$ and $x=(x_{1},x_{2})$ with $x_{2}=1-x_{1}$ and $x_{1} \in (0,1)$. Then $KL(y,x) \to \infty$ as $x_{1} \to 0$ (and as $x_{1} \to 1$), while $\|y-x\|^{2}$ stays bounded. Similar counterexamples arise for $n \geq 3$.
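Here is a small numerical illustration of that counterexample, just as a sketch (it assumes NumPy; the helper `kl` and the candidate constant `L_f` are mine): as $x_1 \to 0$ the left-hand side $KL(y,x)$ grows without bound while $\frac{L_f}{2}\|y-x\|^2$ stays bounded, so the inequality eventually fails for any fixed $L_f$.

```python
import numpy as np

def kl(p, q):
    # KL(p, q) = sum_i p_i * log(p_i / q_i); assumes strictly positive entries
    return float(np.sum(p * np.log(p / q)))

y = np.array([0.5, 0.5])
L_f = 10.0  # any fixed candidate smoothness constant
for eps in [1e-1, 1e-3, 1e-6, 1e-9]:
    x = np.array([eps, 1.0 - eps])
    lhs = kl(y, x)                          # grows like (1/2) * log(1/eps) as eps -> 0
    rhs = 0.5 * L_f * np.sum((y - x) ** 2)  # stays bounded by 2 * L_f
    print(f"eps={eps:.0e}  KL(y,x)={lhs:.3f}  (L_f/2)||y-x||^2={rhs:.3f}")
```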

As for the KL-divergence satisfying it, I'm not sure exactly how you want to approach it. The divergence can be viewed as a function from $\mathcal{X} \to \mathbb{R}$ if you fix some $y \in \mathcal{X}$ and let $x \mapsto KL(x,y)$, or as a function from $\mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with $(x,y) \mapsto KL(x,y)$.
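If it helps, here is a tiny sketch of the two views, assuming NumPy (the names are mine): fixing $y$ gives a one-argument function on $\mathcal{X}$, while the general form takes both arguments.

```python
import numpy as np
from functools import partial

def kl(x, y):
    # KL(x, y) = sum_i x_i * log(x_i / y_i); assumes strictly positive entries
    return float(np.sum(x * np.log(x / y)))

y_fixed = np.array([0.25, 0.25, 0.5])
kl_given_y = partial(kl, y=y_fixed)  # the map x -> KL(x, y_fixed) on the simplex

x = np.array([0.1, 0.3, 0.6])
print(kl_given_y(x))    # one-argument view
print(kl(x, y_fixed))   # two-argument view, same value
```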