The Pythagorean theorem for information projections states that $$D(p||q) \geq D(p||p^*) + D(p^*||q)\;\; \forall p \in {\cal S},$$ where ${\cal S}$ is a convex set of distributions and $p^* = \operatorname{argmin}_{r \in {\cal S}} D(r||q)$ is the (I-)projection of the distribution $q$ onto ${\cal S}$. Here $D(p||q) = \sum_y p(y) \log\frac{p(y)}{q(y)}$ is the Kullback–Leibler divergence between two distributions $p$ and $q$. My question is about the proof of the following fact: $$D(p||q) = D(p||p^*) + D(p^*||q)\;\; \forall p \in {\cal L},$$ i.e., the Pythagorean theorem holds with equality for all $p$ in ${\cal L}$, where ${\cal L}$ is a linear family and $p^*$ is the projection of $q$ onto ${\cal L}$.
In my course notes, they prove the above by first using the fact that the distribution $$ s = \lambda p + (1- \lambda) p^*$$ is in ${\cal L}$ for any $\lambda \in \mathbb{R}$, since both $p$ and $p^*$ are in ${\cal L}$ and this is a property of linear families. They then complete the proof by contradiction using a small $\lambda < 0$, but I don't quite understand how they proceed. Could you please help me out? Thank you very much for your time!
There are a couple of key points:

1. Since ${\cal L}$ is cut out by linear constraints, $s_\lambda := \lambda p + (1-\lambda)p^*$ sums to $1$ and satisfies those constraints for every $\lambda \in \mathbb{R}$; moreover $p^*_x > 0$ wherever $p_x > 0$, so $s_\lambda$ can only fail to be a distribution (through a negative coordinate) once $|\lambda|$ is large enough.
2. Since ${\cal L}$ is convex, the Pythagorean inequality $D(s\|q) \ge D(s\|p^*) + D(p^*\|q)$ holds for every distribution $s \in {\cal L}$.
Now, due to $1.$, there exists $t\in(0,1)$ such that $s_\lambda$ is a distribution whenever $\lambda \in (-t,t)$ (think about why). We'll use this fact below. Let
$$f(\lambda) := D(s_\lambda\|q) - D(s_\lambda\|p^*) - D(p^*\|q).$$ Note that $f(\lambda) \ge 0$ whenever $s_\lambda$ is a distribution (this is just the Pythagorean inequality), and $f(0) = 0$. Since $s_\lambda$ is a distribution for all $\lambda \in (-t,t),$ we have $f \ge 0$ on $(-t,t)$. Thus $0$ is a local minimum of $f$ in the interior of its domain, and since $f$ is differentiable, we must have $$\left.\frac{\partial}{\partial \lambda} f(\lambda)\right|_{\lambda = 0} = \sum_{x\in \mathcal{X}} (p_x - p^*_x) \log \frac{p^*_x}{q_x} = 0,$$
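To make the differentiation step completely explicit (this computation is my own addition, using only the definitions above and the convention $0 \log 0 = 0$): for any distribution $s$ supported on $\mathcal{X}$, the $s_x \log s_x$ terms cancel, giving $$D(s\|q) - D(s\|p^*) = \sum_{x\in\mathcal{X}} s_x \log \frac{p^*_x}{q_x},$$ so writing $s_\lambda = p^* + \lambda(p - p^*)$ we get $$f(\lambda) = \sum_{x\in\mathcal{X}} \bigl(p^*_x + \lambda(p_x - p^*_x)\bigr) \log \frac{p^*_x}{q_x} - D(p^*\|q) = \lambda \sum_{x\in\mathcal{X}} (p_x - p^*_x) \log \frac{p^*_x}{q_x}.$$ In particular $f$ is linear in $\lambda$, and a linear function that is nonnegative on $(-t,t)$ and zero at $0$ must have zero slope, which yields the displayed identity directly.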
where $\mathcal{X}$ is the support of $p^*$. Manipulating the above identity (using $\log \frac{p^*_x}{q_x} = \log \frac{p_x}{q_x} - \log \frac{p_x}{p^*_x}$ for each $x$ with $p_x > 0$), we get \begin{align*} &\sum_{x\in \mathcal{X}} (p_x - p^*_x) \log \frac{p^*_x}{q_x} = 0 \\ \Leftrightarrow\ &\sum_{x \in \mathcal{X}} p_x \log \frac{p^*_x}{q_x} = \sum_{x \in \mathcal{X}} p^*_x \log \frac{p^*_x}{q_x} \\ \Leftrightarrow\ &\sum_{x \in \mathcal{X}} p_x \log \frac{p_x}{q_x} - \sum_{x \in \mathcal{X}} p_x \log \frac{p_x}{p^*_x} = D(p^*\|q) \\ \Leftrightarrow\ &D(p\|q) - D(p\|p^*) = D(p^*\|q), \end{align*}
and we're done. Note that the main deal here was point $1,$ and the rest is just playing with definitions. While it isn't a terribly hard thing to prove, I'm too lazy to write a proof right now, but you can likely find it here.
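Not part of the original argument, but here is a quick numerical sanity check of the equality (a sketch; the mean-constraint family, the alphabet, and the particular distributions below are all my own choices). It uses the standard fact that the I-projection of $q$ onto a mean constraint $\{r : \mathbb{E}_r[X] = m\}$ is an exponential tilt $p^*_x \propto q_x e^{\theta x}$, with $\theta$ chosen to match the mean:

```python
import numpy as np

def kl(p, q):
    # KL divergence with the 0 log 0 = 0 convention
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Alphabet {0,1,2,3}; linear family L = {r : E_r[X] = m}
x = np.arange(4)
q = np.array([0.1, 0.2, 0.3, 0.4])
m = 1.2

def tilt_mean(theta):
    # Mean of the exponentially tilted distribution q_x e^{theta x} / Z
    w = q * np.exp(theta * x)
    w /= w.sum()
    return w @ x

# Find theta by bisection: tilt_mean is increasing in theta
lo, hi = -20.0, 20.0
for _ in range(200):
    mid = (lo + hi) / 2
    if tilt_mean(mid) < m:
        lo = mid
    else:
        hi = mid
theta = (lo + hi) / 2

# The I-projection p* of q onto L
p_star = q * np.exp(theta * x)
p_star /= p_star.sum()

# Distributions in L: both have mean exactly 1.2, as does any mixture
p1 = np.array([0.4, 0.2, 0.2, 0.2])
p2 = np.array([0.25, 0.4, 0.25, 0.1])
for lam in (0.0, 0.3, 1.0):
    p = lam * p1 + (1 - lam) * p2
    lhs = kl(p, q)
    rhs = kl(p, p_star) + kl(p_star, q)
    assert abs(lhs - rhs) < 1e-8  # Pythagorean equality on L
```

For every $p$ in the linear family the two sides agree to numerical precision; for distributions outside the family only the inequality is guaranteed.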