I've been wondering about a simple question for a few minutes, but I can't find any relevant questions or references by googling.
I want to minimize the KL divergence from $p\in \Delta(A)$, where $A$ is some discrete set, to a compact set $\Theta\subset \Delta(A)$ with $p \notin \Theta$. That is, I want to solve $$\min_{q\in\Theta}D(p||q).$$ Since $A$ is discrete, $\Theta$ and $\Delta(A)$ are just subsets of $\mathbb{R}^{|A|}$.
Intuitively, since the KL divergence measures a kind of "distance", I conjectured that it would be minimized by the $q$ closest to $p$ in Euclidean distance.
I'm not sure if I can prove it, or even if it is right at all. Does the following hold?
$$\arg\min_{q\in\Theta}D(p||q)=\arg\min_{q\in\Theta}d(p,q),$$ where $d(p,q)$ is the standard Euclidean metric.
Any answers/counter examples/references would be helpful. Thank you!
What you are doing here is essentially an information projection. Your conjecture would thus mean that the information projection coincides with the $\ell_2$-projection.
There is no reason why this should be the case. Take the following example:
Suppose we want to project $p= [\frac12, \frac16, \frac13]$ onto the family of distributions $\Theta = \{q_{\theta} = [\theta, \frac12-\theta, \frac12]: \theta \in (0, 1/2) \}$.
If you try to compute the corresponding projections, you will find that (solved numerically):
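As a sanity check, here is a minimal numerical sketch of the two projections for this example, using a simple grid search over $\theta$ (the variable names `theta_kl` and `theta_euc` are my own):

```python
import numpy as np

p = np.array([1/2, 1/6, 1/3])

# Parametrize the family q_theta = [theta, 1/2 - theta, 1/2], theta in (0, 1/2).
thetas = np.linspace(1e-6, 0.5 - 1e-6, 500_001)
q = np.stack([thetas, 0.5 - thetas, np.full_like(thetas, 0.5)], axis=1)

# KL divergence D(p || q_theta) for each theta on the grid.
kl = np.sum(p * np.log(p / q), axis=1)

# Squared Euclidean distance ||p - q_theta||^2 for each theta.
euc = np.sum((p - q) ** 2, axis=1)

theta_kl = thetas[np.argmin(kl)]    # information projection, approx. 3/8
theta_euc = thetas[np.argmin(euc)]  # Euclidean projection, approx. 5/12

print(theta_kl, theta_euc)
```

Both minimizers can also be found in closed form by setting the derivatives to zero: the KL objective reduces to maximizing $\frac12\log\theta + \frac16\log(\frac12-\theta)$, giving $\theta^\star = 3/8$, while the Euclidean objective gives $\theta^\star = 5/12$, so the two projections genuinely differ.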