I have been reviewing the literature on information theoretic methods in statistics, and in particular, the method of I-projections. Given a discrete, finite alphabet $\mathcal{X}$, let $\Pi$ denote a set of probability mass functions (pmfs) on $\mathcal{X}$. Suppose $Q$ is a pmf on $\mathcal{X}$ which does not belong to $\Pi$.
The I-projection of $Q$ on $\Pi$ is defined as the element $P^{*}\in \Pi$ such that
\begin{eqnarray}P^{*}=\arg \min \limits_{P\in \Pi} D(P\|Q),\end{eqnarray} where $D(P\|Q)$ denotes the Kullback-Leibler divergence (KL-divergence) between pmfs $P$ and $Q$, defined as \begin{eqnarray}D(P\|Q)=\sum\limits_{x\in \mathcal{X}} P(x)\log\left(\frac{P(x)}{Q(x)}\right).\end{eqnarray}
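For concreteness, here is a minimal numerical sketch of this definition (assuming NumPy; it uses the usual convention $0\log(0/q)=0$, and divergences are in nats):

```python
import numpy as np

def kl_divergence(P, Q):
    """D(P||Q) in nats, with the convention 0 * log(0/q) = 0."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    support = P > 0                      # terms with P(x) = 0 contribute 0
    return float(np.sum(P[support] * np.log(P[support] / Q[support])))

# Example on a three-letter alphabet
P = [0.5, 0.3, 0.2]
Q = [1/3, 1/3, 1/3]
d = kl_divergence(P, Q)   # about 0.069 nats
```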
A sufficient condition for the existence of such a $P^{*}$ is that $\Pi$ be closed. In addition, $\Pi$ is usually taken to be convex in the literature. For example, the family
\begin{eqnarray} \mathcal{P}=\lbrace P: \sum\limits_{x\in \mathcal{X}} P(x)f_{i}(x)=\alpha_{i}, ~1\leq i\leq k\rbrace, \end{eqnarray} where the $f_{i}:\mathcal{X}\rightarrow \mathbb{R}$ are given functions and $\alpha_{i}\in \mathbb{R}$, called a linear family of pmfs, is the typical choice of $\Pi$ in the literature.
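As an illustration, the I-projection onto such a linear family can be computed numerically. The sketch below uses `scipy.optimize.minimize` as a generic constrained solver (a convenience assumption; the literature typically uses iterative scaling or the exponential-family form of the solution instead). It projects the uniform $Q$ on $\{0,1,2,3\}$ onto the pmfs with mean $2$, i.e. a single constraint with $f_{1}(x)=x$ and $\alpha_{1}=2$:

```python
import numpy as np
from scipy.optimize import minimize

def i_projection(Q, F, alphas):
    """Numerically minimize D(P||Q) over {P : sum_x P(x) f_i(x) = alpha_i}.

    Q: pmf as a length-n array; F: list of length-n arrays holding the
    values f_i(x); alphas: the corresponding targets alpha_i.
    """
    Q = np.asarray(Q, float)
    n = len(Q)

    def kl(P):
        P = np.clip(P, 1e-12, None)      # keep the log well-defined
        return float(np.sum(P * np.log(P / Q)))

    # Normalization constraint plus one linear constraint per (f_i, alpha_i)
    cons = [{"type": "eq", "fun": lambda P: P.sum() - 1.0}]
    for f, a in zip(F, alphas):
        cons.append({"type": "eq",
                     "fun": lambda P, f=np.asarray(f, float), a=a: P @ f - a})

    res = minimize(kl, x0=np.full(n, 1.0 / n),
                   bounds=[(0.0, 1.0)] * n, constraints=cons)
    return res.x

# Project the uniform pmf on {0,1,2,3} onto the pmfs with mean 2
Q = np.full(4, 0.25)
f = np.arange(4.0)
P_star = i_projection(Q, [f], [2.0])
```

The minimizer has the familiar exponential-family form $P^{*}(x)\propto Q(x)e^{\lambda x}$, which can be checked from the output: the successive ratios $P^{*}(x+1)/P^{*}(x)$ are numerically constant.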
Question: Can someone please explain the significance of using convex sets? Do things fail if the set is not convex?
The significance of convexity comes from statistical mechanics. This is mentioned in at least two places in Cover & Thomas's book (2nd edition): the first paragraph of Ch. 12 and Example 12.2.5.
However, convexity of $\mathcal{P}$ plays no role in the existence of the I-projection. Since $D(\cdot\|\cdot)$ is lower semi-continuous, the existence of the I-projection is guaranteed as long as $\mathcal{P}$ is compact; for a finite alphabet, closedness already suffices, since $\mathcal{P}$ is then a closed subset of the bounded probability simplex. What convexity does buy is uniqueness: $D(\cdot\|Q)$ is strictly convex, so over a convex set the minimizer, when it exists, is unique, whereas over a non-convex set there may be several minimizers.
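To see what can go wrong without convexity, here is a toy check (a hypothetical two-point family, for illustration only): take $Q$ uniform on a binary alphabet and $\mathcal{P}=\{(0.7,0.3),\,(0.3,0.7)\}$, which is compact (finite) but not convex. Both elements are at the same divergence from $Q$, so an I-projection exists but is not unique:

```python
import numpy as np

def kl(P, Q):
    """D(P||Q) in nats, with the convention 0 * log(0/q) = 0."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    m = P > 0
    return float(np.sum(P[m] * np.log(P[m] / Q[m])))

Q = [0.5, 0.5]
family = [[0.7, 0.3], [0.3, 0.7]]          # compact (finite), not convex
divergences = [kl(P, Q) for P in family]   # equal by symmetry:
                                           # both elements attain the minimum
```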