I'm trying to understand the proof that $I(X;Y)$ is convex in conditional distribution $p(y \mid x)$ - from Elements of Information Theory by Cover & Thomas, theorem 2.7.4.
In the proof we fix $p(x)$ and consider two conditionals $p_1 (y \mid x)$ and $p_2 (y \mid x)$. The corresponding joints are $p_1(x, y) = p(x) \, p_1 (y \mid x)$ and $p_2(x, y) = p(x) \, p_2 (y \mid x)$, with marginals $p_1(y)$ and $p_2(y)$.
Then we consider conditional $p^*(y \mid x) = \lambda \, p_1 (y \mid x) + (1 - \lambda) \, p_2 (y \mid x)$, joint $p^*(x, y) = \lambda \, p_1 (x, y) + (1 - \lambda) \, p_2 (x, y) = \lambda \, p(x) \, p_1 (y \mid x) + (1 - \lambda) \, p(x) \, p_2 (y \mid x)$ and marginal $p^*(y) = \lambda \, p_1 (y) + (1 - \lambda) \, p_2 (y)$.
If we let $q^*(x, y) = p(x) \, p^*(y) = \lambda \, p(x) \, p_1 (y) + (1 - \lambda) \, p(x) \, p_2 (y)$, then the KL divergence between $p^*(x, y)$ and $q^*(x, y)$ is $D \big( p^*(x, y) \ || \ q^*(x, y) \big) = D \big( p^*(x, y) \ || \ p(x) \, p^*(y) \big) = I(X; Y)$ - the mutual information of the pair $(X, Y) \sim p^*(x, y)$, whose marginals are $p(x)$ and $p^*(y)$.
Next in the book they conclude the proof by saying that since $D( p \ || \ q)$ is convex in $(p, q)$, so is $I(X;Y)$. What I don't understand is why it means that $I(X; Y)$ is convex in $p(y \mid x)$?
That $D(u||v)$ is convex in the pair $(u,v)$ means that
$$D(u_{\lambda}||v_{\lambda}) \le \lambda D(u_{1}||v_{1}) +(1-\lambda)D(u_{2}||v_{2}) $$
(here $u_1, u_2, v_1, v_2$ are probability distributions, with $u_\lambda = \lambda u_1 + (1-\lambda) u_2$ and $v_\lambda = \lambda v_1 + (1-\lambda) v_2$)
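For completeness, this pairwise convexity of $D$ (Theorem 2.7.2 in Cover & Thomas) is itself a consequence of the log-sum inequality $\sum_i a_i \log\frac{a_i}{b_i} \ge \big(\sum_i a_i\big)\log\frac{\sum_i a_i}{\sum_i b_i}$: at each point $t$, take $a_1 = \lambda u_1(t)$, $a_2 = (1-\lambda) u_2(t)$, $b_1 = \lambda v_1(t)$, $b_2 = (1-\lambda) v_2(t)$, which gives
$$u_\lambda(t) \log \frac{u_\lambda(t)}{v_\lambda(t)} \le \lambda \, u_1(t) \log \frac{u_1(t)}{v_1(t)} + (1-\lambda) \, u_2(t) \log \frac{u_2(t)}{v_2(t)},$$
and summing over $t$ yields the displayed inequality.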
Then $$I(X_{\lambda};Y_{\lambda}) = D(p_{\lambda}(X,Y)||p_{\lambda}(X)p_\lambda(Y))$$ where (as shown in the text)
$p_{\lambda}(x,y) = p(x)\, p_\lambda(y \mid x)=\lambda p_1(x,y)+(1-\lambda)p_2(x,y)= p(x) \big(\lambda p_1(y \mid x)+(1-\lambda)p_2(y \mid x)\big)$
I.e.: the "mixture" can be written in terms of the variable $p(y|x)$ as well as in terms of the joint $p(x,y)$ - and the same happens for the marginal $p(y)$. Then we can also write
$p_{\lambda}(x)\,p_\lambda(y) = \lambda p(x)p_1(y)+(1-\lambda)p(x)p_2(y)$ (note $p_\lambda(x) = p(x)$, since the input distribution is held fixed)
Then, plugging this into the second equation and using the first inequality:
$$I(X_{\lambda};Y_{\lambda}) \le \lambda I(X_1;Y_1) + (1-\lambda) I(X_2;Y_2)$$
which means that $I(X;Y)$ is convex with respect to the mixed variable - in this case, the conditional $p(y \mid x)$.
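As a quick numerical sanity check of the conclusion, here is a sketch in Python with NumPy (the alphabet sizes, Dirichlet priors, and seed are arbitrary choices):

```python
import numpy as np

def mutual_info(px, pygx):
    """I(X;Y) in nats, for input marginal px[x] and channel pygx[x, y]."""
    pxy = px[:, None] * pygx          # joint p(x, y)
    py = pxy.sum(axis=0)              # output marginal p(y)
    prod = px[:, None] * py[None, :]  # product of marginals p(x) p(y)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / prod[mask])))

rng = np.random.default_rng(0)
px = rng.dirichlet(np.ones(4))             # fixed input distribution p(x)
p1 = rng.dirichlet(np.ones(5), size=4)     # channel 1: row x is p_1(y|x)
p2 = rng.dirichlet(np.ones(5), size=4)     # channel 2: row x is p_2(y|x)

# I(X;Y) along the segment never exceeds the chord between the endpoints
for lam in np.linspace(0, 1, 11):
    pl = lam * p1 + (1 - lam) * p2         # mixed channel p_lambda(y|x)
    lhs = mutual_info(px, pl)
    rhs = lam * mutual_info(px, p1) + (1 - lam) * mutual_info(px, p2)
    assert lhs <= rhs + 1e-12
```

Every random pair of channels will satisfy the inequality, but of course this only illustrates the theorem; the proof above is what establishes it.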