In the proof given by [1] on p. 92, I can't follow the second-to-last step. We have two spaces $\Omega_1$ and $\Omega_2$ and a random variable $(X, Y)$ on $\Omega_1 \times \Omega_2$; $X$ and $Y$ will in general not be independent. $k_1$ and $k_2$ are the kernels of the RKHSs $H_1$ and $H_2$ on $\Omega_1$ and $\Omega_2$, respectively.
In that step, an expectation over the joint distribution seems to be upper-bounded by the product of two marginal expectations: $$E_{XY}[\|k_1(\cdot, X)\|_{H_1} \cdot \|k_2(\cdot, Y)\|_{H_2}] \cdot C + R$$ $$\leq E_{X}[k_1(X, X)]^{1/2} \cdot E_{Y}[k_2(Y, Y)]^{1/2} \cdot C + R$$
($C$ and $R$ stand for other parts of the equation.)
The transformation of $\|k_1(\cdot, X)\|_{H_1}$ into $k_1(X, X)^{1/2}$ comes from the reproducing property, and the square root is then presumably moved outside the expectation via Jensen's inequality (since $\sqrt{\cdot}$ is concave, $E[\sqrt{Z}] \leq E[Z]^{1/2}$).
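Spelling out the reproducing-property step as I understand it (my own derivation, not from the paper):

$$\|k_1(\cdot, X)\|_{H_1}^2 = \langle k_1(\cdot, X),\, k_1(\cdot, X) \rangle_{H_1} = k_1(X, X),$$

so $\|k_1(\cdot, X)\|_{H_1} = k_1(X, X)^{1/2}$, and analogously for $k_2$.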
But in general, an expectation over the joint distribution isn't bounded by the product of the marginals in this way; for dependent variables it's easy to find counterexamples (right?). Here, however, RKHS kernels are involved, which I'm not yet very firm with.
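For what it's worth, I tried checking the inequality numerically. The setup below is entirely my own choice (a polynomial kernel, $\Omega_1 = \Omega_2 = \mathbb{R}$, and $k_1 = k_2$ for simplicity), not from the paper. Even with strongly dependent $X$ and $Y$, the bound seems to always hold:

```python
import numpy as np

# Quick numerical sanity check (mine, not from the paper): pick a
# non-trivial kernel and strongly dependent X, Y, and compare the two
# sides of the inequality empirically.
rng = np.random.default_rng(0)

def k(x, y):
    # polynomial kernel (1 + xy)^2; here k(x, x) is not constant,
    # so the check is not vacuous (unlike e.g. a Gaussian kernel)
    return (1.0 + x * y) ** 2

n = 100_000
X = rng.normal(size=n)
Y = X + 0.1 * rng.normal(size=n)  # X and Y are strongly dependent

# LHS: E[ ||k(., X)|| * ||k(., Y)|| ] = E[ k(X,X)^{1/2} * k(Y,Y)^{1/2} ]
lhs = np.mean(np.sqrt(k(X, X)) * np.sqrt(k(Y, Y)))
# RHS: E[k(X,X)]^{1/2} * E[k(Y,Y)]^{1/2}
rhs = np.sqrt(np.mean(k(X, X))) * np.sqrt(np.mean(k(Y, Y)))
print(lhs <= rhs)  # prints True
```

So the bound appears to hold despite dependence, which makes me suspect something structural is going on rather than an independence argument.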
Why is this bound possible?
[1] Fukumizu, Bach, Jordan (JMLR 2004): Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces. p. 92, proof of Theorem 1, equation (17).