I am interested in minimizing a loss function involving probability densities obtained from kernel density estimates (KDEs). One of the terms I obtain is this one: $$\int _{X}\left(\int _{Z} p( z) K( x-g_{\theta }( z))\, dz\right)^{2} dx$$
Eventually, I would like to obtain an expectation that I can turn into a Monte Carlo estimator of the gradient, in order to train a neural network model $g_{\theta}$ parametrized by $\theta$.
So I try to differentiate with respect to $\theta$: $$ \begin{align} \nabla _{\theta }\int _{X}\left(\int _{Z} p( z) K( x-g_{\theta }( z))\, dz\right)^{2} dx &= \int _{X} \nabla _{\theta }\left(\int _{Z} p( z) K( x-g_{\theta }( z))\, dz\right)^{2} dx \\ &= 2\int _{X}\left(\int _{Z} p( z) K( x-g_{\theta }( z))\, dz\right) \nabla _{\theta }\left(\int _{Z} p( z) K( x-g_{\theta }( z))\, dz\right) dx\\ &=2\int _{X}\left(\int _{Z} p( z) K( x-g_{\theta }( z))\, dz\right)\left(\int _{Z} p( z) \nabla _{\theta } K( x-g_{\theta }( z))\, dz\right) dx\\ &=2\int _{X} E_{Z}[ K( x-g_{\theta }( Z))]\, E_{Z}[ \nabla _{\theta } K( x-g_{\theta }( Z))]\, dx\\ &=2E_{X\sim E_{Z}[ K( \cdot -g_{\theta }( Z))]}[ E_{Z}[ \nabla _{\theta } K( X-g_{\theta }( Z))]]\\ &= 2E_{Z,\, X\sim E_{Z}[ K( \cdot -g_{\theta }( Z))]}[ \nabla _{\theta } K( X-g_{\theta }( Z))] \end{align} $$
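To make the last line concrete, here is the Monte Carlo estimator I have in mind, in a toy setting where everything has a closed form: $Z\sim\mathcal N(0,1)$, $g_{\theta}(z)=\theta z$, and $K$ a Gaussian kernel with bandwidth $h$, so that $E_{Z}[K(x-g_{\theta}(Z))]$ is the $\mathcal N(0,\theta^{2}+h^{2})$ density and the loss integral can be differentiated by hand. All concrete choices here are mine, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, h, n = 1.0, 0.5, 200_000

def K(u):
    # Gaussian kernel with bandwidth h (a valid density in u)
    return np.exp(-u**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)

def grad_theta_K(u, z):
    # d/dtheta K(x - theta*z) = (u / h^2) * K(u) * z, with u = x - theta*z
    return (u / h**2) * K(u) * z

# Sample X ~ E_Z[K(. - g_theta(Z))]: draw z' ~ p, eps ~ K, set x = theta*z' + eps
z_prime = rng.standard_normal(n)
eps = h * rng.standard_normal(n)
x = theta * z_prime + eps

# Independent Z for the gradient term, as in the last line of the derivation
z = rng.standard_normal(n)
grad_mc = 2 * np.mean(grad_theta_K(x - theta * z, z))

# Closed form in this toy case: loss = 1 / (2 sqrt(pi) sqrt(theta^2 + h^2)),
# hence d loss / d theta = -theta / (2 sqrt(pi) (theta^2 + h^2)^{3/2})
grad_exact = -theta / (2 * np.sqrt(np.pi) * (theta**2 + h**2) ** 1.5)

print(grad_mc, grad_exact)
```

In my experiments the sample average agrees with the closed-form gradient, which is what made me believe the derivation, but I would like a confirmation of the math itself.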
I am not completely certain that I am applying the chain rule correctly between lines $1$ and $2$; that is my first concern.
Then I am not sure about the step from the penultimate line to the last one. $E_{Z}[ K( x-g_{\theta }( Z))]$ should be a valid probability density on $X$ (assuming $K$ itself is a density integrating to one), so the penultimate line should be correct. However, I am not sure whether I can collapse the two expectations into a single one like that.
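As a sanity check on the density claim, one can verify numerically that $x \mapsto E_{Z}[K(x-g_{\theta}(Z))]$ integrates to one whenever $K$ does, even for a nonlinear $g_{\theta}$; the particular $p$, $g$ and $K$ below are arbitrary stand-ins of my choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
h = 0.3

def K(u):
    # Gaussian kernel: a valid density, integrates to 1
    return np.exp(-u**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)

def g(z):
    # arbitrary nonlinear map standing in for g_theta
    return np.tanh(z) + 0.5 * z

z = rng.standard_normal(4_000)           # Z ~ p = N(0, 1)
x_grid = np.linspace(-8.0, 8.0, 1601)    # grid wide enough to hold all the mass
dx = x_grid[1] - x_grid[0]

# q(x) = E_Z[K(x - g(Z))], approximated by a sample average over Z
q = K(x_grid[:, None] - g(z)[None, :]).mean(axis=1)

total_mass = q.sum() * dx
print(total_mass)
```

The mixture of shifted kernels keeps total mass one for every sample of $Z$, so the average does too; the only error left is quadrature.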
I don't know whether this helps, or even whether it is correct, but looking at the integral representation of the expectations, it seems that the two variables $Z$ and $X\sim E_{Z}[ K( \cdot -g_{\theta }( Z))]$ should be independent in this context, even though $Z$ appears in the definition of the density of $X$.
My intuition is that they should be independent because the $Z$ inside the inner expectation is a bound (dummy) variable: it has no actual relation to the outer $Z$.
So if someone could clear up these doubts and confirm that the derivation is correct, I would appreciate it. There is one additional question I would like to ask: assuming the two variables are indeed independent, can I gain anything by using the same sample of $Z$ for both, deliberately making them dependent?
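To make this last question concrete, here is the comparison I have in mind, reusing the same toy setting as above ($Z\sim\mathcal N(0,1)$, $g_{\theta}(z)=\theta z$, Gaussian $K$, all choices mine). The only difference between the two estimators is whether the $Z$ used to sample $X$ is reused in the gradient term:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, h, n = 1.0, 0.5, 200_000

def K(u):
    # Gaussian kernel with bandwidth h
    return np.exp(-u**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)

def grad_theta_K(u, z):
    # d/dtheta K(x - theta*z), with u = x - theta*z
    return (u / h**2) * K(u) * z

z_x = rng.standard_normal(n)                   # Z used to sample X
x = theta * z_x + h * rng.standard_normal(n)   # X ~ E_Z[K(. - theta*Z)]

z_new = rng.standard_normal(n)                 # fresh, independent Z
est_independent = 2 * np.mean(grad_theta_K(x - theta * z_new, z_new))
est_same = 2 * np.mean(grad_theta_K(x - theta * z_x, z_x))  # reuse the same Z

# closed-form gradient of the loss in this toy case
grad_exact = -theta / (2 * np.sqrt(np.pi) * (theta**2 + h**2) ** 1.5)

print(est_independent, est_same, grad_exact)
```

Running this, the two estimators give visibly different values, which is partly why I am asking whether the coupling is a free variance-reduction trick or actually changes what is being estimated.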