I need your advice on the following:
Say I have a target variable $Y$, some covariates $X$, and parameters $\theta_0, \theta_1$. Although $Y$ is not marginally independent of $\theta_0$, it is conditionally independent given $X$ and $\theta_1$, i.e. $Y \perp \theta_0 \mid (X, \theta_1)$. Furthermore, $\theta_0$ and $\theta_1$ are assumed to be dependent.
Now say I have access to a variational approximation of the joint posterior, $q_\phi(\theta_1, \theta_0) \approx p(\theta_1, \theta_0 \mid X, Y)$, with variational parameters $\phi$.
In order to optimize $\phi$, I can maximize the evidence lower bound (ELBO): \begin{align} \log p (Y|X) &= \log \int p(Y|X, \theta_1, \theta_0)\, p(\theta_1, \theta_0)\, d(\theta_1, \theta_0)\\ &= \log \int p(Y|X, \theta_1, \theta_0)\, \frac{p(\theta_1, \theta_0)}{q_\phi(\theta_1, \theta_0)}\, q_\phi(\theta_1, \theta_0)\, d(\theta_1, \theta_0)\\ &\geq \int \left[\log p(Y|X, \theta_1, \theta_0) - \log \frac{q_\phi(\theta_1, \theta_0)}{p(\theta_1, \theta_0)}\right] q_\phi(\theta_1, \theta_0)\, d(\theta_1, \theta_0)\\ &= \mathbb{E}_{q_\phi(\theta_1, \theta_0)}[\log p(Y|X, \theta_1, \theta_0)] - D_{KL} (q_\phi(\theta_1,\theta_0)||p(\theta_1, \theta_0))\\ &= \mathbb{E}_{q_\phi(\theta_1)}[\log p(Y|X, \theta_1)] - D_{KL} (q_\phi(\theta_1,\theta_0)||p(\theta_1, \theta_0)), \end{align} where the inequality is due to Jensen's inequality and the last equality follows from the conditional independence assumption above (i.e. $p(Y|X, \theta_1, \theta_0) = p(Y|X, \theta_1)$, so $\theta_0$ can be integrated out of the expectation).
On the other hand (leaving out some of the steps), I have that \begin{align} \log p (Y|X) &= \log \int p(Y|X, \theta_1)\, p(\theta_1)\, d\theta_1\\ &\geq \mathbb{E}_{q_\phi(\theta_1)}[\log p(Y|X, \theta_1)] - D_{KL} (q_\phi(\theta_1)||p(\theta_1))\\ &\neq \mathbb{E}_{q_\phi(\theta_1)}[\log p(Y|X, \theta_1)] - D_{KL} (q_\phi(\theta_1,\theta_0)||p(\theta_1, \theta_0)). \end{align}
I understand that these two objectives differ, since in the first case Jensen's inequality is also applied along the $\theta_0$ dimension.
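To make the difference between the two bounds concrete (this is just the standard chain rule for the KL divergence, nothing specific to my model): \begin{align} D_{KL}(q_\phi(\theta_1,\theta_0)||p(\theta_1,\theta_0)) = D_{KL}(q_\phi(\theta_1)||p(\theta_1)) + \mathbb{E}_{q_\phi(\theta_1)}\left[D_{KL}(q_\phi(\theta_0|\theta_1)||p(\theta_0|\theta_1))\right], \end{align} so the first ELBO equals the second one minus a non-negative expected conditional KL term, i.e. it is a looser (but still valid) lower bound on $\log p(Y|X)$.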
Since $\theta_0$ is "non-informative" for $Y$ in this sense, I assume that the objective $\mathbb{E}_{q_\phi(\theta_1)}[\log p(Y|X, \theta_1)] - D_{KL} (q_\phi(\theta_1)||p(\theta_1))$ should be the tighter objective and easier to optimize.
However, in my case it is intractable to compute the KL divergence between the marginals $q_\phi(\theta_1)$ and $p(\theta_1)$. So I have to resort to $D_{KL} (q_\phi(\theta_1,\theta_0)||p(\theta_1, \theta_0))$ and hence the ELBO \begin{align} \mathbb{E}_{q_\phi(\theta_1)}[\log p(Y|X, \theta_1)] - D_{KL} (q_\phi(\theta_1,\theta_0)||p(\theta_1, \theta_0)). \end{align}
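For what it's worth, here is a minimal Monte Carlo sketch of this ELBO on a toy model I made up purely for illustration (the Gaussian prior, likelihood, and factorized unit-variance variational family are my assumptions, not my actual model): $\theta_1 \sim N(0,1)$, $\theta_0|\theta_1 \sim N(\theta_1,1)$ (so the prior is dependent), and $Y|X,\theta_1 \sim N(\theta_1,1)$ (so $Y \perp \theta_0 \mid X, \theta_1$):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    # Log density of N(mean, var) evaluated at x.
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def elbo(y, mu1, mu0, n_samples=200_000):
    """Monte Carlo estimate of
    E_{q(theta1)}[log p(y|theta1)] - KL(q(theta1,theta0) || p(theta1,theta0)),
    written as a single expectation over samples from q."""
    t1 = rng.normal(mu1, 1.0, n_samples)  # q(theta1) = N(mu1, 1)
    t0 = rng.normal(mu0, 1.0, n_samples)  # q(theta0) = N(mu0, 1), factorized q
    log_lik = log_normal(y, t1, 1.0)      # p(y|theta1); theta0 drops out by cond. indep.
    log_prior = log_normal(t1, 0.0, 1.0) + log_normal(t0, t1, 1.0)  # dependent prior
    log_q = log_normal(t1, mu1, 1.0) + log_normal(t0, mu0, 1.0)
    return float(np.mean(log_lik + log_prior - log_q))

y = 1.0
log_evidence = log_normal(y, 0.0, 2.0)  # Y = theta1 + noise  =>  Y ~ N(0, 2)
print("ELBO estimate:", elbo(y, mu1=0.5, mu0=0.5), " log p(y):", log_evidence)
```

On this toy example the estimate stays below the analytic log evidence $\log N(y; 0, 2)$, consistent with the joint-KL objective being a valid, if looser, lower bound.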
Now I am wondering: am I missing anything, and are there any flaws in this ELBO? As I said, I get that minimizing the KL divergence along the $\theta_0$ dimension adds a layer of complexity, but does it make anything "incorrect"?
Thanks everybody :)