Using anchored ensembling it is possible to estimate the mean $\mu$ and the variance $\sigma^2$ of an output. My insight was that if I then sampled from $N(\mu,\sigma^2)$, I could capture both the aleatoric (data) and the epistemic (model) uncertainty in one shot and use it to guide exploration. I could do this optimization with samples, but it occurred to me that propagating the uncertainty analytically through the TD error might give much lower-variance gradients.
Thanks to the uncertainty propagation page I understand how that might be done, but I am missing the last step: the cost function.
At first glance the sixth example formula ($f=aA^b$) might seem sufficient to derive the squared error, but it is clear to me that using the propagated variance directly as the cost would just drive the variance toward zero.
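For context, the rule I'm referring to gives, for $f = aA^b$, the first-order relative variance $(\sigma_f/f)^2 \approx b^2(\sigma_A/A)^2$. A quick Monte Carlo sanity check of that rule, with arbitrary placeholder values (none of these numbers come from my actual model):

```python
import random

random.seed(0)

# Placeholder parameters, purely for illustration.
a, b = 1.0, 2.0
mu_A, sigma_A = 10.0, 0.1

# First-order propagation for f = a * A**b:
# (sigma_f / f)^2 ≈ b^2 * (sigma_A / A)^2, evaluated at A = mu_A.
f_mean = a * mu_A ** b
sigma_f = abs(b) * f_mean * sigma_A / mu_A

# Monte Carlo check: sample A, push it through f, measure the spread.
n = 200_000
samples = [a * random.gauss(mu_A, sigma_A) ** b for _ in range(n)]
mc_mean = sum(samples) / n
mc_std = (sum((s - mc_mean) ** 2 for s in samples) / n) ** 0.5

print(f"propagated sigma_f = {sigma_f:.3f}, Monte Carlo std = {mc_std:.3f}")
```

The two numbers agree closely here because $\sigma_A/\mu_A$ is small, so the first-order approximation is accurate.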
How do I evaluate $E_{a \sim N(\mu_1,\sigma_1^2),\, b \sim N(\mu_2,\sigma_2^2)}\left[(a - b)^2\right]$? I am hoping for something neat that I can backprop through.
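In case it is useful to anyone answering, the sample-based estimate I would otherwise fall back on looks like this (the means and variances are placeholder values, and the two draws are independent; this is the high-variance version I am hoping a closed form would replace):

```python
import random

random.seed(0)

# Placeholder parameters for the two Gaussians (not from my actual model).
mu1, sigma1 = 1.0, 1.0
mu2, sigma2 = 0.0, 1.0

# Monte Carlo estimate of E[(a - b)^2] with a ~ N(mu1, sigma1^2)
# and b ~ N(mu2, sigma2^2), drawn independently.
n = 200_000
est = sum(
    (random.gauss(mu1, sigma1) - random.gauss(mu2, sigma2)) ** 2
    for _ in range(n)
) / n

print(f"Monte Carlo estimate of E[(a - b)^2]: {est:.3f}")
```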