Clarification about the KL divergence between a continuous and a discrete distribution


I was reading this blog post on Bayesian neural networks, where the author shows that if we use a product of delta functions as the variational distribution, then minimizing the loss function of a BNN is equivalent to minimizing the loss function of a standard neural network with L2 regularization. However, I have read here on mathematics.stackexchange that the KL divergence is only defined between continuous distributions, not between a discrete and a continuous distribution, so I was wondering whether the derivation is still approximately right and, if not, what changes should be made. Thanks.

[image from the blog post showing the derivation]

Best answer

Yeah, so, the maths of this is very sloppy, but that is common with this kind of exposition in applied contexts. I believe the argument the author wants to make is that if the variational posterior were a point mass, then the ELBO objective would reduce to a standard regularized one. That basic point is, I think, quite true.


To get at this somewhat sensibly, take $q_{\theta, \varepsilon}$ to be normal around $\theta$ with covariance $\varepsilon I$. Think of this $\varepsilon$ as something small that is introduced for convenience, but that we're not optimising over (and will eventually send to $0$). Then the first term becomes $$ \int \frac{1}{(\sqrt{2\pi \varepsilon})^K} e^{-\|\omega - \theta \|^2/2\varepsilon} \log p(y_i|f^\omega(x_i) ) \,\mathrm{d}\omega,$$ and the second, using the closed form for the KL between multivariate Gaussians (here against the prior $\mathcal N(0, \lambda^{-1} I)$), becomes $$ \frac{K}{2} \log \frac{1}{\lambda\varepsilon} - \frac K2 + \frac{K \lambda \varepsilon}{2} + \frac{\lambda}{2} \|\theta\|^2 = g(\varepsilon) - \frac{K}{2} \log \lambda + \frac{\lambda}{2} \|\theta\|^2 + \frac{K \lambda\varepsilon}{2}, $$ where $g(\varepsilon) := \frac{K}{2}\log\frac{1}{\varepsilon} - \frac{K}{2}$ collects the terms that involve neither $\theta$ nor $\lambda$.
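If it helps, here is a quick numeric sanity check of that closed form; this is my own sketch, not part of the post, and the values of $K$, $\varepsilon$, $\lambda$ and $\theta$ are made up. It compares the formula above with a Monte Carlo estimate of $\mathrm{KL}(q_{\theta,\varepsilon}\,\|\,\mathcal N(0,\lambda^{-1}I))$.

```python
# Sanity check (not from the post): closed-form KL between
# N(theta, eps*I) and the prior N(0, (1/lam)*I) vs a Monte Carlo estimate.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
K, eps, lam = 3, 0.1, 2.0            # made-up values
theta = np.array([0.5, -1.0, 2.0])

q = multivariate_normal(mean=theta, cov=eps * np.eye(K))
p = multivariate_normal(mean=np.zeros(K), cov=(1.0 / lam) * np.eye(K))

# Closed form: K/2 log(1/(lam*eps)) - K/2 + K*lam*eps/2 + lam/2 ||theta||^2
kl_closed = 0.5 * (K * np.log(1.0 / (lam * eps)) - K
                   + K * lam * eps + lam * theta @ theta)

# Monte Carlo: KL(q||p) = E_{w~q}[log q(w) - log p(w)]
w = theta + np.sqrt(eps) * rng.standard_normal((200_000, K))
kl_mc = np.mean(q.logpdf(w) - p.logpdf(w))

print(kl_closed, kl_mc)  # the two estimates should roughly agree
```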

Now, notice that $g(\varepsilon)$ is uninteresting from the point of view of a loss, since it doesn't interact with $\theta$, the only thing we're optimising over. In optimisation terms this is like adding a constant to the objective function, which changes nothing about the solution of the optimisation problem. So from the perspective of deriving a loss on $\theta$, it is perfectly fine to drop this term. The same is actually true of the $\frac{K\lambda\varepsilon}{2}$ and $\frac{K}{2}\log\lambda$ terms (unless they're also optimising over $\lambda$, which I don't know).

Note by the way that $g(\varepsilon)$ explodes as $\varepsilon \to 0$. This is related to the fact that the KL divergence between a discrete and a continuous distribution is $\infty$: a point mass is not absolutely continuous with respect to the Gaussian prior, so its KL divergence to the prior is $+\infty$.
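Concretely (again a small illustration of mine, not the author's), plugging shrinking values of $\varepsilon$ into $g(\varepsilon) = \frac K2 \log\frac1\varepsilon - \frac K2$ shows the blow-up.

```python
# Small illustration (mine, not the author's): the theta-independent part
# g(eps) = K/2 * log(1/eps) - K/2 diverges as eps -> 0.
import numpy as np

K = 3  # made-up dimension
for eps in [1e-1, 1e-3, 1e-6, 1e-12]:
    g = 0.5 * K * np.log(1.0 / eps) - 0.5 * K
    print(f"eps={eps:.0e}  g(eps)={g:.2f}")
```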

In any case, keeping all of the $\theta$- and $\lambda$-dependent terms, the loss we end up with takes the form $$ \mathcal{L}_\varepsilon(\theta) := \sum_i \int \frac{1}{(\sqrt{2\pi \varepsilon})^K} e^{-\|\omega - \theta \|^2/2\varepsilon} \log p(y_i|f^\omega(x_i) ) \,\mathrm{d}\omega -\frac{K}{2} \log \lambda + \frac{\lambda}{2} \|\theta\|^2 + \frac{K \lambda \varepsilon}{2}. $$

Now, as $\varepsilon \to 0$, observe that $q_{\theta, \varepsilon}$ converges to the point mass $q_\theta$ (in some sense, but that's not too important), the integral converges to $\log p(y_i|f^\theta(x_i))$ (under some smoothness assumptions), and the final term goes to $0$. This gives the form of the loss they're motivating, the important terms of which are $$ \sum_i \log p(y_i|f^\theta(x_i)) + \frac{\lambda}{2}\|\theta\|^2 - \frac{K}{2} \log \lambda.$$
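As a rough check of that limit, here is a toy example of my own construction (a one-parameter model $f^\omega(x) = \omega x$ with a Gaussian likelihood; none of this comes from the blog post). It estimates the Gaussian-smoothed log-likelihood term by Monte Carlo and compares it with the point evaluation $\log p(y\,|\,f^\theta(x))$ as $\varepsilon$ shrinks.

```python
# Toy check of the eps -> 0 limit (my own construction, not the blog's model):
# for f^w(x) = w*x with a Gaussian likelihood, the smoothed term
# E_{w ~ N(theta, eps)}[log p(y | f^w(x))] approaches log p(y | f^theta(x)).
import numpy as np

rng = np.random.default_rng(0)
theta, x, y, sigma = 1.5, 2.0, 3.2, 0.5   # made-up toy values

def log_lik(w):
    # log N(y | w*x, sigma^2), vectorised over w
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - w * x) ** 2 / (2 * sigma**2)

limit = log_lik(theta)
for eps in [1.0, 1e-1, 1e-2, 1e-4]:
    w = rng.normal(theta, np.sqrt(eps), size=200_000)
    print(f"eps={eps:.0e}  E_q[log p] ~ {log_lik(w).mean():+.4f}  limit={limit:+.4f}")
```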


Again, this is quite sloppy, but it does give something meaningful. In part the issue is that they're making the framework do something that it shouldn't: ultimately a Bayesian estimate would never be a point, but a distribution. So if you force it to be a point, some weird things are going to happen. Ideally the author would have been explicit about considerations like the above rather than hiding them away (and would have dropped other irrelevant terms, like the $\frac{K}{2}\log\lambda$ term), but maybe that distracts from the context of the writeup a bit too much.