Clarification about the KL divergence between a continuous and a discrete distribution

I was reading a blog post on Bayesian neural networks in which the author shows that if we use a product of delta functions as the variational distribution, then minimizing the loss function of a BNN is equivalent to minimizing the loss function of a standard neural network with L2 regularization. However, I have read here on mathematics.stackexchange that the KL divergence is only defined between two continuous distributions, not between a discrete and a continuous distribution, so I was wondering whether the derivation is still at least approximately right, and if not, what changes should be made. Thanks.
There is 1 answer below.

Yeah, so, the maths of this is very sloppy, but that is common in this kind of exposition in applied contexts. The argument I believe the author wants to make is that if the variational posterior were a point mass, then the ELBO objective would reduce to a standard regularized loss. That basic point is, I think, essentially correct.
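To make the two terms used below explicit: the objective being minimised is the negative ELBO
$$ -\sum_i \mathbb{E}_{\omega \sim q}\big[\log p(y_i|f^\omega(x_i))\big] \;+\; \mathrm{KL}\big(q \,\|\, p(\omega)\big), $$
where, judging from the KL expression that appears below, the prior on the $K$ weights is taken to be Gaussian, $p(\omega) = \mathcal{N}(0, \lambda^{-1} I)$. (If the blog post uses a different prior, the constants below change, but the structure of the argument does not.)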
To get at this somewhat sensibly, take $q_{\theta, \varepsilon}$ to be normal around $\theta$ with covariance $\varepsilon I$. Think of this $\varepsilon$ as something small introduced for convenience, which we are not optimising over (and will eventually send to $0$). Then the expected log-likelihood term becomes $$ \int \frac{1}{(\sqrt{2\pi \varepsilon})^K} e^{-\|\omega - \theta \|^2/(2\varepsilon)} \log p(y_i|f^\omega(x_i) ) \,\mathrm{d}\omega,$$ and the KL term becomes (this is the closed-form KL between multivariate Gaussians) $$ \frac{K}{2} \log \frac{1}{\lambda\varepsilon} - \frac K2 + \frac{K \lambda \varepsilon}{2} + \frac{\lambda}{2} \|\theta\|^2 = g(\varepsilon) - \frac{K}{2} \log \lambda + \frac{\lambda}{2} \|\theta\|^2 + \frac{K \lambda\varepsilon}{2}, $$ where $g(\varepsilon) := \frac{K}{2} \log \frac{1}{\varepsilon} - \frac{K}{2}$ collects the pieces that depend on neither $\theta$ nor $\lambda$.
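For reference, this uses the standard closed form
$$ \mathrm{KL}\big(\mathcal{N}(\mu_0,\Sigma_0)\,\|\,\mathcal{N}(\mu_1,\Sigma_1)\big) = \frac12\Big[\operatorname{tr}(\Sigma_1^{-1}\Sigma_0) + (\mu_1-\mu_0)^\top\Sigma_1^{-1}(\mu_1-\mu_0) - K + \log\frac{\det\Sigma_1}{\det\Sigma_0}\Big], $$
specialised to $\mu_0=\theta$, $\Sigma_0=\varepsilon I$, $\mu_1=0$, $\Sigma_1=\lambda^{-1}I$, which gives
$$ \frac12\Big[K\lambda\varepsilon + \lambda\|\theta\|^2 - K + K\log\frac{1}{\lambda\varepsilon}\Big], $$
i.e. exactly the expression above.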
Now, notice that $g(\varepsilon)$ is uninteresting from the point of view of a loss, since it doesn't interact with $\theta$, the only thing we're optimising over. In optimisation terms it's like adding a constant to the objective function, which doesn't change anything about the solution of the optimisation problem. So from the perspective of deriving a loss on $\theta$, it is perfectly fine to drop this term. The same is actually true of the $\frac{K\lambda\varepsilon}{2}$ and $-\frac{K}{2}\log\lambda$ terms (unless they're also optimising over $\lambda$, which I don't know).
Note by the way that $g(\varepsilon)$ explodes as $\varepsilon \to 0$. This is related to the fact that the KL divergence between a discrete and continuous distribution is $\infty$.
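Concretely, $g(\varepsilon) = \frac{K}{2}\log\frac{1}{\varepsilon} - \frac{K}{2} \to +\infty$ as $\varepsilon \to 0$, which is the finite-$\varepsilon$ shadow of the fact that $\mathrm{KL}\big(\delta_\theta \,\|\, \mathcal{N}(0,\lambda^{-1}I)\big) = +\infty$: a point mass is not absolutely continuous with respect to the Gaussian prior, so the divergence is taken to be infinite. This is exactly the issue raised in the question, and it is why this constant has to be discarded rather than passed to the limit.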
In any case, keeping all the $\theta$- and $\lambda$-dependent terms, the loss (the negative ELBO with $g(\varepsilon)$ dropped) takes the form $$ \mathcal{L}_\varepsilon(\theta) := -\sum_i \int \frac{1}{(\sqrt{2\pi \varepsilon})^K} e^{-\|\omega - \theta \|^2/(2\varepsilon)} \log p(y_i|f^\omega(x_i) ) \,\mathrm{d}\omega -\frac{K}{2} \log \lambda + \frac{\lambda}{2} \|\theta\|^2 + \frac{K \lambda \varepsilon}{2}. $$
Now, as $\varepsilon \to 0$, observe that $q_{\theta, \varepsilon}$ converges to $q_\theta$ (weakly, though that's not too important here), each integral converges to $\log p(y_i|f^\theta(x_i))$ (under mild smoothness assumptions; a second-order Taylor expansion of $\omega \mapsto \log p(y_i|f^\omega(x_i))$ around $\theta$ shows the error is $O(\varepsilon)$ when this map has bounded second derivatives), and the final term goes to $0$. This gives the form of the loss they're motivating, the important terms of which are $$ -\sum_i \log p(y_i|f^\theta(x_i)) + \frac{\lambda}{2} \|\theta\|^2 - \frac{K}{2} \log \lambda,$$ i.e. the usual negative log-likelihood plus an L2 penalty, up to a $\theta$-independent constant.
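If it helps, here is a minimal numerical sketch of this limit on a toy one-parameter model (the model, data and hyperparameters are made up for illustration and are not from the blog post): it checks that $\mathcal{L}_\varepsilon(\theta)$ approaches the L2-regularised negative log-likelihood above as $\varepsilon \to 0$, while the discarded constant $g(\varepsilon)$ diverges.

```python
# A small numerical sanity check of the epsilon -> 0 limit, on a toy
# one-parameter "network" f^w(x) = tanh(w * x) with a Gaussian likelihood.
# Everything here (model, data, hyperparameters) is made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

# toy data and hyperparameters
x = np.array([0.5, -1.0, 2.0])
y = np.array([0.3, -0.6, 0.9])
sigma2 = 0.25   # likelihood variance: p(y | f) = N(y; f, sigma2)
lam = 2.0       # prior precision: p(w) = N(0, 1/lam)
theta = 0.7     # variational mean
K = 1           # number of weights

def log_lik(w):
    """sum_i log N(y_i; tanh(w x_i), sigma2) for a scalar weight w."""
    f = np.tanh(w * x)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (y - f) ** 2 / (2 * sigma2))

def expected_log_lik(eps, n=200_000):
    """Monte Carlo estimate of E_{w ~ N(theta, eps)}[ sum_i log p(y_i | f^w(x_i)) ]."""
    w = theta + np.sqrt(eps) * rng.standard_normal(n)
    f = np.tanh(np.outer(w, x))                                  # shape (n, len(x))
    ll = -0.5 * np.log(2 * np.pi * sigma2) - (y - f) ** 2 / (2 * sigma2)
    return ll.sum(axis=1).mean()

def kl(eps):
    """Closed-form KL( N(theta, eps I) || N(0, (1/lam) I) )."""
    return 0.5 * (K * np.log(1.0 / (lam * eps)) - K + K * lam * eps + lam * theta ** 2)

def g(eps):
    """The theta- and lambda-free part of the KL, which blows up as eps -> 0."""
    return 0.5 * K * np.log(1.0 / eps) - 0.5 * K

# the limiting L2-regularised loss (keeping -K/2 log(lam) so the numbers are comparable)
target = -log_lik(theta) + 0.5 * lam * theta ** 2 - 0.5 * K * np.log(lam)
print(f"limit (L2-regularised NLL): {target:.4f}")

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    # L_eps(theta): negative ELBO with the diverging constant g(eps) dropped
    loss_eps = -expected_log_lik(eps) + kl(eps) - g(eps)
    print(f"eps = {eps:7.0e}   L_eps(theta) = {loss_eps:.4f}   g(eps) = {g(eps):.2f}")
```

Running this, the printed $\mathcal{L}_\varepsilon(\theta)$ values settle towards the target as $\varepsilon$ shrinks (up to Monte Carlo noise), while $g(\varepsilon)$ grows without bound, which is all the limiting argument above claims.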
Again, this is quite sloppy, but it does give something meaningful. Part of the issue is that they're making the framework do something it shouldn't: ultimately a Bayesian estimate would never be a point, it would be a distribution, so if you force it to be a point, some weird things are going to happen. Ideally the author would have been explicit about considerations such as the above rather than hiding them away (and would have dropped the other irrelevant terms, like $-\frac{K}{2}\log\lambda$), but maybe that distracts from the purpose of the writeup a bit too much.