It's commonly said that in VAEs we use the reparameterization trick because "we can't backpropagate through a stochastic node".

It makes sense from the picture, but I found it hard to understand exactly what it means and why. Let's say $X \sim N(u, 1)$,
and we want to compute $$\frac{d X}{d u}$$ which is not possible because the sampling operation is non-differentiable. That is, we don't know how changing $u$ a little bit would affect the sample $X$ we drew.
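(For context, the reparameterization trick addresses exactly this point. A minimal numpy sketch, assuming the unit-variance setting above: rewrite the sample as $X = u + \epsilon$ with $\epsilon \sim N(0,1)$, so the randomness lives in $\epsilon$ and $dX/du = 1$ exactly.)

```python
import numpy as np

# Reparameterization sketch (assumes X ~ N(u, 1) as above).
# Instead of sampling X ~ N(u, 1) directly, write X = u + eps, eps ~ N(0, 1):
# the noise eps does not depend on u, so dX/du = 1 and gradients flow through.

rng = np.random.default_rng(0)
eps = rng.standard_normal()          # noise, drawn once, independent of u

def sample(u):
    return u + eps                   # X = u + eps  =>  dX/du = 1

def loss(x):
    return x ** 2                    # an arbitrary downstream loss f(X)

u = 1.5
# Chain rule through the reparameterized sample: df/du = f'(X) * dX/du = 2*X * 1
grad_analytic = 2.0 * sample(u)

# Finite-difference check, holding the SAME eps fixed (common random numbers):
h = 1e-6
grad_numeric = (loss(sample(u + h)) - loss(sample(u - h))) / (2 * h)

assert abs(grad_analytic - grad_numeric) < 1e-4
```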
However, consider MLE for a Gaussian, where we are trying to maximize the following quantity:
$$\sum_{i=1}^N \log p(X_i;u)$$ for which the derivative $$\frac{d \log p(X_i ; u)}{d u}$$ can be easily computed. My confusion comes from the fact that $$\frac{d \log p(X_i ; u)}{d u} = \frac{d \log p(X_i ; u)}{d X_i} \frac{d X_i}{d u}$$ by the chain rule. If we can't compute $\frac{d X_i}{d u}$, why can we compute $\frac{d \log p(X_i ; u)}{d u}$?
I think your application of the chain rule is not correct. In particular, if I understand your setting correctly, the $X_i$ are your data, which do not depend on $u$. Their log-likelihood does depend on $u$: for a constant variance it is essentially the negative MSE, $-\frac{1}{2}\sum_i (X_i-u)^2$, plus terms that don't depend on $u$. Taking the derivative then gives $\sum_i \frac{d\log p(X_i;u)}{du} = \sum_i (X_i-u)$ (up to constant factors coming from the variance).
Notice the final expression depends on $X_i$, but not because we differentiated *through* it: $dX_i/du$ is zero, since $X_i$ is data and does not depend on $u$.
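To make this concrete, here is a small numpy sketch of the answer's point, assuming unit variance so $\log p(x;u) = -\tfrac{1}{2}(x-u)^2 - \tfrac{1}{2}\log 2\pi$: the data are fixed numbers, the gradient with respect to $u$ is $\sum_i (X_i - u)$, and no derivative through $X_i$ ever appears.

```python
import numpy as np

# MLE gradient sketch (assumes X_i ~ N(u, 1), so log p(x; u) = -0.5*(x-u)^2 + const).
# The data X_i are fixed constants: dX_i/du = 0, so the chain-rule term
# through X_i vanishes and we differentiate log p directly with respect to u.

X = np.array([0.3, -1.2, 2.5, 0.8])   # observed data, held fixed
u = 0.5

def log_lik(u, X):
    return np.sum(-0.5 * (X - u) ** 2 - 0.5 * np.log(2 * np.pi))

# Analytic gradient: d/du sum_i log p(X_i; u) = sum_i (X_i - u)
grad_analytic = np.sum(X - u)

# Finite-difference check: perturb u only; X never changes.
h = 1e-6
grad_numeric = (log_lik(u + h, X) - log_lik(u - h, X)) / (2 * h)

assert abs(grad_analytic - grad_numeric) < 1e-5
```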