I am following this paper and I cannot work out the example in Section 3.3. https://arxiv.org/abs/1802.05098
In the paper they propose the DiCE operator, and in Section 3.3 they give the following example:
Let $x \sim Ber(\theta)$ and $f(x,\theta)=x(1-\theta)+(1-x)(1+\theta)$.
The authors calculate the surrogate loss as:
$\mathcal{L} = \theta(1-\theta)+(1-\theta)(1+\theta)$
then they calculate the gradient as
$\nabla_\theta \mathcal{L} = -4 \theta + 1$
they say they "can exactly evaluate all terms". How did they do that?
Note: The surrogate loss is a function whose gradient is an unbiased estimator of $\nabla_\theta \mathbb{E}_x f(x,\theta)$. This means $\mathbb{E}_x[\nabla_\theta \mathcal{L}] = \nabla_\theta \mathbb{E}_x f(x,\theta)$
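To see that unbiasedness concretely, here is a quick Monte Carlo check (my own sketch, not the paper's code) of the score-function estimator, using the paper's $f(x,\theta)=x(1-\theta)+(1-x)(1+\theta)$; averaging per-sample gradients approaches $\nabla_\theta \mathbb{E}_x f = 1-4\theta$:

```python
import random

# Score-function (REINFORCE) gradient sample for d/dtheta E_x[f(x, theta)],
# with x ~ Ber(theta) and f(x, theta) = x*(1-theta) + (1-x)*(1+theta).
# Per sample: f * d(log p(x; theta))/dtheta  +  df/dtheta (x held fixed).
def grad_sample(x, theta):
    dlogp = 1.0 / theta if x == 1 else -1.0 / (1.0 - theta)
    f = x * (1.0 - theta) + (1 - x) * (1.0 + theta)
    df = -x + (1 - x)  # direct derivative of f w.r.t. theta, x held fixed
    return f * dlogp + df

random.seed(0)
theta = 0.5
n = 200_000
est = sum(grad_sample(1 if random.random() < theta else 0, theta)
          for _ in range(n)) / n
print(est)  # close to 1 - 4*theta = -1
```

For $\theta = 0.5$ the exact gradient is $-1$, and the sample average lands within Monte Carlo noise of it.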
The part I am confused about is the step from $f(x,\theta)$ to $\mathcal{L}$. How does $x$ disappear?
Following the steps I get to:
$\mathcal{W}_c = \{ x \}$ and $\mathcal{C} = \{ f \}$
$\mathcal{L}_{DICE} = \sum_{c \in \mathcal{C}} c \cdot DICE(\mathcal{W}_c)$ as per the definition
so
$\mathcal{L}_{DICE} = DICE(\{x\}) \cdot \big( x (1-\theta)+(1-x)(1+\theta) \big)$
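As a sanity check on this construction: the magic-box operator is $\square(\mathcal{W}) = \exp(\tau - \perp(\tau))$ with $\tau = \sum_{w \in \mathcal{W}} \log p(w;\theta)$, which evaluates to $1$ but carries the gradient $\nabla_\theta \tau$. The sketch below (my own code, not the paper's) implements it with a tiny hand-rolled forward-mode AD class and evaluates the expected gradient exactly by summing over both Bernoulli outcomes, recovering $-4\theta+1$:

```python
import math

class Dual:
    """Number carrying a value and its derivative w.r.t. theta (forward-mode AD)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def _coerce(self, o):
        return o if isinstance(o, Dual) else Dual(float(o))
    def __add__(self, o):
        o = self._coerce(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __sub__(self, o):
        o = self._coerce(o)
        return Dual(self.val - o.val, self.dot - o.dot)
    def __rsub__(self, o):
        return self._coerce(o) - self
    def __mul__(self, o):
        o = self._coerce(o)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def log(a):
    return Dual(math.log(a.val), a.dot / a.val)

def exp(a):
    v = math.exp(a.val)
    return Dual(v, v * a.dot)

def stop_grad(a):
    return Dual(a.val, 0.0)  # the ⊥ operator: same value, zero gradient

def magic_box(log_probs):
    """DiCE magic box: exp(tau - ⊥(tau)); value 1, gradient = gradient of tau."""
    tau = sum(log_probs, Dual(0.0))
    return exp(tau - stop_grad(tau))

def surrogate(x, theta):
    """L_DiCE for one sample x of Ber(theta), cost f = x(1-theta)+(1-x)(1+theta)."""
    logp = log(theta) if x == 1 else log(1 - theta)
    f = x * (1 - theta) + (1 - x) * (1 + theta)
    return magic_box([logp]) * f

theta = Dual(0.3, 1.0)  # track d/dtheta
# x takes only the values {0, 1}, so the expected gradient is an exact two-term sum:
grad = theta.val * surrogate(1, theta).dot + (1 - theta.val) * surrogate(0, theta).dot
print(grad)  # -4*0.3 + 1 = -0.2 (up to float rounding)
```

Note that `surrogate(x, theta).val` equals plain $f(x,\theta)$, since the magic box evaluates to $1$; only its gradient differs.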
The answer is actually quite simple :) doh!
Given that the surrogate's gradient is an unbiased estimator:
$\mathbb{E}_x[\nabla_\theta \mathcal{L}] = \nabla_\theta \mathbb{E}_x f(x,\theta)$
the quoted $\mathcal{L}$ is just the exact expectation of $f$. Since $f$ is linear in $x$, the expectation distributes over its terms:
$\mathcal{L} = \mathbb{E}_x f(x,\theta) = \mathbb{E}_x(x) (1-\theta)+(1-\mathbb{E}_x(x))(1+\theta)$
and with $\mathbb{E}_x(x) = \theta$ for $x \sim Ber(\theta)$:
$\mathcal{L} = \theta (1-\theta)+(1-\theta)(1+\theta)$
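Expanding the products of the paper's expression $\theta(1-\theta)+(1-\theta)(1+\theta)$ recovers the quoted gradient:
$\mathcal{L} = \theta - \theta^2 + 1 - \theta^2 = 1 + \theta - 2\theta^2$
$\nabla_\theta \mathcal{L} = 1 - 4\theta = -4\theta + 1$
And because $x$ takes only the values $0$ and $1$, every expectation here is a finite two-term sum, which is why all terms can be evaluated exactly.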