Applying a Stochastic Computation Graph + DiCE operator

I am following the DiCE paper (https://arxiv.org/abs/1802.05098) and I cannot work out the example in Section 3.3.

In the paper, the authors propose the DiCE operator. Before introducing it, they give the following example in Section 3.3:


Let $x \sim Ber(\theta)$ and $f(x,\theta)=x(1-\theta)+(1-x)(1+\theta)$.

The authors calculate the surrogate loss as:

$\mathcal{L} = \theta(1-\theta)+(1-\theta)(1+\theta)$

Then they calculate the gradient as:

$\nabla_\theta \mathcal{L} = -4 \theta + 1$

They say they "can exactly evaluate all terms". How did they do that?


Note: The surrogate loss is a function whose gradient gives an unbiased estimator of $\nabla_\theta \mathbb{E}_x f(x,\theta)$. This means $\nabla_\theta \mathcal{L} = \nabla_\theta \mathbb{E}_x f(x,\theta)$

The part I am confused about is the step from $f(x,\theta)$ to $\mathcal{L}$. How does $x$ disappear?

Following the steps I get to:

$\mathcal{W} = \{ x \}$ and $\mathcal{C} = \{ f \}$

$\mathcal{L}_{DICE} = \sum_{c \in \mathcal{C}} c \cdot DICE(\mathcal{W})$ as per definition

so

$\mathcal{L}_{DICE} = DICE(x) \cdot \left[ x (1-\theta)+(1-x)(1+\theta) \right]$

There are 1 best solutions below

The answer is actually quite simple :) doh!

Given that:

$\nabla_\theta \mathcal{L} = \nabla_\theta \mathbb{E}_x f(x,\theta)$

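Then we can take $\mathcal{L} = \mathbb{E}_x f(x,\theta)$ (equal gradients only fix $\mathcal{L}$ up to an additive constant, which does not matter here)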

so if

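  • $x \sim Ber(\theta)$
  • $f(x,\theta)=x(1-\theta)+(1-x)(1+\theta)$
  • and $\mathbb{E}_x[x] = \theta$ (the mean of a $Ber(\theta)$ variable)

then, by linearity of expectation,

$\mathcal{L} =\mathbb{E}_x f(x,\theta) = \mathbb{E}_x(x) (1-\theta)+(1-\mathbb{E}_x(x))(1+\theta)$

$\mathcal{L} = \theta (1-\theta)+(1-\theta)(1+\theta)$

So $x$ disappears because it is averaged out. Expanding gives $\mathcal{L} = 1 + \theta - 2\theta^2$, and therefore $\nabla_\theta \mathcal{L} = -4\theta + 1$, which is the gradient quoted in the question.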
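
For anyone who wants to check this numerically, here is a minimal sketch in plain Python (no autodiff library). It assumes the second term of $f$ is $(1-x)(1+\theta)$, which is the form consistent with the $\mathcal{L}$ and gradient quoted in the question. All function names are my own and do not come from the paper. It enumerates $x \in \{0, 1\}$ to compute $\mathbb{E}_x f(x,\theta)$ exactly, then compares three versions of the gradient: the closed form $-4\theta + 1$, a finite difference of the expectation, and the expected value of the score-function term $f \, \nabla_\theta \log p(x;\theta) + \nabla_\theta f$, which is what differentiating $DICE(x) \cdot f$ once should give.

```python
def f(x, theta):
    # Cost from the example, assuming the second term uses (1 + theta).
    return x * (1 - theta) + (1 - x) * (1 + theta)


def prob(x, theta):
    # Bernoulli(theta) probability mass for x in {0, 1}.
    return theta if x == 1 else 1 - theta


def expected_loss(theta):
    # Exact expectation over the two outcomes of x; this is where x "disappears".
    return sum(prob(x, theta) * f(x, theta) for x in (0, 1))


def df_dtheta(x, theta):
    # Direct dependence of f on theta: -x + (1 - x).
    return 1 - 2 * x


def dlogp_dtheta(x, theta):
    # Score function of the Bernoulli: d/dtheta log p(x; theta).
    return 1 / theta if x == 1 else -1 / (1 - theta)


def expected_estimator_gradient(theta):
    # E_x[ f * dlogp/dtheta + df/dtheta ], evaluated exactly by enumeration.
    return sum(
        prob(x, theta) * (f(x, theta) * dlogp_dtheta(x, theta) + df_dtheta(x, theta))
        for x in (0, 1)
    )


theta = 0.3
h = 1e-6
closed_form = -4 * theta + 1
finite_diff = (expected_loss(theta + h) - expected_loss(theta - h)) / (2 * h)
estimator = expected_estimator_gradient(theta)

print("L(theta)          =", expected_loss(theta))
print("closed-form grad  =", closed_form)
print("finite-diff grad  =", finite_diff)
print("E[score estimator] =", estimator)
```

The three gradient values should agree up to floating-point error. Note that dropping the `df_dtheta` term would give a different number, because $f$ depends on $\theta$ directly as well as through $x$.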