I am following this paper and I cannot work out the example in Section 3.3. https://arxiv.org/abs/1802.05098
In the paper they propose the DiCE operator, and in Section 3.3 they give the following example:
Let $x \sim Ber(\theta)$ and $f(x,\theta)=x(1-\theta)+(1-x)(1+\theta)$.
The authors calculate the surrogate loss as:
$\mathcal{L} = \theta(1-\theta)+(1-\theta)(1+\theta)$
then they calculate the gradient as
$\nabla_\theta \mathcal{L} = -4 \theta + 1$
they say they "can exactly evaluate all terms". How did they do that?
Note: The surrogate loss is a function whose gradient is an unbiased estimator of $\nabla_\theta \mathbb{E}_x f(x,\theta)$. This means $\mathbb{E}_x[\nabla_\theta \mathcal{L}] = \nabla_\theta \mathbb{E}_x f(x,\theta)$
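To see that unbiasedness concretely, here is a quick Monte Carlo check (my own sketch, not the paper's code) of the score-function estimator, using the paper's $f(x,\theta)=x(1-\theta)+(1-x)(1+\theta)$; averaging per-sample gradients approaches $\nabla_\theta \mathbb{E}_x f = 1-4\theta$:

```python
import random

# Score-function (REINFORCE) gradient sample for d/dtheta E_x[f(x, theta)],
# with x ~ Ber(theta) and f(x, theta) = x*(1-theta) + (1-x)*(1+theta).
# Per sample: f * d(log p(x; theta))/dtheta  +  df/dtheta (x held fixed).
def grad_sample(x, theta):
    dlogp = 1.0 / theta if x == 1 else -1.0 / (1.0 - theta)
    f = x * (1.0 - theta) + (1 - x) * (1.0 + theta)
    df = -x + (1 - x)  # direct derivative of f w.r.t. theta, x held fixed
    return f * dlogp + df

random.seed(0)
theta = 0.5
n = 200_000
est = sum(grad_sample(1 if random.random() < theta else 0, theta)
          for _ in range(n)) / n
print(est)  # close to 1 - 4*theta = -1
```

For $\theta = 0.5$ the exact gradient is $-1$, and the sample average lands within Monte Carlo noise of it.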
The part I am confused about is the step from $f(x,\theta)$ to $\mathcal{L}$. How does $x$ disappear?
Following the steps I get to:
$\mathcal{W}_c = \{ x \}$ and $\mathcal{C} = \{ f \}$
$\mathcal{L}_{DICE} = \sum_{c \in \mathcal{C}} c \cdot DICE(\mathcal{W}_c)$ as per the definition
so
$\mathcal{L}_{DICE} = DICE(\{x\}) \cdot \big( x (1-\theta)+(1-x)(1+\theta) \big)$
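As a sanity check on this construction: the magic-box operator is $\square(\mathcal{W}) = \exp(\tau - \perp(\tau))$ with $\tau = \sum_{w \in \mathcal{W}} \log p(w;\theta)$, which evaluates to $1$ but carries the gradient $\nabla_\theta \tau$. The sketch below (my own code, not the paper's) implements it with a tiny hand-rolled forward-mode AD class and evaluates the expected gradient exactly by summing over both Bernoulli outcomes, recovering $-4\theta+1$:

```python
import math

class Dual:
    """Number carrying a value and its derivative w.r.t. theta (forward-mode AD)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def _coerce(self, o):
        return o if isinstance(o, Dual) else Dual(float(o))
    def __add__(self, o):
        o = self._coerce(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __sub__(self, o):
        o = self._coerce(o)
        return Dual(self.val - o.val, self.dot - o.dot)
    def __rsub__(self, o):
        return self._coerce(o) - self
    def __mul__(self, o):
        o = self._coerce(o)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def log(a):
    return Dual(math.log(a.val), a.dot / a.val)

def exp(a):
    v = math.exp(a.val)
    return Dual(v, v * a.dot)

def stop_grad(a):
    return Dual(a.val, 0.0)  # the ⊥ operator: same value, zero gradient

def magic_box(log_probs):
    """DiCE magic box: exp(tau - ⊥(tau)); value 1, gradient = gradient of tau."""
    tau = sum(log_probs, Dual(0.0))
    return exp(tau - stop_grad(tau))

def surrogate(x, theta):
    """L_DiCE for one sample x of Ber(theta), cost f = x(1-theta)+(1-x)(1+theta)."""
    logp = log(theta) if x == 1 else log(1 - theta)
    f = x * (1 - theta) + (1 - x) * (1 + theta)
    return magic_box([logp]) * f

theta = Dual(0.3, 1.0)  # track d/dtheta
# x takes only the values {0, 1}, so the expected gradient is an exact two-term sum:
grad = theta.val * surrogate(1, theta).dot + (1 - theta.val) * surrogate(0, theta).dot
print(grad)  # -4*0.3 + 1 = -0.2 (up to float rounding)
```

Note that `surrogate(x, theta).val` equals plain $f(x,\theta)$, since the magic box evaluates to $1$; only its gradient differs.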
The answer is actually quite simple :) doh!
Given that the surrogate's gradient is an unbiased estimator:
$\mathbb{E}_x[\nabla_\theta \mathcal{L}] = \nabla_\theta \mathbb{E}_x f(x,\theta)$
the quoted $\mathcal{L}$ is just the exact expectation of $f$. Since $f$ is linear in $x$, the expectation distributes over its terms:
$\mathcal{L} = \mathbb{E}_x f(x,\theta) = \mathbb{E}_x(x) (1-\theta)+(1-\mathbb{E}_x(x))(1+\theta)$
and with $\mathbb{E}_x(x) = \theta$ for $x \sim Ber(\theta)$:
$\mathcal{L} = \theta (1-\theta)+(1-\theta)(1+\theta)$
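Expanding the products of the paper's expression $\theta(1-\theta)+(1-\theta)(1+\theta)$ recovers the quoted gradient:
$\mathcal{L} = \theta - \theta^2 + 1 - \theta^2 = 1 + \theta - 2\theta^2$
$\nabla_\theta \mathcal{L} = 1 - 4\theta = -4\theta + 1$
And because $x$ takes only the values $0$ and $1$, every expectation here is a finite two-term sum, which is why all terms can be evaluated exactly.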