Derivation of variance-reduced gradient estimate


In this paper by Ilyas et al., there is a variance-reduced gradient estimate $$\nabla \mathbb{E}[F(\theta)] \approx \frac{1}{\sigma n}\sum_{i=1}^{n}\delta_{i} F(\theta + \sigma \delta_{i})$$ for $\delta_i \sim \mathcal{N}(0, I)$. Specifically, this is in the case of antithetic sampling, where $\delta_{j} = -\delta_{n-j+1}$ (as a means of reducing variance in the gradient estimate). I am trying to rederive this approximation.
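For concreteness, the estimator above can be sketched in NumPy. This is a minimal sketch under my own assumptions (the helper name `nes_grad` and the interface are hypothetical, not from the paper): `F` maps a 1-D array to a scalar, and `n` is even so each $\delta_i$ is paired with its negation.

```python
import numpy as np

def nes_grad(F, theta, sigma, n, rng=None):
    """Antithetic gradient estimate (1/(sigma*n)) * sum_i delta_i * F(theta + sigma*delta_i).

    Hypothetical helper, not from the paper. F maps a 1-D array to a scalar;
    n must be even so each delta is paired with its negation.
    """
    rng = np.random.default_rng(rng)
    half = rng.standard_normal((n // 2, theta.size))
    deltas = np.concatenate([half, -half])      # delta_j = -delta_{n-j+1} pairing
    values = np.array([F(theta + sigma * d) for d in deltas])
    return deltas.T @ values / (sigma * n)      # shape (d,)
```

For a linear $F$, the constant offset $F(\theta)$ is cancelled exactly by the antithetic pairs (since the paired $\delta$'s sum to zero), which is the variance-reduction effect being asked about.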

From what I can tell, this estimate is probably derived via finite-difference, so we have something like $$\nabla \mathbb{E}[F(\theta)] \approx \frac{1}{n}\sum_{i=1}^{n}\frac{F(\theta + \sigma \delta_{i}) - F(\theta)}{\sigma \delta_{i}},$$ and with antithetic sampling resulting in sign changes, the $F(\theta)$ terms cancel each other out.

But the math doesn't quite work out, and I can't tell how to pull $\delta_{i}$ up into the numerator (since it is not necessarily a unit vector). I have looked up terms like "finite-order gradient estimate" and "gradient estimation techniques" and have only found recent research that builds on results such as this one, so I can't help but feel this derivation has since passed into tribal knowledge.

How do you derive this result?

Accepted answer:

I believe it's simpler than you think: the result itself is straightforward, just obscured by confusing notation.

For background: \begin{align} \mathbb{E}_{\theta\sim\pi(\theta|x)}\left[ F(\theta) \right] &= \int F(\theta)\pi(\theta|x) \,\text{d}\theta \tag{1} \\ \nabla_x \mathbb{E}_{\theta\sim\pi(\theta|x)}\left[ F(\theta) \right] &= \mathbb{E}_{\theta\sim\pi(\theta|x)}\left[ F(\theta)\nabla_x\log \pi(\theta|x) \right] \tag{2} \end{align} using the REINFORCE log-derivative trick. Note that $\delta\sim\mathcal{N}(0,I)$, and the antithetic sampling does not change this marginal distribution (each $\delta_i$ is still standard normal). Since $\theta = x +\sigma\delta$, we have $\pi(\theta|x) = \mathcal{N}(x,\sigma^2 I)$, so \begin{align} \log \pi(\theta|x) &= -\frac{k}{2}\log(2\pi) - \frac{1}{2}\log\det(\sigma^2 I) -\frac{1}{2}(\theta - x)^T\sigma^{-2}(\theta-x) \\ \nabla_x \log \pi(\theta|x) &= -\frac{\sigma^{-2}}{2} \nabla_x\sum_i (\theta_i - x_i)^2 \\ &= {\sigma^{-2}} (\theta - x) = \sigma^{-1}\delta. \end{align} Nice. Now apply a Monte Carlo approximation to equation (2) with $n$ samples. This is where the confusing notation comes into play: before, $x$ was fixed (deterministic) and $\theta = x + \sigma\delta$ was a random variable, but in the paper's final formula a fixed single point is again called $\theta$ (instead of just calling it $x$). I will rewrite it to make it clearer (mostly for myself): $$ \mathbb{E}_{\theta\sim\pi(\theta|x)}\left[ F(\theta)\nabla_x\log \pi(\theta|x) \right] \approx \frac{1}{n}\sum_{j=1}^n F(\theta_j)\, \sigma^{-1}\delta_j = \frac{1}{n \sigma}\sum_{j=1}^n F(x + \sigma\delta_j)\, \delta_j $$ where $j$ indexes the samples $\theta_j = x + \sigma\delta_j$. This makes much more sense: in the paper's notation, $x$ is simply renamed to $\theta$. Notice that in equation (2) the random $\theta$ is integrated out (i.e., marginalized over, by taking the expectation), so it should not appear in the final expression.
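The Monte Carlo approximation of equation (2) can be checked numerically. The test setup below is my own (not from the paper): I take $F(\theta) = \|\theta\|^2$, for which $\mathbb{E}_{\theta\sim\mathcal{N}(x,\sigma^2 I)}[F(\theta)] = \|x\|^2 + \sigma^2 d$, so $\nabla_x \mathbb{E}[F] = 2x$, and verify that $\frac{1}{n\sigma}\sum_j F(x+\sigma\delta_j)\delta_j$ recovers it.

```python
import numpy as np

# Sanity check of the score-function estimator (my own test, not from the paper).
# For F(theta) = ||theta||^2 and theta ~ N(x, sigma^2 I):
#   E[F] = ||x||^2 + sigma^2 * d,  so  grad_x E[F] = 2x.
rng = np.random.default_rng(0)
sigma, n = 0.1, 20_000
x = np.array([1.0, -0.5, 2.0, 0.0])

half = rng.standard_normal((n // 2, x.size))
deltas = np.concatenate([half, -half])          # antithetic pairs
F = lambda t: float(t @ t)
est = sum(F(x + sigma * d) * d for d in deltas) / (sigma * n)
```

Note that without the antithetic pairing, the $\|x\|^2\,\delta_j/\sigma$ term would only vanish on average, blowing up the variance as $\sigma \to 0$; with pairing it cancels exactly, which is the whole point of the construction.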

PS: regarding finite differences: this estimator is actually designed to *replace* finite differencing, which is infeasible in high dimensions; hence the use of this cheaper randomized natural evolutionary strategies approach instead.