One-point gradient estimator and Stokes' theorem.

273 Views Asked by At

In bandit convex optimization, we are only given access to zeroth-order oracle of a function but not first-order (gradient) oracle. Hence, people often use some one-point gradient estimator to approximate the gradient of a function at a certain point. More specifically, let $f:\mathbb{R}^d\mapsto \mathbb{R}$ be a function in $\mathbb{R}^d$, $\mathbb{S}$ be the unit sphere in $\mathbb{R}^d$, $\mathbb{B}$ be the unit ball in $\mathbb{R}^d$. For a point $\mathbb{x}\in\mathbb{R}^d$, if we want to estimate $\nabla f(\mathbf{x})$, we randomly sample a $\mathbf{u}$ from $\mathbb{S}$. Let $\delta>0$ be a small positive scalar. We query the zeroth-order oracle and obtain $f(\mathbf{x}+\delta \mathbf{u})$. We then use $\frac{f(\mathbf{x}+\delta \mathbf{u})d\mathbf{u}}{\delta}$ as an approximation of $\nabla f(\mathbf{x})$. Lemma 1 in this paper proved that the expectation of the approximation is the gradient of a smoothed version of $f$, that is let $\hat{f}(\mathbf{x})=\mathbb{E}_{\mathbf{v}\in\mathbb{B}}[f(\mathbf{x}+\delta \mathbf{v})]$, then $\nabla \hat{f}(x)=\mathbb{E}_{\mathbf{u}\in \mathbb{S}}\big[\frac{f(\mathbf{x}+\delta \mathbf{u})d\mathbf{u}}{\delta}\big]$. The proof invoked Stokes's theorem to show the following (equation (15) in the paper): \begin{align} \nabla \int_{\delta \mathbb{B}}f(\mathbf{x}+\mathbf{v})\mathrm{d}\mathbf{v}=\int_{\delta \mathbb{S}}f(\mathbf{x}+\mathbf{u})\frac{\mathbf{u}}{\|\mathbf{u}\|}\mathrm{d}\mathbf{u}. \end{align} I'm trying to understand how exactly does the above follow from the Stokes's theorem. Is it that the curl of the vector field defined as $f(\mathbf{x}+\mathbf{u})\frac{\mathbf{u}}{\|\mathbf{u}\|}$ somehow is related to the left-hand-side?

1

There are 1 best solutions below

4
On

WLOG, $\delta=1, x=0$. Pick arbitrary (constant) vector $c$. Let $F=f c$. Then by divergence theorem

$$\int_S f(u) c \cdot u\: d u = \int_B \operatorname{div} (f(v)c) \: dv=\int_B (\nabla f \cdot c) \: d v$$

or $$c \cdot \int_S f(u) u \:d u = c \cdot \int_B (\nabla f) \:d v$$

Since $c$ was arbitrary, $$\int_S f(u) u \: d u = \int_B (\nabla f) \:d v$$

(Note: you can use $c=e_1$ to get that the two vectors have the same first coordinate, then use $c=e_2$ to get that they have the same second coordinate etc. ; but I like the above coordinate-less way better.)

If $f$ is regular enough, you can exchange $\nabla$ and $\int$ and recover the formula in the question.


You don't need Stokes. Again, it clearly suffices to treat the case $\delta =1$ and $x=0$. Let's say our coordinates are $x_0, x_1, ..., x_{d-1}$ and we want to show that the $0$-th components of the LHS and RHS are the same. Writing $v=u_0 e_0+ v_{rest}$ we get that by Fubini's for a fixed scalar $s$

$$\int_{B^d} f(s e_0+v) \: dv =\int_{s-v_0(v_{rest})}^{s+v_0(v_{rest})} \left(\int_{B^{d-1}} f(u_0e_0+v_{rest}) \: d v_{rest}\right) d u_0$$

Here $v_0(v_{rest})=(1-|v_{rest}|^2)^{1/2}$.

LHS has $0$th component

$$ \frac{d}{ds }\int_{B^d} f(s e_0+v) \: dv$$ which by the above formula and the fundamental theorem of calculus is

$$\int_{B^{d-1}} \left(f(v_0(v_{rest})e_0+v_{rest}) -f(-v_0(v_{rest})e_0+v_{rest})\right)\: dv_{rest} $$

The "upper" hemisphere is parameterized by the inverse of the projection map to $B^{d-1}$, whose $(d-1)$-dimensional volume stretching factor at $v=v_0(v_{rest}) e_0 +v_{rest}$ is $ \frac{1}{v_0}$ (proof sketch: using $d-1$ dimensional polar coordinates on $B^{d-1}$ the angular tangent vectors are undistorted, and the radial tangent is stretched by $ \frac{1}{v_0}$, just as in the $d=2$ case). Hence (by definition of integration over a manifold)

$$\int_{B^{d-1}} \left(f(v_0(v_{rest})e_0+v_{rest}) \right)dv_{rest}=\int_{S^{d-1}_{upper}} f(v) v_0 \: dv. $$

Similarly,

$$\int_{B^{d-1}} \left(-f(v_0(v_{rest})e_0+v_{rest}) \right) \: dv_{rest}=\int_{S^{d-1}_{lower}} f(v) v_0\: dv .$$

Adding these we get $ \int_{S^{d-1}} f(v)v_0 \: dv$, aka the $0$th component of the RHS. The desired result follows.


The stretching factor in more detail:

We have $B^d\setminus 0 = S^{d-1} \times (0,1)$ via $(\vec{u}, r) \to r\vec{u}$. This is also diffeomorphic to punctured open hemisphere via $(\vec{u}, r) \to (r\vec{u}, \sqrt{1-r^2})\in S^d$.

The "inverse projection" in these coordinates is thus $i:r\vec{u}\to (r\vec{u}, \sqrt{1-r^2})$. We have that any tangent vector $\delta \vec{u}$ to $B^d\setminus 0$ which is orthogonal to $\vec{u}$ is mapped to the tangent vector to $S^{d}$ equal to $(\delta \vec{u}, 0)$. These tangent vectors are not stretched. The tangent vector $\vec{u}\delta r$ parallel to $u$ is mapped to $ (\vec{u}\delta r, (\frac{d(\sqrt{1-r^2})}{dr})\delta r)=(\vec{u}\delta r, -\frac{r}{\sqrt{1-r^2}}\delta r)$. The image is still orthogonal to all the $\delta \vec{u}$ which are perpendicular to $\vec{u}$, but now has length $\frac{1}{\sqrt{1-r^2}}\delta r $, i.e. is stretched by a factor of $\frac{1}{\sqrt{1-r^2}}$, which is also the volume stretch factor. In our old notation this is $\frac{1}{v_0}$.