Stochastic gradient descent as kernel convolution


I am currently trying to understand the paper: An Alternative View: When Does SGD Escape Local Minima?

There the authors argue that stochastic gradient descent approximates gradient descent on the sequence of functions $g_t(y) = E_{\omega_t}[f(y-\eta\omega_t)]$, where $\omega_t \sim W(x_t)$ (see the discussion after equation 1); that is, $g_t$ is $f$ convolved with the noise distribution $W$. I am trying to understand what the distribution $W(x_t)$ is here. In the usual finite-sum setting we have $f(x) = \frac{1}{n}\sum_{j=1}^n f_j(x)$, and at each step, instead of computing $x_{t+1} = x_t - \eta \nabla f(x_t)$, we compute $x_{t+1} = x_t - \eta \nabla f_k(x_t)$ for a uniformly random $k \in \{1,\ldots, n\}$. Following the paper's notation, the noise would then be $\omega^k_t = \nabla f_k(x_t) - \nabla f(x_t)$.
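To make the noise concrete, here is a minimal numerical sketch (my own toy example, not from the paper) using components $f_j(x) = \frac{1}{2}(x - a_j)^2$, for which $\nabla f_j(x_t) = x_t - a_j$ and $\nabla f(x_t) = x_t - \bar a$. It computes the $n$ possible noise values $\omega^j_t$ and checks that they average to zero, as they must by construction:

```python
import numpy as np

# Toy finite-sum objective: f_j(x) = (x - a_j)^2 / 2, so
# f(x) = (1/n) * sum_j f_j(x), grad f_j(x) = x - a_j,
# and grad f(x) = x - mean(a). (Choice of a_j is arbitrary.)
rng = np.random.default_rng(0)
a = rng.normal(size=5)
n = len(a)

x_t = 0.7  # current iterate (arbitrary)

# Noise vectors omega^j_t = grad f_j(x_t) - grad f(x_t) = mean(a) - a_j
omega = np.array([(x_t - a[j]) - (x_t - a.mean()) for j in range(n)])

print(omega)
# The n possible noise values always average to zero, so W(x_t) is a
# mean-zero distribution supported on n points, each with weight 1/n.
print(omega.mean())
```

Note that the support of $W(x_t)$ moves with the iterate $x_t$ in general; it happens to be constant here only because each $f_j$ is quadratic with the same curvature.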

It is clear that I can only get $n$ different vectors $\omega^1_t, \ldots, \omega^n_t$, each with probability $\frac{1}{n}$, so my conclusion is that $E_{\omega_t}[f(y-\eta\omega_t)] = \frac{1}{n}\sum_{j=1}^n f(y-\eta \omega^j_t)$, i.e. $W(\omega) = \frac{1}{n}\sum_{j=1}^n\delta(\omega-\omega_t^j)$. From the paper I had the impression that the convolution was with a continuous kernel, but it seems to me this cannot be the case unless the objective is itself an integral, $f(x) = \frac{1}{|\Omega|}\int_{\Omega}f_z(x)\,dz$, rather than a finite sum $f(x) = \frac{1}{n}\sum_{j=1}^nf_j(x)$. Am I badly misunderstanding something, or is the kernel in the paper "discrete" (a finite mixture of Dirac deltas)?
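A discrete kernel still smooths $f$, which can be checked numerically. The sketch below (again my own toy example, same quadratic components as above) evaluates the discrete convolution $g_t(y) = \frac{1}{n}\sum_j f(y-\eta\omega^j_t)$ and compares it to the closed form it must satisfy for a quadratic $f$ with unit curvature, namely $g_t(y) = f(y) + \frac{\eta^2}{2}\operatorname{Var}(\omega_t)$, since the noise has mean zero:

```python
import numpy as np

# Same toy setup: f_j(x) = (x - a_j)^2 / 2, f(x) = (1/n) sum_j f_j(x).
rng = np.random.default_rng(1)
a = rng.normal(size=6)
eta = 0.3

f = lambda x: 0.5 * np.mean((x - a) ** 2)

x_t = 0.4
omega = (x_t - a) - (x_t - a.mean())  # omega^j_t = grad f_j - grad f

# Discrete convolution: g(y) = (1/n) * sum_j f(y - eta * omega^j_t)
y = 1.2
g = np.mean([f(y - eta * w) for w in omega])

# For unit-curvature quadratic f and mean-zero noise, a Taylor expansion
# gives g(y) = f(y) + (eta^2 / 2) * Var(omega) exactly: the Dirac-delta
# kernel shifts f upward more where the noise variance is larger.
print(g, f(y) + 0.5 * eta**2 * omega.var())
```

So the convolution with a sum of Dirac deltas is still a well-defined smoothing; what it cannot do is turn a non-differentiable $f$ into a differentiable $g_t$, the way a continuous kernel would.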