Understanding gradient descent as an integral


The update rule in gradient descent on $\mathbf w$ over cost function $J(\mathbf w)$ is given by: $$\mathbf w_{t+1}=\mathbf w_t-\eta\nabla_{\mathbf w}J(\mathbf w_t)$$
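In code, one run of this rule looks like the following minimal sketch (the function name and the quadratic example are illustrative, not from any particular library):

```python
import numpy as np

def gradient_descent(grad_J, w_init, eta=0.1, n_steps=1000):
    """Iterate w_{t+1} = w_t - eta * grad_J(w_t) for n_steps steps."""
    w = np.asarray(w_init, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_J(w)
    return w

# Example: J(w) = 0.5 * ||w||^2 has gradient grad_J(w) = w.
w_opt = gradient_descent(lambda w: w, w_init=[1.0, -2.0])
```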

This rule seems to accumulate (integrate) the signal $\nabla_{\mathbf w}J(\mathbf w_t)$ over $t$ time steps. Therefore, at convergence, is it correct to write the following equation?

$$\mathbf w_{\text{opt}} = \mathbf w_{\text{init}} -\eta\int_0^\infty \nabla_{\mathbf w}J(\mathbf w_t) \, \mathrm{d}t$$
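(In continuous time, the update rule becomes the gradient-flow ODE $\frac{\mathrm{d}\mathbf w}{\mathrm{d}t}=-\eta\nabla_{\mathbf w}J(\mathbf w)$, of which the integral above is the formal solution. As a worked example, for $J(\mathbf w)=\frac{1}{2}\|\mathbf w\|^2$ the ODE reads $\frac{\mathrm{d}\mathbf w}{\mathrm{d}t}=-\eta\mathbf w$ and solves in closed form: $$\mathbf w(t)=e^{-\eta t}\,\mathbf w_{\text{init}}\to\mathbf 0\quad\text{as }t\to\infty.)$$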

So, does that mean that if the integral term above is analytically solvable — I suspect this is not the case in general, but perhaps it is for simple cost functions — we can predetermine $\mathbf w_{\text{opt}}$ given the cost function and the initial position $\mathbf w_{\text{init}}$? Wouldn't that make GD unnecessary in such special cases, for example when $J$ is known to be convex/concave over the domain of $\mathbf w$?
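As a concrete sanity check of the "simple cost function" case: for $J(\mathbf w)=\frac12\mathbf w^\top\mathbf A\mathbf w$ the iterates have the closed form $\mathbf w_t=(\mathbf I-\eta\mathbf A)^t\mathbf w_{\text{init}}$, so the limit can be predetermined without running GD. A minimal numerical sketch, where $\mathbf A$, $\eta$, and $t$ are arbitrary choices:

```python
import numpy as np

# J(w) = 0.5 * w.T @ A @ w with symmetric positive-definite A has
# grad_J(w) = A @ w, so each GD step is w <- (I - eta * A) @ w and
# the iterates have the closed form w_t = (I - eta * A)^t @ w_init.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
w_init = np.array([1.0, -2.0])
eta, t = 0.1, 50

w = w_init.copy()
for _ in range(t):                  # run GD explicitly
    w = w - eta * (A @ w)

w_closed = np.linalg.matrix_power(np.eye(2) - eta * A, t) @ w_init
assert np.allclose(w, w_closed)     # closed form matches the iterates
```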

Best answer:

Differentiating the GD update rule with respect to $t$ (treating $t$ as a continuous variable) gives: $$\frac{\mathrm{d}\mathbf w_t}{\mathrm{d}t}=\frac{\mathrm{d}\mathbf w_{t-1}}{\mathrm{d}t}-\eta\frac{\mathrm{d}\,\nabla_{\mathbf w}J(\mathbf w_{t-1})}{\mathrm{d}t}$$ $$=\frac{\mathrm{d}\mathbf w_{t-1}}{\mathrm{d}t}-\eta\frac{\mathrm{d}\,\nabla_{\mathbf w}J(\mathbf w_{t-1})}{\mathrm{d}\mathbf w_{t-1}}\cdot\frac{\mathrm{d}\mathbf w_{t-1}}{\mathrm{d}t}$$

where $\frac{\mathrm{d}\,\nabla_{\mathbf w}J(\mathbf w_t)}{\mathrm{d}\mathbf w_t}$ is the Hessian $\mathbf H_{\mathbf w}(t)$. Simplifying, we have:

$$\frac{\mathrm{d}\mathbf w_t}{\mathrm{d}t}=\left[\mathbf I-\eta\mathbf H_{\mathbf w}(t-1)\right]\frac{\mathrm{d}\mathbf w_{t-1}}{\mathrm{d}t}$$

Unrolling this recursion over $t$ steps, we can write: $$\frac{\mathrm{d}\mathbf w_t}{\mathrm{d}t}=\left(\prod_{i=1}^{t}\left[\mathbf I-\eta\mathbf H_{\mathbf w}(t-i)\right]\right)\frac{\mathrm{d}\mathbf w_{\text{init}}}{\mathrm{d}t}$$

Integrating with respect to $t$ (heuristically, treating the product of bracketed factors as constant during each step): $$\mathbf w_t=\left(\prod_{i=1}^{t}\left[\mathbf I-\eta\mathbf H_{\mathbf w}(t-i)\right]\right)\mathbf w_{\text{init}}$$

$\mathbf w_{\text{opt}}$ is found if $\lim_{t\to\infty}\mathbf w_t$ exists, in which case $$\mathbf w_{\text{opt}}=\lim_{t\to\infty}\left(\prod_{i=1}^{t}\left[\mathbf I-\eta\mathbf H_{\mathbf w}(t-i)\right]\right)\mathbf w_{\text{init}}=\mathbf M\,\mathbf w_{\text{init}}$$
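As a numerical sanity check of $\mathbf w_t=\mathbf M\,\mathbf w_{\text{init}}$: for a quadratic cost the Hessian is constant along the trajectory and the product relation is exact (for non-quadratic $J$ a GD step is not $[\mathbf I-\eta\mathbf H]\mathbf w_t$ in general, so this only covers the quadratic case). A minimal sketch with an arbitrary positive-definite $\mathbf A$:

```python
import numpy as np

# Check w_t = M @ w_init for the quadratic J(w) = 0.5 * w.T @ A @ w,
# where the Hessian is constant: H_w(t) = A for all t.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
w_init = np.array([1.0, -2.0])
eta, t = 0.1, 50

M = np.eye(2)
w = w_init.copy()
for _ in range(t):
    M = (np.eye(2) - eta * A) @ M   # accumulate one factor [I - eta * H]
    w = w - eta * (A @ w)           # plain GD step, for comparison

assert np.allclose(M @ w_init, w)   # w_t = M @ w_init holds exactly here
```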

This shows that GD amounts to a linear transformation $\mathbf M$ that maps a given $\mathbf w_{\text{init}}$ to $\mathbf w_{\text{opt}}$ (note that $\mathbf M$ itself depends on the trajectory, and hence on $\mathbf w_{\text{init}}$, unless the Hessian is constant). It also shows that the map is deterministic: for a given $\mathbf w_{\text{init}}$ and step size $\eta$, GD will always converge to the same $\mathbf w_{\text{opt}}$.

Furthermore, computing $\mathbf M$ requires knowledge of the infinite product of Hessian factors $\mathbf H_{\mathbf w}(t)$, which is analytically intractable in general. In situations where $\mathbf H_{\mathbf w}(t)$ is known, an analytical solution is indeed possible. Consider, for example, $J(\mathbf w):=\frac{1}{2}\|\mathbf w\|^2$; then $\mathbf H_{\mathbf w}(t)=\mathbf I$ for all $t$, and the above equation simplifies to: $$\mathbf w_{\text{opt}}=\lim_{t\to\infty}(1-\eta)^t\,\mathbf w_{\text{init}}$$

which converges to $\mathbf 0$ for $0<\eta<2$ (since then $|1-\eta|<1$) and diverges for $\eta>2$; at $\eta=2$ the iterates oscillate between $\pm\mathbf w_{\text{init}}$.
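A one-dimensional sketch of this step-size behaviour (the values of $\eta$ and the number of steps are arbitrary choices):

```python
# For J(w) = 0.5 * w**2 each GD step multiplies w by (1 - eta),
# so |w_t| = |1 - eta|**t * |w_init|: it shrinks iff |1 - eta| < 1.
w_init = 1.0
for eta in (0.5, 1.0, 1.5, 2.5):
    w = w_init
    for _ in range(100):
        w *= 1.0 - eta
    print(f"eta = {eta}: w_100 = {w:.3e}")
# eta in (0, 2) drives w toward 0; eta = 2.5 blows up (|1 - eta| = 1.5).
```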


If we must force the form given in the OP, we can add and subtract $\mathbf w_{\text{init}}$ on the right-hand side, then rearrange and express the difference as an integral:

$$\mathbf w_t=\mathbf w_{\text{init}}-\int_0^t\frac{\mathrm{d}}{\mathrm{d}s}\left[\mathbf I-\prod_{i=1}^{s}\left[\mathbf I-\eta\mathbf H_{\mathbf w}(s-i)\right]\right]\mathrm{d}s\;\mathbf w_{\text{init}}$$ $$=\mathbf w_{\text{init}}+\int_0^t\frac{\mathrm{d}}{\mathrm{d}s}\left[\prod_{i=1}^{s}\left[\mathbf I-\eta\mathbf H_{\mathbf w}(s-i)\right]\right]\mathrm{d}s\;\mathbf w_{\text{init}}$$

Thus, the integrand is actually the rate of change of the product of the factors $\left[\mathbf I-\eta\mathbf H_{\mathbf w}(\cdot)\right]$, applied to $\mathbf w_{\text{init}}$, rather than just $\nabla_{\mathbf w}J(\mathbf w_t)$.
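The discrete analogue of this rearrangement is a telescoping sum, which can be checked numerically for a quadratic cost (the matrix $\mathbf A$, $\eta$, and $t$ are arbitrary; $\mathbf M_s$ denotes the product of the first $s$ factors, with $\mathbf M_0=\mathbf I$):

```python
import numpy as np

# Discrete analogue of the integral above: with M_s the product of the first
# s bracketed factors (and M_0 = I), telescoping gives
#   w_t = w_init + sum_{s=1}^{t} (M_s - M_{s-1}) @ w_init,
# i.e. the "integrand" is the rate of change of the product, not grad J.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # constant Hessian of a quadratic J
w_init = np.array([1.0, -2.0])
eta, t = 0.1, 50

M_prev = np.eye(2)
acc = w_init.copy()
for _ in range(t):
    M = (np.eye(2) - eta * A) @ M_prev   # extend the product by one factor
    acc += (M - M_prev) @ w_init         # add the telescoping increment
    M_prev = M

w = w_init.copy()                        # run GD directly for comparison
for _ in range(t):
    w = w - eta * (A @ w)

assert np.allclose(acc, w)
```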