Why does gradient descent make sense?


Suppose I define two functions of $x$ in terms of a convex function $f$ with a unique minimum $x_0$:

$$f_1(x) = 1 \times f(x)$$

$$f_2(x) = 2 \times f(x)$$

Suppose I wanted to minimize each of these functions numerically.
Clearly, the minima of both occur at the same $x_0$.
But if I tried to do gradient descent, I would be performing the updates

$$x \gets x - \lambda \nabla f_k(x)$$

and therefore step sizes would be larger for $f_2$ than $f_1$.
To me, this doesn't make sense. The step sizes should not depend on the scaling factor!
This seems to imply that gradient descent should be normalized so that the step sizes have constant magnitude regardless of the gradient's magnitude.

So what justification is there for having the step size increase with the magnitude of the gradient when it's obvious that scaling the original function will change the gradient but not the location of its extrema?
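To make the asymmetry concrete, here is a minimal sketch assuming the illustrative choice $f(x) = x^2$ (so $\nabla f_1(x) = 2x$ and $\nabla f_2(x) = 4x$); these functions are my own example, not part of the question. The same learning rate produces different iterates for the two functions even though they share the minimizer $x_0 = 0$:

```python
# Minimal sketch: gradient descent on f1(x) = x^2 and f2(x) = 2x^2,
# same learning rate, same starting point -- the iterates differ.

def grad_descent(grad, x0, lam, steps):
    """Plain gradient descent: x <- x - lam * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lam * grad(x)
    return x

grad_f1 = lambda x: 2.0 * x   # gradient of f1(x) = x^2
grad_f2 = lambda x: 4.0 * x   # gradient of f2(x) = 2x^2

x1 = grad_descent(grad_f1, x0=1.0, lam=0.1, steps=25)
x2 = grad_descent(grad_f2, x0=1.0, lam=0.1, steps=25)
print(x1, x2)  # both approach the shared minimum 0, but along different paths
```

Here both runs converge, but the trajectories differ because the update for $f_2$ takes steps twice as large at every point.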


---

If your step size $\lambda$ is small enough, then the update $$x_{t+1} = x_t - \lambda \nabla f_k(x_t)$$ ensures that $f_k(x_{t+1}) \leq f_k(x_t)$, at least for the function you performed the gradient descent step on. So a $\lambda$ that is small enough for $f_1$ may not be small enough for $f_2$.
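A minimal numerical sketch of this point, assuming the illustrative choice $f(x) = x^2$ (my own example): a rate that makes the iterates for $f_1$ shrink makes the iterates for $f_2$ blow up.

```python
# With f(x) = x^2, the rate lam = 0.9 decreases f1(x) = f(x)
# but makes the iterates for f2(x) = 2 f(x) diverge.

def step(x, grad, lam):
    """One gradient descent update."""
    return x - lam * grad(x)

lam = 0.9
x1 = x2 = 1.0
for _ in range(10):
    x1 = step(x1, lambda x: 2.0 * x, lam)   # multiplies x1 by (1 - 1.8) = -0.8
    x2 = step(x2, lambda x: 4.0 * x, lam)   # multiplies x2 by (1 - 3.6) = -2.6

print(abs(x1))  # |(-0.8)^10| ~ 0.11: converging
print(abs(x2))  # |(-2.6)^10| ~ 1.4e4: diverging
```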

The normalization you're talking about is close in spirit to second-order approaches such as Newton's method or the conjugate gradient method.

---

That's one of the reasons that gradient descent is not the best method. This issue is resolved by Newton's method.

Let $H_k(x)=\nabla^2 f_k(x)$ be the Hessian of $f_k(x)$. With Newton's method, your updates become:

$$x_{t+1}=x_t-\lambda H_k(x_t)^{-1}\nabla f_k(x_t)$$

Now, for example, if $f_k(x)=kf(x)$, then $\nabla f_k(x)=k\nabla f(x)$ and $H_k(x)=k\nabla^2 f(x)$, so $H_k(x)^{-1}=\frac{1}{k}(\nabla^2 f(x))^{-1}$. Therefore:

$$x_{t+1}=x_t-\lambda H_k(x_t)^{-1}\nabla f_k(x_t)=x_t-\lambda \frac{1}{k}(\nabla^2 f(x_t))^{-1}k\nabla f(x_t)=x_t-\lambda (\nabla^2 f(x_t))^{-1}\nabla f(x_t)$$

which does not depend on $k$.
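A small sketch of this cancellation in one dimension, assuming the illustrative choice $f(x) = x^4$ (any smooth function with nonzero second derivative at the iterate would do): a Newton step on $f$ and on $2f$ produces the identical iterate.

```python
# 1D Newton step: x <- x - lam * f'(x) / f''(x).
# Scaling f by k scales both f' and f'', so the factor k cancels.

def newton_step(x, grad, hess, lam=1.0):
    return x - lam * grad(x) / hess(x)

grad_f  = lambda x: 4.0 * x**3     # f'(x)  for f(x) = x^4
hess_f  = lambda x: 12.0 * x**2    # f''(x)
grad_2f = lambda x: 2.0 * grad_f(x)   # derivative of 2*f
hess_2f = lambda x: 2.0 * hess_f(x)   # second derivative of 2*f

x0 = 1.5
print(newton_step(x0, grad_f, hess_f))    # 1.0  (x - x/3 = 2x/3)
print(newton_step(x0, grad_2f, hess_2f))  # 1.0  -- the factor 2 cancels
```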

---

Your intuition is correct. Outside of a narrow window around the optimum, the magnitude of the gradient carries no reliable information about how large the step should be. A steep gradient can point toward a nearby optimum or a distant one, and the same is true of a shallow gradient.

The distance to the optimum is only proportional to the gradient's magnitude when the parameter's value is already within a neighborhood of the optimum small enough to contain no inflection points, which is not generally something we can know in advance if we are using gradient descent. That is why it is necessary to multiply the gradient by a small (or even diminishing) step size.

I have personally found, in developing my own algorithms, that it is possible to simply throw away the gradient's magnitude and use a step size which depends only on the number of updates. This is a scale-invariant and more robust approach, which, as a bonus, bypasses the issues of "vanishing" and "exploding" gradients.
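A minimal sketch of this idea, keeping only the sign of the gradient and a hypothetical $c/\sqrt{t}$ schedule; both the schedule and the test functions are illustrative assumptions, not the author's exact algorithm. Since only the gradient's direction is used, $f$ and $2f$ produce identical trajectories:

```python
# Normalized descent: discard the gradient's magnitude and use a step
# size that depends only on the iteration count t (here c / sqrt(t)).
import math

def normalized_descent(grad, x0, steps, c=0.5):
    x = x0
    for t in range(1, steps + 1):
        g = grad(x)
        if g != 0.0:
            x = x - (c / math.sqrt(t)) * (g / abs(g))  # unit-length direction
    return x

grad_f1 = lambda x: 2.0 * x   # gradient of x^2
grad_f2 = lambda x: 4.0 * x   # gradient of 2x^2 -- same sign everywhere

a = normalized_descent(grad_f1, x0=3.0, steps=200)
b = normalized_descent(grad_f2, x0=3.0, steps=200)
print(a, b)  # identical results: only the sign of the gradient is used
```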

---

I would like to add a few remarks to this old thread, along the lines of a more detailed answer I gave here.

Let us consider the minimization problem $$g({\bf a})=\min_{{\bf x}\in A} g({\bf x})$$ for a continuously differentiable function $g:A\to \mathbb{R}$, where $A$ is an open subset of $\mathbb{R}^m$ containing ${\bf a}$. Now, if you have a differentiable curve ${\bf u}:(a,b)\to A$, you can apply the chain rule to obtain $$\frac{d\, g({\bf u}(t))}{dt}= \left\langle {\bf u}'(t), \nabla g({\bf u}(t))\right\rangle,$$ in which $\langle \cdot,\cdot\rangle$ denotes the inner product.

A natural choice for ${\bf u}(t)$ is given by the initial value problem (IVP) $$\left\{\begin{array}{rl}{\bf u}'(t)&=-\alpha \nabla g({\bf u}(t))\\ {\bf u}(0)&={\bf u}_0\end{array}\right.$$ for some $\alpha>0$.

More generally, if you choose the curve ${\bf u}(t)$ given by the IVP $$\left\{\begin{array}{rl}{\bf u}'(t)&=-\beta(t) \nabla g({\bf u}(t))\\ {\bf u}(0)&={\bf u}_0\end{array}\right.$$ for some $\beta(t)>0$ (chosen so that ${\bf u}(t)$ exists), you have $$\frac{d\, g({\bf u}(t))}{dt}= -\beta(t)\|\nabla g({\bf u}(t))\|^2\leq 0,$$ so $g({\bf u}(t))$ is nonincreasing.

When you apply Euler's method to solve this last IVP numerically, you recover, for instance, the gradient descent method with variable step size.
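A minimal sketch of this correspondence, assuming the illustrative choices $g(u) = u^2$ and $\beta(t) = 1/(1+t)$ (both are my own examples): each explicit Euler step of the IVP with step $h$ is exactly a gradient descent update with step size $h\,\beta(t)$.

```python
# Explicit Euler on the gradient flow u'(t) = -beta(t) * grad g(u(t)):
# each Euler step IS a gradient descent update with step size h * beta(t).

grad_g = lambda u: 2.0 * u   # gradient of g(u) = u^2

def euler_gradient_flow(u0, h, n, beta=lambda t: 1.0 / (1.0 + t)):
    u, t = u0, 0.0
    for _ in range(n):
        u = u - h * beta(t) * grad_g(u)   # descent step with lam_t = h * beta(t)
        t += h
    return u

print(euler_gradient_flow(u0=2.0, h=0.1, n=100))  # decreases toward the minimum 0
```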

Alternatively, if you let $s=\rho(t)$, with $\rho(0)=0$ and $\rho'(t)=\beta(t)$, then $$\left\{\begin{array}{rl}{\bf u}'(s)&=\displaystyle\frac{d\,{\bf u}}{dt}\frac{dt}{ds}=- \nabla g({\bf u}(s))\\ {\bf u}(0)&={\bf u}_0\end{array}\right.$$ so the variable step size amounts to a reparametrization of time in the gradient flow.

While trying to find results for $\dot{u}=-\beta(t) \nabla g(u)$ on SearchOnMath, I found this interesting arXiv paper.