What Exactly is Step Size in Gradient Descent Method?


Gradient descent is a numerical optimization method for finding a local or global minimum of a function. It is given by the following formula: $$ x_{n+1} = x_n - \alpha \nabla f(x_n) $$ There is countless content on the internet about this method's use in machine learning. However, there is one thing I don't understand and couldn't find an answer to, even though it is basic.

What exactly is step size $ \alpha $ ?

Wikipedia states that it is a tuning parameter in an optimization algorithm, which I understand, but not enough is said about it for that to count as a definition. Dimensional analysis suggests that its dimensions should be $\frac{(\Delta x)^2}{\Delta y}$, which I am not sure how to interpret.
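For concreteness, here is a minimal sketch of the update rule from the question, applied to the assumed toy objective $f(x) = x^2$ (the function, starting point, and step size here are just illustrative choices, not part of the question):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, n_steps):
    """Iterate the update x_{n+1} = x_n - alpha * grad_f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - alpha * grad_f(x)
    return x

# Minimize f(x) = x^2, whose gradient is 2x; the minimum is at x = 0.
x_min = gradient_descent(lambda x: 2 * x, x0=5.0, alpha=0.1, n_steps=100)
```

Here $\alpha$ literally scales how far each iterate moves along the negative gradient.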




One way to picture it is that $\alpha$ is the step size of a discretization of the differential equation $$ \frac{dx(t)}{dt} = -\nabla f(x(t)) $$ Let's first analyze this differential equation. Given an initial condition $x(0)\in\mathbb{R}^n$, the solution to the differential equation is some continuous-time curve $x(t)$. What property does this curve have? Let's compute the following quantity, the total derivative of $f(x(t))$: $$ \frac{df(x(t))}{dt} = \nabla f(x(t))\cdot \frac{dx(t)}{dt} = -\nabla f(x(t))\cdot \nabla f(x(t)) = -\|\nabla f(x(t))\|^2 \leq 0 $$ with equality only where the gradient vanishes. This means that whatever the trajectory $x(t)$ is, $f(x)$ decreases along it as time progresses! So if our goal is to reach a local minimum of $f(x)$, we could solve this differential equation, starting from some arbitrary $x(0)$, and asymptotically reach a local minimum $f(x^*)$ as $t\to\infty$.

In order to obtain the solution to such a differential equation, we might use a numerical method / numerical approximation. For example, the Euler approximation: $$ \frac{dx(t)}{dt} \approx \frac{x(t+h)-x(t)}{h} $$ for some small $h>0$. Now, let's define $t_n := nh$ with $n=0,1,2,\dots$ as well as $x_n := x(nh)$. Hence, we can write our approximate differential equation as: $$ \frac{x_{n+1} -x_n}{h} = -\nabla f(x_n) $$ or equivalently: $$ x_{n+1} = x_n -h\nabla f(x_n) $$ So this algorithm is the discretization of the continuous-time differential equation we had before. Ideally, as $h\to 0$ we would recover the nice property that $f(x)$ always decreases along the trajectory $x(t)$. Hence, for sufficiently small $h$, and sufficiently regular $f$, the sequence $\{x_n\}_{n\geq 0}$ satisfies the same property: $f(x_n)$ decreases at each step.
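The Euler discretization above can be checked numerically. This is a sketch with an assumed quadratic objective $f(x,y) = x^2 + 3y^2$ (any smooth function would do); it records $f(x_n)$ along the iterates and confirms the monotone decrease for a small $h$:

```python
import numpy as np

def euler_gradient_flow(f, grad_f, x0, h, n_steps):
    """Euler step for dx/dt = -grad f(x), i.e. x_{n+1} = x_n - h * grad_f(x_n).
    Returns the list of objective values f(x_0), f(x_1), ..., f(x_n)."""
    x = np.asarray(x0, dtype=float)
    values = [f(x)]
    for _ in range(n_steps):
        x = x - h * grad_f(x)
        values.append(f(x))
    return values

# Assumed example: f(x, y) = x^2 + 3y^2 with gradient (2x, 6y).
f = lambda x: x[0] ** 2 + 3 * x[1] ** 2
grad_f = lambda x: np.array([2 * x[0], 6 * x[1]])

vals = euler_gradient_flow(f, grad_f, x0=[2.0, 1.0], h=0.05, n_steps=50)
# For a sufficiently small h, f(x_n) decreases at every step.
monotone = all(b < a for a, b in zip(vals, vals[1:]))
```

For this choice of $h$ the sequence $f(x_n)$ is strictly decreasing, mirroring the continuous-time property $\frac{df}{dt}\leq 0$.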

How should you choose $h$ ($\alpha$ in your case)? It depends. For some particular $f(x)$ you may be able to compute the maximum value of $h$ for which the condition $f(x_{n+1})<f(x_n), \forall n\geq 0$ still holds. In most cases, however, you only know that this works for sufficiently small $h$, and you will need to find a suitable value by trial and error.
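For $f(x) = x^2$ this threshold can be computed exactly: the update is $x_{n+1} = (1-2h)\,x_n$, which contracts only when $|1-2h|<1$, i.e. $0 < h < 1$. A short sketch (the particular step sizes are assumed values, one inside and one outside that range) illustrates what happens when $h$ is too large:

```python
def step(x, h):
    # One gradient-descent step on f(x) = x^2, whose gradient is 2x.
    return x - h * 2 * x

def run(h, x0=1.0, n=20):
    x = x0
    for _ in range(n):
        x = step(x, h)
    return abs(x)

small = run(h=0.1)  # |1 - 2h| = 0.8 < 1: iterates shrink toward 0
large = run(h=1.5)  # |1 - 2h| = 2 > 1: iterates blow up
```

With the small step the iterates converge; with the large one they oscillate with growing magnitude, which is exactly the failure mode you probe for when tuning $\alpha$ by trial and error.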

Regarding units: the step size has whatever units are needed to make the algorithm dimensionally consistent. In this case they are not important. This is an algorithm whose dynamics someone designed, not a physical/natural phenomenon.

I hope this helps!