Gradient descent scaling


If gradient descent converges with a learning rate of 0.1 for f(x), should the same learning rate work for g(x) = f(10x)?

I think that $g$ changes more quickly than $f$, so we want to take smaller steps. The norm of the gradient of $g$ will be 10 times the norm of the gradient of $f$, so should the step size be a tenth of 0.1?

Is this reasoning correct?

Accepted answer:

Suppose you're happy with step size(*) $\alpha$ for $f$, and you want to find the equivalent step size $\beta$ for gradient descent on $g$.

In other words, for an arbitrary point $x_0$, after one gradient descent step you want: $$f[10x_0 - \alpha \nabla f(10x_0)] = g[x_0 - \beta \nabla g(x_0)].$$

Since $\nabla g(x_0) = 10\nabla f(10 x_0)$, the right-hand side becomes $$g[x_0 - \beta \nabla g(x_0)] = f[10x_0 - 100\beta \nabla f(10 x_0)],$$ and matching this against the left-hand side gives $\beta = \alpha/100.$
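As a numerical sanity check (not part of the original answer), here is a quick sketch using a hypothetical quadratic $f(x) = x^2$: running gradient descent on $f$ with step size $\alpha$ and on $g(x) = f(10x)$ with step size $\beta = \alpha/100$ produces matched trajectories, in the sense that $y_k = 10 x_k$ and $f(y_k) = g(x_k)$ at every iteration.

```python
# Sanity check of beta = alpha / 100 using the toy function f(x) = x^2.
def f(x):
    return x ** 2

def grad_f(x):
    return 2 * x

def g(x):
    return f(10 * x)            # g(x) = f(10x) = 100 x^2

def grad_g(x):
    return 10 * grad_f(10 * x)  # chain rule: 200 x

alpha = 0.1
beta = alpha / 100              # the claimed equivalent step size

y = 5.0   # starting point for descent on f
x = 0.5   # starting point for descent on g, chosen so that 10*x == y

for _ in range(20):
    y = y - alpha * grad_f(y)
    x = x - beta * grad_g(x)

# The two trajectories stay in lockstep: y_k == 10 * x_k, f(y_k) == g(x_k).
print(y, 10 * x)
print(f(y), g(x))
```

With any other choice of $\beta$ the two iterates drift apart, which is the claim in the derivation above.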

Intuitively, $g$ changes 10 times faster than $f$, so (1) the gradient of $g$ is 10 times larger than that of $f$, and (2) a step of the same size causes ten times more change in $g$ than in $f$. Together these two factors mean the step size for $g$ should be 100 times smaller than $\alpha$.

Or put another way, the stable step size for gradient descent is determined by the reciprocal of the Lipschitz constant of $\nabla f$, and the Lipschitz constant of $\nabla g(x) = 10\nabla f(10x)$ is $10\cdot 10$ times larger than that of $\nabla f$.
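Spelling out that last claim: if $\nabla f$ is $L$-Lipschitz, the chain rule gives $$\|\nabla g(x) - \nabla g(y)\| = 10\,\|\nabla f(10x) - \nabla f(10y)\| \le 10\,L\,\|10x - 10y\| = 100\,L\,\|x - y\|,$$ so the Lipschitz constant of $\nabla g$ is $100L$, and the stable step size (roughly $1/L$ for $f$) shrinks by the same factor of 100.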

(*): Gradient descent has been around for over half a century; "learning rate" is an ML-ism for what used to be called the step size.