Gradient descent with non-Lipschitz continuous gradients


In general, we know that for strongly convex functions for which we can compute the Hessian and hence the Lipschitz constant $L$ of the gradient, gradient descent converges provided the step size satisfies $\alpha < 1/L$. This holds trivially, for example, for gradient descent on $f(x) = x^2$; it appears in every numerical-methods textbook. For this class of functions I could also compute the Hessian and run a second-order method (Newton, or quasi-Newton if the Hessian is only approximated).
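As a concrete sanity check of the textbook case, here is a minimal sketch (names and constants are my own, for illustration): for $f(x) = x^2$ we have $f''(x) = 2$ everywhere, so the gradient is Lipschitz with $L = 2$ and any $\alpha < 1/L = 0.5$ converges.

```python
# Gradient descent on f(x) = x^2, whose gradient 2x is Lipschitz with L = 2.
# Any fixed step alpha < 1/L = 0.5 drives the iterate to the minimizer 0.

def gradient_descent(grad, x0, alpha, n_iters):
    x = x0
    for _ in range(n_iters):
        x = x - alpha * grad(x)
    return x

x_final = gradient_descent(lambda x: 2 * x, x0=10.0, alpha=0.4, n_iters=100)
print(x_final)  # very close to 0
```

Each step here contracts the distance to the minimizer by the factor $|1 - 2\alpha| = 0.2$, which is the usual linear-convergence picture for strongly convex quadratics.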

However, let $f(x) = (x_+)^2 = \max(x, 0)^2$ (the square loss clipped to zero for negative arguments). Then the first derivative $f'(x) = 2\max(x, 0)$ is continuous but no longer differentiable at $0$ (the second derivative jumps there), so $f$ is not twice continuously differentiable. Still, while second-order methods cannot be applied, I can run gradient descent.
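A small sketch of this situation, under my reading that the intended function is $f(x) = \max(x,0)^2$: the derivative is well defined and continuous everywhere, and plain gradient descent runs without trouble despite the missing second derivative at the kink.

```python
# f(x) = max(x, 0)**2: C^1 but not C^2 at x = 0.
# Its derivative g(x) = 2*max(x, 0) is continuous (g(0) = 0),
# but f'' jumps from 0 (left of 0) to 2 (right of 0).

def f(x):
    return max(x, 0.0) ** 2

def g(x):
    return 2.0 * max(x, 0.0)

# Gradient descent still runs, e.g. from a positive starting point:
x = 5.0
for _ in range(200):
    x -= 0.25 * g(x)
print(f(x))  # driven toward the minimum value 0
```

For $x > 0$ the update is just the quadratic case with $L = 2$; for $x \le 0$ the gradient vanishes and the iterate stops, which is consistent with every $x \le 0$ being a minimizer.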

How would one then choose the step size for gradient descent in this case, since the Hessian is not defined everywhere? In my specific example above, I can bound the gradient by a constant, and I can compute one-sided derivatives on either side of the kink, ignoring the discontinuity. But I cannot show that the gradient is Lipschitz continuous. What are the known results in this case?
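One cheap way to probe the Lipschitz question empirically (again assuming $f(x) = \max(x,0)^2$, my reading of the example) is to estimate $\sup_{x \ne y} |f'(x) - f'(y)| / |x - y|$ over a grid straddling the kink:

```python
# Numerical probe of the gradient's Lipschitz constant for f(x) = max(x,0)**2.
# We take the largest difference quotient of g over a grid around the kink at 0.
import itertools

def g(x):
    return 2.0 * max(x, 0.0)

pts = [i / 100.0 for i in range(-200, 201)]  # grid on [-2, 2]
ratios = [abs(g(x) - g(y)) / abs(x - y)
          for x, y in itertools.combinations(pts, 2)]
print(max(ratios))  # 2.0, consistent with a 2-Lipschitz gradient
```

The estimate never exceeds $2$, which matches the fact that $2\max(x,0)$ is globally $2$-Lipschitz even though it fails to be differentiable at $0$; a numerical check like this is of course no proof, only a hint about which constant to try.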

Note: the true problem I am trying to solve is running gradient descent on the Huber loss with an appropriately chosen step size, but the root question is the same: how do we determine an acceptable step size for a function that is not twice differentiable everywhere but has a bounded gradient? Are there special conditions that give a valid step size for such "well behaved" functions, even just in the univariate case?
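For reference, a sketch of the Huber case under the standard parameterization (my assumption, since the question does not fix one): $h_\delta(x) = x^2/2$ for $|x| \le \delta$ and $\delta(|x| - \delta/2)$ otherwise. Its gradient is the clamp of $x$ to $[-\delta, \delta]$, which is both bounded by $\delta$ and $1$-Lipschitz, so a fixed step $\alpha < 1$ is a natural candidate.

```python
# Huber loss gradient (standard parameterization, assumed):
#   h'(x) = x                for |x| <= delta
#   h'(x) = delta * sign(x)  otherwise
# i.e. h'(x) = clamp(x, -delta, delta), which is 1-Lipschitz.

def huber_grad(x, delta=1.0):
    return max(-delta, min(delta, x))

x = 8.0
for _ in range(500):
    x -= 0.5 * huber_grad(x, delta=1.0)
print(abs(x))  # driven toward the minimizer 0
```

Outside the quadratic region the iterate moves by a fixed $\alpha\delta$ per step (the linear part has constant gradient magnitude $\delta$); once it enters $[-\delta, \delta]$ the dynamics reduce to the strongly convex quadratic case with $L = 1$.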