Why do we need sub-gradient methods for non-differentiable functions?



Consider optimizing $f(x) = \max_{i} (a_{i}^T x + b_{i})$. Clearly this is non-differentiable at multiple points (wherever the maximum is attained by more than one index $i$), and the conventional way to solve it is sub-gradient descent.

We know, however, that this function is differentiable almost everywhere. Can't we use ordinary gradient descent and, if we end up at a non-differentiable point, simply perturb $x$ by some small $\epsilon$ and compute the gradient there?

4 Answers

Best answer (4 votes):

Non-differentiability of $f(x)$ can cause problems even in situations where a gradient descent method never encounters a point where the gradient isn't defined. For example, try gradient descent on $f(x)=|x|$ using any fixed step size. For most starting points, the method will not converge even though it never hits the single bad point $x=0$ where $f'(x)$ isn't defined.
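The failure mode described above is easy to reproduce. A minimal sketch (the starting point and step size are arbitrary choices) running fixed-step gradient descent on $f(x)=|x|$:

```python
# Fixed-step gradient descent on f(x) = |x|.
# The iterates never hit x = 0, yet the method does not converge:
# once |x| < step, the sign of the gradient flips every iteration
# and x oscillates between two points at distance ~step/2 from 0.

def grad_abs(x):
    # Gradient of |x|, defined for x != 0.
    return 1.0 if x > 0 else -1.0

def gd_abs(x0, step, iters):
    x = x0
    for _ in range(iters):
        x = x - step * grad_abs(x)
    return x

x_final = gd_abs(x0=0.75, step=0.1, iters=1000)
# |x_final| stalls near 0.05, no matter how many iterations are run.
```

Shrinking the step size shrinks the oscillation but never eliminates it; this is exactly why subgradient methods use diminishing step sizes.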

Answer (0 votes):

There are a number of methods for optimizing a function that is not differentiable at some points. Sub-gradient methods (e.g., sub-gradient projection) have been tailored for efficient implementation and have been shown to work on a wide variety of optimization problems. Yes, you can try to graft an ad-hoc "random perturbation" step onto a traditional gradient-descent algorithm, but there are perils in making the "wrong" perturbation (e.g., perturbing along an unhelpful direction), particularly in high-dimensional search spaces.

Why not simply use one of the well-tested methods based on sub-gradients?

Answer (7 votes):

If you study the convergence proof of gradient descent, you will notice that differentiability is usually not the only requirement: Lipschitz continuity of the gradient is also assumed. The Lipschitz constant $L$ is essential when choosing a fixed step size.
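To illustrate the role of $L$, here is a made-up smooth example, $f(x, y) = \tfrac{1}{2}(x^2 + 10y^2)$, whose gradient $(x, 10y)$ is Lipschitz with constant $L = 10$; the fixed step $1/L$ converges, while a step larger than $2/L$ diverges:

```python
# Fixed-step gradient descent on f(x, y) = 0.5*(x**2 + 10*y**2).
# The gradient (x, 10y) is L-Lipschitz with L = 10, so step = 1/L
# is safe, while any step above 2/L blows up the y-coordinate.

def grad(x, y):
    return x, 10.0 * y

def gd(x, y, step, iters):
    for _ in range(iters):
        gx, gy = grad(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y

x_ok, y_ok = gd(5.0, 5.0, step=1.0 / 10.0, iters=200)   # converges to (0, 0)
x_bad, y_bad = gd(5.0, 5.0, step=0.25, iters=200)       # 0.25 > 2/L: y diverges
```

The y-update with step $t$ multiplies $y$ by $(1 - 10t)$, so divergence occurs exactly when $|1 - 10t| > 1$, i.e. $t > 2/L$.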

Answer (0 votes):

A constant step size can only be used when the function has an $L$-Lipschitz gradient (the step size can then be chosen as $1/L$). However, if the gradient of the function is not Lipschitz but only continuous, we can use a backtracking line search instead. Proposition 1.2.1 in Dimitri Bertsekas's book Nonlinear Programming already establishes convergence of gradient methods with this step-size rule, together with a broad class of approximate descent directions (the exact gradient is not needed).
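A minimal one-dimensional sketch of such a backtracking (Armijo) line search, tried on the illustrative function $f(x) = x^4$, whose gradient $4x^3$ is continuous but not globally Lipschitz (the constants $t_0$, $\beta$, $c$ are conventional choices, not prescribed by the source):

```python
# Armijo backtracking line search for a differentiable 1-D function f.
# Start from a trial step t0 and shrink it by beta until the
# sufficient-decrease condition f(x - t*g) <= f(x) - c*t*g**2 holds.

def backtracking_step(f, grad_f, x, t0=1.0, beta=0.5, c=1e-4):
    g = grad_f(x)
    t = t0
    while f(x - t * g) > f(x) - c * t * g * g:
        t *= beta  # step too long: shrink and retry
    return x - t * g

# f(x) = x**4 has a continuous but not globally Lipschitz gradient,
# so no fixed step size works from every starting point.
f = lambda x: x ** 4
grad_f = lambda x: 4 * x ** 3

x = 10.0
for _ in range(100):
    x = backtracking_step(f, grad_f, x)
# x is now close to the minimizer 0, and f decreased monotonically.
```

With a fixed step, gradient descent on $x^4$ diverges from any sufficiently large starting point; backtracking adapts the step automatically.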

For non-differentiable functions, even when we use a backtracking line search and all the iterates happen to be points of differentiability, the algorithm can still converge to a non-stationary point; see Figure 1 in this paper, https://sites.math.washington.edu/~burke/papers/reprints/96-gs_highlights.pdf, which illustrates this with a Rosenbrock function.

However, for your specific question, I believe the answer is yes if we use a diminishing step size (or one that is square-summable but not summable). You may try some numerical experiments on this. I am sorry, but I do not know of any theoretical result in a published mathematics paper about it. Still, I believe the answer is yes.
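A quick numerical experiment of the kind suggested, on a made-up instance of the questioner's $f(x) = \max_i (a_i^T x + b_i)$ (here $f(x) = \max(|x_1|, |x_2|)$, minimized at the origin), using the diminishing step size $t_k = 1/(k+1)$:

```python
# Subgradient method on f(x) = max_i (a_i . x + b_i) with diminishing
# steps t_k = 1/(k+1) (not summable, square-summable).
# A valid subgradient at x is simply a_j for any index j attaining the max.

def f_and_subgrad(x, A, b):
    vals = [sum(ai * xi for ai, xi in zip(a, x)) + bi for a, bi in zip(A, b)]
    j = max(range(len(vals)), key=lambda i: vals[i])  # an active index
    return vals[j], A[j]

def subgradient_method(x, A, b, iters=2000):
    best = float("inf")  # f is not monotone along iterates: track the best value
    for k in range(iters):
        val, g = f_and_subgrad(x, A, b)
        best = min(best, val)
        t = 1.0 / (k + 1)
        x = [xi - t * gi for xi, gi in zip(x, g)]
    return best

# Illustrative data: f(x) = max(x1, -x1, x2, -x2) = max(|x1|, |x2|), min = 0.
A = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
b = [0.0, 0.0, 0.0, 0.0]
best = subgradient_method([3.0, -2.0], A, b)  # best value approaches 0
```

Note that a fixed step $t$ would leave the iterates oscillating at distance $\sim t$ from the minimizer, which is exactly why the diminishing schedule matters here.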