Why do we need sub-gradient methods for non-differentiable functions?
Consider minimizing $f(x) = \max_{i} (a_{i}^T x + b_{i})$. This function is non-differentiable at the points where the maximizing index changes, and the conventional way to minimize it is sub-gradient descent.
We know, however, that this function is differentiable almost everywhere. Can't we use ordinary gradient descent, and if we happen to land on a non-differentiable point, simply perturb $x$ by some small $\epsilon$ and compute the gradient there?
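For concreteness, here is what a sub-gradient step for this $f$ looks like (a minimal sketch; the matrix $A$ stacking the rows $a_i^T$, the toy problem, and the step-size schedule are my own choices, not part of the question):

```python
import numpy as np

def subgradient_step(x, A, b, step):
    # f(x) = max_i (a_i^T x + b_i).  A valid subgradient at x is a_{i*}
    # for any index i* attaining the max (an "active" affine piece).
    i_star = np.argmax(A @ x + b)
    return x - step * A[i_star]

# Toy problem: f(x) = max(x1, -x1, x2, -x2) = max(|x1|, |x2|), minimized at 0.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.zeros(4)

x = np.array([1.0, 0.5])
for k in range(500):
    # Diminishing steps (sum diverges, steps -> 0), the standard
    # step-size rule for the sub-gradient method.
    x = subgradient_step(x, A, b, step=0.5 / (k + 1))
```

At a non-differentiable point several pieces are active and any of their gradients is a valid subgradient, which is why the `argmax` tie-breaking does not matter for correctness.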
Non-differentiability of $f(x)$ can cause problems even when gradient descent never encounters a point where the gradient is undefined. For example, try gradient descent on $f(x) = |x|$ with any fixed step size $t$. Away from the origin $f'(x) = \pm 1$, so every iterate moves exactly $t$ to the left or right; once $|x| < t$, the iterates jump back and forth across the origin forever. For most starting points the method does not converge, even though it never hits the single bad point $x = 0$ where $f'(x)$ is undefined.
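A quick numerical check of this failure mode (a minimal sketch; the starting point $x_0 = 0.75$ and step $t = 0.5$ are arbitrary choices):

```python
def grad_abs(x):
    # Gradient of f(x) = |x|, valid for x != 0.
    # This particular run never lands exactly on x = 0.
    return 1.0 if x > 0 else -1.0

x, step = 0.75, 0.5
trajectory = [x]
for _ in range(20):
    x -= step * grad_abs(x)
    trajectory.append(x)

# Every step moves exactly 0.5, so the iterates reach 0.25 and then
# oscillate between +0.25 and -0.25; |x| can never drop below 0.25.
print(trajectory[:6])  # [0.75, 0.25, -0.25, 0.25, -0.25, 0.25]
```

Note that perturbing by $\epsilon$ never helps here: the iterates avoid $x = 0$ on their own, yet still fail to converge. A diminishing step size (e.g. $t_k = c/k$) restores convergence, and that is precisely the step-size rule sub-gradient methods use.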