Gradient descent on a function that is non-differentiable at finitely many points


Suppose that I have a (possibly non-convex) function $f$ that maps $\mathbb{R}^N$ to $\mathbb{R}$ and is differentiable at all points except finitely many. I want to use the gradient method $$ x(k+1) = x(k) + \alpha(k) \nabla f |_{x=x(k)}, $$ and show convergence to a local optimum, where $\nabla f |_{x=x(k)}$ is the gradient computed at the point $x(k)$ and $\alpha(k)$ is the step size at iteration $k$. When the gradient is not defined, a random step is taken (say Gaussian) with variance $\alpha(k)$.
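A minimal one-dimensional sketch of this procedure (the helper `iterate` and its signature are my own illustration, not part of the question; `grad_f` is assumed to return `None` where the derivative is undefined):

```python
import random

# Sketch of the update rule above in one dimension:
# x(k+1) = x(k) + alpha(k) * f'(x(k)); where f' is undefined,
# take a Gaussian step with variance alpha(k).
def iterate(x, grad_f, alpha, n_steps):
    for k in range(n_steps):
        a = alpha(k)
        g = grad_f(x)                          # returns None where undefined
        if g is None:
            x += random.gauss(0.0, a ** 0.5)   # std dev = sqrt(variance)
        else:
            x += a * g
    return x
```

For instance, with $f(x) = -x^2$ (so $\nabla f(x) = -2x$, defined everywhere) and constant step size $0.1$, the iterates contract toward the maximizer $0$.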

Assume that, wherever the gradient is defined, its norm is uniformly bounded by $B$, i.e. $\sup_{x} \|\nabla f \| \le B$.

Is it true that the iterates converge to a local optimum?


There are 2 answers below.


Edit: This answers the original question; the question was edited after this answer was written.

No, this does not work.

Consider $f : x \mapsto \frac{1}{x}$, which isn't differentiable at $0$. For any positive value of $\alpha(k)$, if $x(k)$ has a small enough absolute value (e.g. smaller than $\alpha(k)$ and close to $0$), then $x(k+1)$ will have a huge absolute value. I suspect that sequences formed as described in the original post will exhibit chaotic behavior.
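A quick numerical check of this blow-up (a sketch; the step size and starting point are arbitrary choices):

```python
# One step of the update x(k+1) = x(k) + alpha * f'(x(k)) for f(x) = 1/x,
# starting close to the non-differentiable point 0.
def fprime(x):
    return -1.0 / x**2  # derivative of 1/x

alpha = 0.1
x = 0.01
x_next = x + alpha * fprime(x)
print(x_next)  # 0.01 + 0.1 * (-10000) = -999.99, a huge jump away from 0
```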


The result does not hold in such a general case, even if the function has only one global maximizer. For example, there is no way to guarantee that the function does not oscillate too much. Consider the function $g$ with $g(0)=0$ and $$g(x)=-(\sin(1/x)+1.5)\dfrac{x^{2}}{1+x^2}$$ for all $x\not=0$. This function is differentiable everywhere and has a bounded derivative, yet it has infinitely many local minimizers accumulating at $0$. In this setting, the sequence generated by the method starting at any of these minimizers stays at that critical point forever, i.e., $x^{k+1}=x^{k}$ for all $k\in\mathbb{N}$ whenever $x^{0}$ is a minimizer.
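One can verify numerically that critical points of $g$ accumulate near $0$ (a sketch; the interval and grid size are arbitrary choices):

```python
import math

def gprime(x):
    # derivative of g(x) = -(sin(1/x) + 1.5) * x^2 / (1 + x^2), for x != 0
    return (math.cos(1.0 / x) / (1 + x**2)
            - 2 * x * (math.sin(1.0 / x) + 1.5) / (1 + x**2)**2)

# Count sign changes of g' on a fine grid in [0.001, 0.1]: each sign
# change brackets a critical point of g.
n = 200000
xs = [0.001 + i * 0.099 / n for i in range(n + 1)]
signs = [gprime(x) > 0 for x in xs]
changes = sum(signs[i] != signs[i + 1] for i in range(n))
print(changes)  # hundreds of critical points in this small interval alone
```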


If local optimum means maximum or minimum, the function $h:\mathbb{R}\rightarrow\mathbb{R}$ such that $$h(x) = x^3/(1+x^2)$$ for all $x \in \mathbb{R}$ works if you start at the point $x_0=0$: the generated sequence is constant, yet $0$ is neither a maximum nor a minimum (it is an inflection point of a strictly increasing function).
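This is easy to check numerically (a sketch, using a constant step size of $0.1$ as an arbitrary choice): iterating the update from $x_0 = 0$ never moves, since $h'(0)=0$.

```python
def hprime(x):
    # derivative of h(x) = x^3 / (1 + x^2); nonnegative everywhere,
    # and zero only at x = 0
    return (3 * x**2 + x**4) / (1 + x**2)**2

x = 0.0
for k in range(100):
    x = x + 0.1 * hprime(x)  # the update with constant step size 0.1
print(x)  # stays exactly at 0.0, yet 0 is neither a maximum nor a minimum
```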


If you want an example where the function is bounded and local optimum means maximum or minimum, the function $i:\mathbb{R}\rightarrow\mathbb{R}$ such that $$i(x) = x^3/(1+x^4)$$ for all $x \in \mathbb{R}$ works if you start at the point $x_0=0$: the generated sequence is constant, and again $0$ is not an extremum.


A fourth, more striking example, if local optimum means maximum or minimum: the function $j:\mathbb{R}\rightarrow\mathbb{R}$ such that $$j(x)=\dfrac{x}{1+(-x+|x|)^2}+\sin\left(\dfrac{x}{1+(-x+|x|)^2}\right),$$ for all $x \in \mathbb{R},$ is a differentiable function for which it is quite rare for a starting point $x_0<0$ to lead the gradient method to converge to a local maximum. This function has only one local maximum (its global maximum as well), and yet exhibits such behavior because of the huge number of critical points on the negative axis.


I think it is clear that there is no possible assumption that makes your result hold in such generality.