I read the paper *Towards Evaluating the Robustness of Neural Networks* by Carlini and Wagner. It contains this phrase:
"the algorithm [clipped gradient descent] can get stuck in a flat spot where it has increased some component $x_i$ to be substantially larger than the maximum allowed"

I don't understand: why can clipped gradient descent get stuck in a flat spot?
The full context, page 7:

$$ x + \delta \in [0,1]^n $$

2) Clipped gradient descent does not clip $x_i$ on each iteration; rather, it incorporates the clipping into the objective function to be minimized. In other words, we replace $f(x + \delta)$ with $f(\min(\max(x + \delta, 0), 1))$, with the min and max taken component-wise. While solving the main issue with projected gradient descent, clipping introduces a new problem: the algorithm can get stuck in a flat spot where it has increased some component $x_i$ to be substantially larger than the maximum allowed. When this happens, the partial derivative becomes zero, so even if some improvement is possible by later reducing $x_i$, gradient descent has no way to detect this.
I think it means the following. Denote some component of the gradient $[\nabla f]_i = \partial_i f$, and consider the situation mentioned for some $x$ with $x_i \gg 1$. The function actually being differentiated is the clipped objective, so the partial derivative is

$$ \partial_i f(\min(\max(x,0),1)) \approx \frac{f(\min(\max(x+\Delta,0),1)) - f(\min(\max(x,0),1))}{\Delta_i} = 0 $$

where $\Delta = (0,\ldots,0,\Delta_i,0,\ldots,0)$. The numerator vanishes because $\Delta_i$ is small while $x_i \gg 1$, so both $x_i$ and $x_i + \Delta_i$ are clipped to $1$, and (assuming the other components lie in $[0,1]$)

$$ \min(\max(x+\Delta,0),1) = (x_1,\ldots,x_{i-1},1,x_{i+1},\ldots,x_n) = \min(\max(x,0),1). $$

This means the algorithm simply gets stuck there (in that dimension), since that component of the gradient stays at $0$.
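This can be checked numerically with a finite-difference sketch. Here `f` is a hypothetical smooth stand-in for the attack loss (not the paper's actual objective), and `g` is the clipped objective $f(\min(\max(x,0),1))$; the component that has overshot past $1$ gets an exactly-zero numerical derivative:

```python
import numpy as np

def f(z):
    # Hypothetical smooth objective standing in for the attack loss.
    return np.sum((z - 0.3) ** 2)

def g(x):
    # Clipped objective used by "clipped gradient descent":
    # f(min(max(x, 0), 1)), min/max taken component-wise.
    return f(np.clip(x, 0.0, 1.0))

def numerical_grad(func, x, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (func(x + e) - func(x - e)) / (2 * eps)
    return grad

# Component 0 is inside [0, 1]; component 1 has overshot to 1.5,
# "substantially larger than the maximum allowed".
x = np.array([0.5, 1.5])
print(numerical_grad(g, x))  # second component is exactly 0: the flat spot
```

Both `1.5 + eps` and `1.5 - eps` clip to `1`, so the difference quotient for that coordinate is identically zero, and a gradient step never discovers that reducing $x_i$ back below $1$ would help.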