Gradient descent is often introduced with the mean squared error, whose graph in one dimension is a parabola, y = x^2.
Yet we often say that weight adjustment in a neural network by gradient descent can hit a local minimum and get stuck there.
My question is: how is a local minimum possible on a parabola, which has a single global minimum and no other stationary points?
The behavior is parabolic only close to a minimum. The MSE is quadratic in the model's prediction, but as a function of the weights, which enter the prediction through a non-linear model, the error surface is not a parabola, and there can be as many local minima as you want!
Think of a total least-squares line-fitting problem where the data are just four points forming a square. By symmetry, there must be several equally good solutions (the diagonals or the medians), so the error surface has several minima.
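To make this concrete, here is a minimal sketch (my own toy example, not from the answer above) using the deliberately simple non-linear model y_hat = sin(w * x). The MSE is still quadratic in the prediction, yet as a function of the single weight w it is oscillatory and has several local minima:

```python
import numpy as np

# Toy data generated by the model itself at w = 2, so the loss
# has its global minimum (value 0) at w = 2.
x = np.array([1.0, 2.0, 3.0])
y = np.sin(2.0 * x)

def mse(w):
    """MSE of the non-linear model y_hat = sin(w * x)."""
    return np.mean((np.sin(w * x) - y) ** 2)

# Evaluate the loss on a fine grid of weight values.
ws = np.linspace(0.0, 6.0, 601)
losses = np.array([mse(w) for w in ws])

# An interior grid point lower than both neighbours is a
# (numerical) local minimum of the loss curve.
is_local_min = (losses[1:-1] < losses[:-2]) & (losses[1:-1] < losses[2:])
print("local minima found:", int(is_local_min.sum()))
print("best w:", ws[np.argmin(losses)])
```

Running this finds the global minimum near w = 2 plus at least one other local minimum, even though the loss at each data point is "just a parabola" in the prediction. A gradient-descent run started in the wrong basin would converge to the inferior minimum.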