In neural networks, we search for successively better weights by minimizing some error function, typically with gradient descent. The error function need not be convex with respect to the weights of the network.
Recently I came across the use of Newton's root-finding method for minimizing the error function: in each step, the function is approximated by a quadratic, which is then minimized.
Let $f$ be the error function to minimize. Newton's method goes like this: approximate $f$ near the current point $a_1$ by its second-order Taylor polynomial,

$f(a_1+x) \approx f(a_1) + f'(a_1)\,x + \frac{f''(a_1)}{2}x^2.$
Minimizing this quadratic in $x$ (setting its derivative to zero, $f'(a_1) + f''(a_1)\,x = 0$) gives the step

$x=-\frac{f'(a_1)}{f''(a_1)}$
and so
$a_2 = a_1 -\frac{f'(a_1)}{f''(a_1)}$
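As a sanity check, here is a minimal Python sketch of this update rule. The function $f(x) = x^3 - x$ is my own example; it has a local minimum at $x = 1/\sqrt{3}$, and the iteration is started in the convex region around it:

```python
def newton_step(x, df, d2f):
    # One Newton update: x_{k+1} = x_k - f'(x_k) / f''(x_k)
    return x - df(x) / d2f(x)

# Example function f(x) = x^3 - x, so f'(x) = 3x^2 - 1 and f''(x) = 6x
df = lambda x: 3 * x**2 - 1
d2f = lambda x: 6 * x

x = 0.6  # start in the convex region (f''(x) > 0)
for _ in range(20):
    x = newton_step(x, df, d2f)

print(x)  # converges to the local minimum at 1/sqrt(3) ≈ 0.5774
```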
If $f$ is concave around $a_1$ (i.e. $f''(a_1) < 0$), then the update moves in the direction of a local maximum instead of a local minimum.
My question is: how can this method ever work if we find ourselves in a concave region of the error function? There, Newton's method directs us toward a local maximum, which is definitely undesirable.
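To make the failure mode concrete, here is a small Python sketch using my own example function $f(x) = x^3 - x$, which is concave for $x < 0$ (since $f''(x) = 6x < 0$ there). Started in that region, the Newton iteration converges to the local maximum at $-1/\sqrt{3}$:

```python
def newton_step(x, df, d2f):
    # Newton update: x_{k+1} = x_k - f'(x_k) / f''(x_k)
    return x - df(x) / d2f(x)

# f(x) = x^3 - x: f'(x) = 3x^2 - 1, f''(x) = 6x (negative for x < 0)
df = lambda x: 3 * x**2 - 1
d2f = lambda x: 6 * x

x = -0.6  # start where f is concave (f''(x) < 0)
for _ in range(20):
    x = newton_step(x, df, d2f)

print(x)  # ≈ -0.5774 = -1/sqrt(3): the local MAXIMUM of f
```

The step $-f'/f''$ flips sign together with $f''$, so the iteration happily climbs toward the stationary point of the quadratic model, whether that is a minimum or a maximum.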
How is this case handled in neural networks? Am I missing something?