I am currently a bit puzzled by how mini-batch gradient descent can be trapped in a saddle point.
The answer might be trivial, but I don't get it.
You draw a new sample every epoch, and the error is computed on a new batch, so the cost function is only static for each batch, which means the gradient should also change with each mini-batch. But according to this, a vanilla implementation should have issues with saddle points:
Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [19] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
I would think that SGD in particular would have a clear advantage against saddle points, as it fluctuates on its way to convergence. The fluctuation, the random sampling, and the cost function being different for each epoch should be enough reasons for it not to become trapped in one.
For full-batch gradient descent it makes sense that it can be trapped in a saddle point, as the error function is constant.
So how are the first two methods able to get trapped in a saddle point?
What am I missing?
I remember seeing a lecture with a "cocky" title along the lines of "Why be scared of saddle points when using SGD?" which should have discussed the topic... can't seem to find it though.
I think you're right that, generally speaking, saddle points can be reasonably well handled by SGD, but this may well depend on the specific data and model in question.
As you know, there is a tradeoff: as the minibatch size increases, the gradient estimate becomes more accurate but less stochastic (less exploratory); as it decreases, the estimate may be far from the true gradient, which can waste computation (but also allows escaping local extrema). Some methods specifically add random noise (e.g. simulated annealing) or use some other technique (e.g. momentum) to help gradient descent escape local extrema in other ways.
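To make the momentum mechanism concrete, here's a minimal sketch of the classical momentum update (my own toy construction with arbitrary hyperparameters, not taken from any particular paper). The point it illustrates: velocity accumulated on a slope keeps the iterate moving even when the current gradient is exactly zero, which is what lets momentum coast across a plateau.

```python
def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """One classical momentum update on a scalar parameter.

    The velocity v is an exponentially decaying sum of past gradients,
    so on a plateau (grad == 0) the iterate still moves.
    """
    v = beta * v + grad
    w = w - lr * v
    return w, v

w, v = 1.0, 0.0
# Build up velocity on a slope (constant gradient 1.0)...
for _ in range(10):
    w, v = momentum_step(w, v, 1.0)
# ...then hit a flat region (zero gradient): momentum keeps w moving.
w_before = w
w, v = momentum_step(w, v, 0.0)
assert w < w_before  # still descending despite a zero gradient
```

Plain gradient descent would stop dead at the plateau step (the update would be exactly zero), which is precisely the failure mode the quote describes.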
Mathematically speaking, the expected value of the minibatch SGD gradient is the true gradient; thus, a reasonably sized minibatch in an area of near-zero gradient should also be expected to have a near-zero gradient. It may well return a small (though non-zero), randomly oriented gradient, sort of "wandering around" the saddle point. As the quote says, the particularly dangerous case is when the gradient is close to zero over a fairly large area; in such cases, standard minibatch SGD won't be able to escape, and you need other methods (e.g. adaptive ones). In many cases, the algorithm won't actually get stuck at a saddle point; it will just be severely slowed down.
Of course, true local minima are probably more dangerous than saddle points (since the gradient will push you back toward the local optimum if its basin is large enough). However, they are very rare! Especially in high dimensions, most critical points are saddle points. This is intuitively reasonable: far from the true optimum (assuming one exists), there is no reason to assume that the parameters and the underlying model should all conspire to make the objective curve upward in every direction at once. The work by Dauphin et al. has much more to say on this.