OK, this is perhaps an easy question, but I'm stuck, so any help would be appreciated.
The gradient descent algorithm updates the weights as:
$$\textbf{w}_{t+1} = \textbf{w}_{t} - \eta\nabla E(\textbf{w}_{t}) $$
for a function $E(\textbf{w})$ to minimize.
I have read, but cannot prove, that one of the reasons (among others) this algorithm is inefficient is that it follows a zigzag descent path, because consecutive gradients are orthogonal: $\nabla E(\textbf{w}_{t+1})^{T}\nabla E(\textbf{w}_{t}) = 0$.
Why is this true?
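To convince myself, I tried a small numerical check. I am assuming the claim refers to steepest descent with an exact line search on a quadratic (with a fixed $\eta$ the dot product doesn't seem to be exactly zero); the matrix, starting point, and step-size formula below are just my own illustrative choices:

```python
import numpy as np

# Quadratic objective E(w) = 0.5 * w^T A w with an ill-conditioned A
# (A and the starting point are arbitrary illustrative choices).
A = np.diag([1.0, 10.0])

def grad(w):
    return A @ w  # gradient of the quadratic: nabla E(w) = A w

w = np.array([10.0, 1.0])
dots = []
for t in range(10):
    g = grad(w)
    # Exact line-search step for a quadratic: eta_t = g^T g / (g^T A g).
    # With this optimal step size, the next gradient is orthogonal to the
    # current one, which is what produces the zigzag path.
    eta = (g @ g) / (g @ A @ g)
    w = w - eta * g
    dots.append(g @ grad(w))  # inner product of consecutive gradients

print(dots)  # all entries are (numerically) zero
```

In this experiment the inner products indeed come out as zero to machine precision, so the zigzag seems tied to the exact line search rather than to gradient descent with an arbitrary fixed step.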
I think that if I can understand this, I will better understand the classical momentum and Nesterov acceleration techniques.
Thanks.