Proof of orthogonality in the gradient descent algorithm.


Ok, this is perhaps an easy question, but I'm stuck, so any help would be appreciated.

The gradient descent algorithm updates the weights as:

$$\textbf{w}_{t+1} = \textbf{w}_{t} - \eta\nabla E(\textbf{w}_{t}) $$

for a function $E(\textbf{w})$ to minimize.
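To make the setup concrete, here is a minimal numerical sketch of this update rule on a toy quadratic $E(\textbf{w}) = \frac{1}{2}\textbf{w}^{T}A\textbf{w}$ (the matrix $A$ and the learning rate $\eta$ are my own choices, just for illustration):

```python
import numpy as np

# Toy quadratic E(w) = 0.5 * w^T A w, with A symmetric positive definite,
# so the unique minimizer is w = [0, 0] and the gradient is A @ w.
A = np.array([[10.0, 0.0],
              [0.0, 1.0]])

def grad_E(w):
    return A @ w

eta = 0.1                      # fixed learning rate (illustrative choice)
w = np.array([1.0, 1.0])
for _ in range(100):
    w = w - eta * grad_E(w)    # the update w_{t+1} = w_t - eta * grad E(w_t)

print(w)                       # close to the minimizer [0, 0]
```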

I have read, but I can't prove, that one of the reasons (among others) that this algorithm is inefficient is that it follows a zigzag descent path, because $\nabla E(\textbf{w}_{t+1})^{T}\nabla E (\textbf{w}_{t}) = 0 $.

Why is this true?
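Experimenting a bit, the identity seems to hold when $\eta$ is chosen by exact line search at each step (steepest descent), rather than for an arbitrary fixed $\eta$. Here is a quick numerical check on a toy quadratic $E(\textbf{w}) = \frac{1}{2}\textbf{w}^{T}A\textbf{w}$ (the matrix $A$ is my own choice):

```python
import numpy as np

# Quadratic E(w) = 0.5 * w^T A w, gradient g = A @ w.
A = np.array([[10.0, 0.0],
              [0.0, 1.0]])

w = np.array([1.0, 1.0])
dots = []
for _ in range(5):
    g = A @ w
    # Exact line-search step length along -g: eta = (g^T g) / (g^T A g)
    eta = (g @ g) / (g @ (A @ g))
    w = w - eta * g
    # Inner product of the new gradient with the previous one
    dots.append(np.dot(A @ w, g))

print(dots)   # each entry is ~0: consecutive gradients are orthogonal
```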

I think that if I understood this, I would better understand the classical momentum and Nesterov techniques.

Thanks.