First, I would like to ask whether the second-order gradient descent method is the same as the Gauss-Newton method.
There is also something I didn't understand: I read that with Newton's method, the step we take in each iteration is along a quadratic curve in $\mathbb{R}^n$ rather than along a straight line (as in steepest descent). Can anyone explain this statement more clearly?
Many Thanks
The Gauss-Newton method is an approximation of Newton's method, specialized to problems of the form
$$ \underset{\mathbf{x}}{\operatorname{argmin}}\;\mathbf{r}(\mathbf{x})^T\mathbf{r}(\mathbf{x}) $$
In other words, it finds a solution $\mathbf{x}$ that minimizes the squared norm of a nonlinear function $||\mathbf{r}(\mathbf{x})||_2^2$.
If you look at the update step for gradient descent and Gauss-Newton applied to the equivalent problem $\frac{1}{2}\mathbf{r}(\mathbf{x})^T\mathbf{r}(\mathbf{x})$, the relationship becomes clear:
Gradient descent
$$ \begin{align} \mathbf{x}_{n+1} &= \mathbf{x}_n - \mu \nabla\left(\tfrac{1}{2}\mathbf{r}(\mathbf{x}_n)^T\mathbf{r}(\mathbf{x}_n)\right) \\ &= \mathbf{x}_n - \mu\mathbf{J}_r^T\mathbf{r}(\mathbf{x}_n) \end{align} $$
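For concreteness, here is a minimal sketch of this update in Python/NumPy, using a made-up linear residual $\mathbf{r}(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b}$ (so $\mathbf{J}_r = \mathbf{A}$); the matrix, vector, and step size are my own illustrative choices:

```python
import numpy as np

# Made-up linear residual r(x) = A x - b, so J_r = A and the gradient of
# (1/2) r(x)^T r(x) is J_r^T r(x) = A^T (A x - b).
A = np.array([[2.0, 0.0],
              [1.0, 3.0],
              [0.0, 1.0]])
b = np.array([2.0, 4.0, 1.0])

def r(x):
    return A @ x - b

J = A  # the Jacobian of a linear residual is constant

x = np.zeros(2)
mu = 0.1  # fixed step size, assumed small enough for convergence here
for _ in range(200):
    x = x - mu * J.T @ r(x)
# x converges to the least-squares solution, here (1, 1)
```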
Gauss-Newton
$$ \begin{align} \mathbf{x}_{n+1} = \mathbf{x}_n - (\mathbf{J}_r^T\mathbf{J}_r)^{-1}\mathbf{J}_r^T\mathbf{r}(\mathbf{x}_n) \end{align} $$
The structure of the problem makes the Hessian used in Newton's method easy to approximate: the exact Hessian of $\frac{1}{2}\mathbf{r}^T\mathbf{r}$ is $\mathbf{H} = \mathbf{J}_r^T\mathbf{J}_r + \sum_i r_i(\mathbf{x})\nabla^2 r_i(\mathbf{x})$, and Gauss-Newton drops the second term, giving $\mathbf{H} \approx \mathbf{J}_r^T\mathbf{J}_r$ (accurate when the residuals are small or nearly linear). As you said, each step jumps to the minimum of the second-order Taylor approximation around $\mathbf{x}_n$.
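To illustrate the iteration, here is a small self-contained Gauss-Newton sketch in Python/NumPy on a made-up exponential fit with residuals $r_i(\mathbf{x}) = x_0 e^{x_1 t_i} - y_i$; the data and starting point are my own illustrative choices:

```python
import numpy as np

# Made-up data, roughly following y = 2 * exp(0.3 * t)
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 2.7, 3.6, 4.9])

def r(x):
    # residual vector: r_i(x) = x0 * exp(x1 * t_i) - y_i
    return x[0] * np.exp(x[1] * t) - y

def jac(x):
    # Jacobian of r: row i is [d r_i / d x0, d r_i / d x1]
    e = np.exp(x[1] * t)
    return np.column_stack([e, x[0] * t * e])

x = np.array([1.0, 0.5])  # starting guess, close enough to a minimum
for _ in range(20):
    J = jac(x)
    # Gauss-Newton step: solve (J^T J) d = J^T r instead of forming the inverse
    d = np.linalg.solve(J.T @ J, J.T @ r(x))
    x = x - d
# x ends up near (2, 0.3), and the gradient J^T r is driven to ~0
```

Note that the update solves the linear system $(\mathbf{J}_r^T\mathbf{J}_r)\,\mathbf{d} = \mathbf{J}_r^T\mathbf{r}$ rather than inverting $\mathbf{J}_r^T\mathbf{J}_r$ explicitly, which is both cheaper and numerically safer.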
Near a solution, the qualitative behavior is that the approximate second-order (curvature) information allows convergence along a more direct, less "zigzaggy" path, typically much faster than gradient descent. Imagine how the region that is approximated by a quadratic function (the one you "jump across" in an iteration) becomes smaller and smaller; for a sufficiently smooth function, that approximation then becomes more and more accurate.
However, if the initial guess is far from a solution, the (approximated) Hessian can become ill-conditioned. The resulting correction vector is then no longer guaranteed to point in a descent direction: if the angle between it and the steepest-descent direction exceeds 90°, the step locally increases the objective and the method can diverge.
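One practical consequence: before taking the Gauss-Newton step you can cheaply check whether it still points in a descent direction, and otherwise fall back to steepest descent. A sketch of such a safeguard (the function names are my own, not from any particular library):

```python
import numpy as np

# Hypothetical safeguard: accept the Gauss-Newton correction only if it
# still gives a descent direction for (1/2) r(x)^T r(x); otherwise fall
# back to the steepest-descent step.
def gauss_newton_step(J, rx):
    # solves (J^T J) d = J^T r; the update is then x <- x - d
    return np.linalg.solve(J.T @ J, J.T @ rx)

def safeguarded_step(J, rx):
    g = J.T @ rx  # gradient of (1/2) r^T r
    d = gauss_newton_step(J, rx)
    # Since we move along -d, it is a descent direction iff g^T d > 0
    # (angle between -d and -g below 90 degrees). With an exact,
    # well-conditioned J^T J this always holds; it can fail numerically
    # when J^T J is (nearly) singular.
    if g @ d > 0:
        return d
    return g  # steepest-descent fallback
```

In exact arithmetic $\mathbf{J}_r^T\mathbf{J}_r$ is positive semi-definite, so the test only fails in the ill-conditioned cases described above.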