I have the following question on Comparing the Properties of Gradient vs. Newton Optimization Algorithms
In the context of function optimization, we are told these kinds of general characteristics about both types of algorithms:
Gradient Based Optimization Algorithms: Gradient Based Optimization Algorithms (e.g. Gradient Descent) try to find the minimum of a function by repeatedly evaluating its first derivatives. Each evaluation of the first derivatives leads the optimization algorithm to the next point to consider as the possible minimum; this first-derivative-based search for the minimum continues until some convergence criterion is met (e.g. the difference between successive iterates is less than some threshold).
Newton Based Optimization Algorithms: Newton Based Optimization (e.g. BFGS) Algorithms are similar to Gradient Based Methods, except that the next point they search relies on both the first derivatives and the second derivatives of the function they are trying to minimize.
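To make the two update rules concrete, here is a minimal one-dimensional sketch (a toy example of my own construction, not from any particular source — the function names and the learning rate are illustrative assumptions):

```python
# Both methods iterate x_{k+1} = x_k - step; they differ in how the step is chosen.

def gradient_descent_step(x, f_prime, lr=0.1):
    # Move against the first derivative, scaled by a fixed learning rate.
    return x - lr * f_prime(x)

def newton_step(x, f_prime, f_double_prime):
    # Rescale the first derivative by the local curvature (second derivative).
    return x - f_prime(x) / f_double_prime(x)

# Example: f(x) = x^2, whose minimum is at x = 0.
f_prime = lambda x: 2 * x
f_double_prime = lambda x: 2.0

x_gd = gradient_descent_step(5.0, f_prime)        # 4.0 -- a partial step toward 0
x_nt = newton_step(5.0, f_prime, f_double_prime)  # 0.0 -- exact for a quadratic
```

Note that for a quadratic the second-order information is exact, so the Newton step lands on the minimizer in one iteration, while gradient descent's progress depends entirely on the chosen learning rate.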
Part 1: As a result, we can quickly infer the following about both of these algorithms: since Newton Based Optimization Algorithms evaluate both the first and second derivatives of the function at each iteration, whereas Gradient Based Optimization Algorithms evaluate only the first derivative, Newton Methods are expected to be slower, since they typically involve more calculations per iteration. Thus, in any practical application of function minimization in the real world (e.g. engineering, machine learning), if finding the minimum of the function in a shorter amount of time is important, it is natural to believe that Gradient Based Methods might display advantages over Newton Based Methods in this regard. Although proofs are generally required for every statement in mathematics, it seems very intuitive that the same computer would fundamentally require more time to perform a larger number of calculations (i.e. derivative evaluations) than a smaller number.
Part 2: However (in the spirit of tradeoffs), even though Newton Based Methods generally require more time than Gradient Based Methods, Newton Based Methods are said to have the (inherent) ability to provide more "accurate" solutions to function minimization problems. Apparently, this is attributed to the fact that the second derivatives (i.e. the Hessian Matrix) capture information about the local structure, behavior and curvature of the function being optimized, whereas the first derivatives alone capture no such curvature information. Thus, Newton Methods, which evaluate both first and second derivatives of the function, might have an advantage in this regard over Gradient Based Methods, which evaluate only the first derivatives. There are many accounts of Gradient Based Methods "overshooting" the true minimum of the function, as well as getting stuck in saddle points and local minima (e.g. this led to improvements on the classic Gradient Based Methods such as ADAM and RMSprop), whereas (supposedly) Newton Methods, based on the additional information they gain from the second derivatives, can constantly re-adjust and re-orient themselves using the local curvature of the function, and (supposedly) present less of a risk of "overshooting and getting stuck". Thus, if time and computational burden are not significant factors, I have heard that Newton Methods might offer the ability to provide more "accurate" solutions (I purposefully refrain from using the word "convergence" due to the complex nature of convergence for functions with varying levels of convexity) compared to Gradient Based Methods.
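The overshooting behavior described above is easy to reproduce on an ill-conditioned quadratic. The following is a toy demonstration of my own (not taken from any reference): gradient descent with any step size safe for the steep direction crawls along the flat direction, while the Newton step, which rescales by the curvature, lands on the minimizer in one iteration:

```python
# Minimize f(x, y) = 0.5*(x^2 + 100*y^2), an ill-conditioned quadratic
# with its minimum at the origin.

def grad(p):
    x, y = p
    return (x, 100.0 * y)

def gd_step(p, lr):
    g = grad(p)
    return (p[0] - lr * g[0], p[1] - lr * g[1])

def newton_step_quadratic(p):
    # The Hessian here is diag(1, 100); dividing each gradient component
    # by its curvature gives the exact minimizer in one step.
    x, y = p
    return (x - x / 1.0, y - 100.0 * y / 100.0)

p = (1.0, 1.0)
# Gradient descent needs lr < 2/100 to avoid overshooting (diverging)
# in the steep y-direction, so progress in the flat x-direction is slow.
q = p
for _ in range(100):
    q = gd_step(q, 0.019)
print(q)                       # x-coordinate still far from 0
print(newton_step_quadratic(p))  # (0.0, 0.0)
```

With a step size above 2/100 the y-coordinate oscillates with growing amplitude (the classic overshoot); below it, the x-coordinate shrinks only by a factor of 0.981 per step. The Newton step suffers from neither problem on this function.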
My Question relates to the theoretical rationale and justification for these alleged facts:
Although it is pretty straightforward to informally reason out why first derivative and second derivative information is more useful for optimizing functions compared to only first derivative information (e.g. exploiting knowledge of curvature) - are there any mathematical results that illustrate why this might be true? In general - do "saddle points" present more of an obstacle to optimization algorithms compared to "overshooting", or is it the other way around?
In the same way, although it might be straightforward to informally reason out - are there any mathematical results which illustrate why first derivative information might result in optimization algorithms "overshooting and getting stuck" more than optimization algorithms that use both first and second derivative information?
And finally, we know about the inherent relationship between Taylor Series, Taylor Approximations and Optimization Algorithms: Gradient Based Optimization Algorithms approximate the function of interest and search for its minimum based on a First Order Taylor Expansion (which uses only the first derivative), whereas Newton Based Methods do the same using the Second Order Taylor Expansion. We also know that Higher Order Taylor Expansions have less approximation error than Lower Order Taylor Expansions (e.g. Second Order vs. First Order). Thus, do we have any reason to believe that if some optimization algorithm were to approximate the function and search for its minimum using a Third Order Taylor Expansion (i.e. third derivative information), this could present even more advantages (e.g. less overshooting, less getting stuck) compared to algorithms that use only first and second derivative information? Or do we have reasons to believe that any information beyond the second derivative offers diminishing returns, and would be only marginally effective at finding the function's minimum relative to the additional resources required to evaluate it?
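For reference, the Taylor connection can be made explicit (this is the standard textbook derivation, not specific to any one source). The second-order model of $f$ around the current iterate $x_k$ is

$$f(x_k + p) \;\approx\; f(x_k) + \nabla f(x_k)^{\top} p + \tfrac{1}{2}\, p^{\top} \nabla^2 f(x_k)\, p,$$

and minimizing this quadratic model over $p$ (when the Hessian $\nabla^2 f(x_k)$ is positive definite) yields exactly the Newton step $p = -\left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k)$. By contrast, the first-order model $f(x_k) + \nabla f(x_k)^{\top} p$ is linear and unbounded below, so it cannot determine a step length on its own; gradient descent must impose one externally via the learning rate.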
Can someone please comment on this?
Thanks!
I am not able to answer all of your questions, but would nevertheless like to emphasize some points:
1.) To compare the algorithms, the notion of "convergence" is essential and should not be neglected. E.g., Newton's iterative method achieves locally quadratic convergence for sufficiently smooth functions; gradient descent does not.
2.) Computationally, it is not decisive whether you evaluate one derivative or two. Concerning speed, this yields a constant factor (say 2), which vanishes when considering Landau (Big O) notation: for instance, $O(2\sqrt{n})$ is the same class as $O(\sqrt{n})$, and still grows slower in the limit than $O(n)$ despite the constant factor. Note that the best algorithm in Big O notation is not always the best one in practice (e.g. one usually uses Quicksort instead of Mergesort despite Quicksort's worse worst-case behaviour, because its constants are better), but it is the best so often that Big O notation is THE standard tool for comparing algorithm speed.
3.) Higher-order approximations are used in a similar field: the numerical solution of ODEs. See Runge-Kutta methods.
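The locally quadratic convergence mentioned in point 1.) is easy to see numerically. Here is a small illustration of my own (the example function is an assumption chosen for convenience): minimizing $f(x) = x - \ln(x)$, whose minimum is at $x^* = 1$. Since $f'(x) = 1 - 1/x$ and $f''(x) = 1/x^2$, the Newton update simplifies to $x_{k+1} = 2x_k - x_k^2$, so the error satisfies $|1 - x_{k+1}| = (1 - x_k)^2$ — it is squared every iteration:

```python
# Newton's method on f(x) = x - ln(x), minimized at x* = 1.
# f'(x) = 1 - 1/x, f''(x) = 1/x^2; the error is squared at each step.

x = 0.5
errors = []
for _ in range(5):
    x = x - (1 - 1 / x) / (1 / x**2)   # Newton step
    errors.append(abs(1 - x))

print(errors)  # roughly [0.25, 0.0625, 3.9e-3, 1.5e-5, 2.3e-10]
```

The number of correct digits roughly doubles per iteration, which is exactly the "locally quadratic convergence" statement; gradient descent on the same function with a fixed step shrinks the error only by a constant factor per iteration (linear convergence).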