I can understand that it takes a computer more time to calculate, approximate, and invert the Hessian Matrix (of some function) than the Jacobian Matrix. Apparently it is this fact that made the Gradient Descent algorithm so popular: even though Gradient Descent only linearly approximates the target function, this linear approximation is much quicker to work with than the better-quality quadratic approximation provided by an algorithm that uses a Hessian Matrix. This has given algorithms that use the Hessian Matrix a "bad reputation".
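To make the cost asymmetry concrete, here is a toy sketch of my own (using a quadratic function, so the gradient and Hessian have closed forms): for f : R^n -> R, the gradient is n numbers and a gradient step is cheap, while the Hessian is n^2 numbers and a Newton step requires solving an n-by-n linear system (roughly O(n^3)).

```python
import numpy as np

# Toy example (my own, not from any specific algorithm):
# f(x) = 0.5 x'Ax - b'x, so grad f = Ax - b and the Hessian is A.
n = 500
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)      # symmetric positive definite
b = rng.standard_normal(n)
x = np.zeros(n)

grad = A @ x - b                 # n numbers to store
hess = A                         # n^2 numbers; a general f needs ~n^2 evaluations

gd_step = x - 1e-3 * grad                        # gradient descent: O(n) update
newton_step = x - np.linalg.solve(hess, grad)    # Newton: O(n^3) linear solve

print(grad.shape, hess.shape)
```

For this quadratic the Newton step lands on the exact minimizer in one solve, which is precisely the quality-versus-cost trade-off in question.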
Nonetheless, there are still some optimization algorithms that work using the Hessian Matrix:
Conjugate Methods: Unlike Steepest Descent (i.e. Gradient Descent), Conjugate Methods are said not to search for the next point in the direction of the target function's gradient (evaluated at the current point), but rather in a "conjugate direction". Supposedly this makes Conjugate Methods more "robust" to the choice of starting point (e.g. this is apparently relevant in non-convex optimization), since searching for the function's minimum in the direction of its current derivative is not always ideal. However, I don't fully understand the mathematical justification behind this.
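For reference, here is a minimal sketch of linear conjugate gradient on a quadratic (my own toy code, following the standard textbook recursion). One thing worth noting: each direction p is built to be A-conjugate to the previous ones (p_i' A p_j = 0), and the Hessian A only ever appears through the product A @ p, so the full matrix never needs to be formed or inverted.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Minimize 0.5 x'Ax - b'x for symmetric positive definite A.

    Successive search directions are A-conjugate rather than being
    the raw negative gradient at each step.
    """
    x = x0.astype(float).copy()
    r = b - A @ x            # residual = negative gradient at x
    p = r.copy()             # the FIRST direction is steepest descent
    rs_old = r @ r
    for _ in range(max_iter or len(b)):
        Ap = A @ p                     # only a Hessian-VECTOR product
        alpha = rs_old / (p @ Ap)      # exact minimizer along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # enforce conjugacy to previous p
        rs_old = rs_new
    return x
```

On an n-dimensional quadratic this terminates in at most n steps in exact arithmetic, e.g. `conjugate_gradient(np.array([[4., 1.], [1., 3.]]), np.array([1., 2.]), np.zeros(2))` recovers the exact minimizer in two iterations.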
Trust Region Optimization (supposedly used frequently in Reinforcement Learning): Trust Region Optimization approximates the target function over a "circular region" (a ball around the current point) instead of along a line, and the next point to evaluate is chosen within this region. Just thinking out loud, I can guess that maybe Trust Region Optimization searches over a larger region than a line search would, making it more likely to find a better-quality stationary point, but I am not sure whether this potential benefit is worth the definite computational sacrifice, or how relevant this is in non-convex optimization.
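To illustrate the mechanics, here is a bare-bones trust-region loop I sketched (using the simple "Cauchy point" step rather than a full subproblem solver; the constants and update rules are common textbook choices, not from any particular library): a quadratic model built from the gradient and Hessian is minimized inside a ball of radius `radius`, and the radius grows or shrinks depending on how well the model predicted the actual decrease.

```python
import numpy as np

def cauchy_point(g, H, radius):
    """Minimize the model m(p) = g'p + 0.5 p'Hp along -g,
    clipped to the trust-region ball ||p|| <= radius."""
    gnorm = np.linalg.norm(g)
    gHg = g @ H @ g
    if gHg <= 0:
        tau = 1.0  # negative curvature along -g: step to the boundary
    else:
        tau = min(1.0, gnorm**3 / (radius * gHg))
    return -tau * (radius / gnorm) * g

def trust_region_minimize(f, grad, hess, x0, radius=1.0, max_iter=50):
    """Basic trust-region iteration: accept/reject steps and resize
    the region based on the ratio of actual to predicted decrease."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < 1e-8:
            break
        p = cauchy_point(g, H, radius)
        predicted = -(g @ p + 0.5 * p @ H @ p)   # model's promised decrease
        actual = f(x) - f(x + p)
        rho = actual / predicted if predicted > 0 else 0.0
        if rho > 0.75:
            radius *= 2.0       # model is trustworthy: enlarge the region
        elif rho < 0.25:
            radius *= 0.25      # model is poor: shrink the region
        if rho > 0.1:
            x = x + p           # accept the step only if it actually helped
    return x
```

The point the ratio test makes: the quadratic (Hessian-based) model is only trusted where it has recently proven accurate, which is exactly what gives these methods their robustness on non-convex functions.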
Does anyone know why the above optimization algorithms use the Hessian Matrix (and have apparently had great results on the complicated/non-convex functions encountered in the real world), despite the Hessian Matrix being so difficult to calculate?
- In other words, why do Conjugate Methods and Trust Region Optimization work so well despite the fact that they involve dealing with the Hessian?
Thank you!