Complexity of numerical differentiation for general nonlinear functions


In the classical optimization literature, numerical differentiation is often described as a computationally expensive step. For example, quasi-Newton methods are presented as a way to avoid computing first and/or second derivatives when these are "too expensive" to compute.

What are the state-of-the-art approaches to computing derivatives, and what is their time complexity? If this is heavily problem-dependent, I am particularly interested in the computation of first- and second-order derivatives for nonlinear least squares problems, specifically the first-order part (Jacobians).

Best Answer

The case considered here is a single scalar function of many variables, possibly together with a few additional functions for the constraints.

  • In a numerical computation using difference quotients, each derivative "costs" about 1 to 3 extra function evaluations, provided all values used for the different derivatives lie on the same grid. But you also have to take catastrophic cancellation into account, which gets rapidly worse for higher-order derivatives. Using an order-adapted step size mitigates this to some extent, but results in even more function evaluations.

  • Using AD (automatic/algorithmic differentiation), the gradient can be computed in the backward (reverse) mode at the cost of 1 to 3 additional function evaluations (plus the caching of all intermediate values). The second derivative can be computed with a combination of forward and backward sweeps, where each variable adds a unit cost, so the total is about $n$ to $3n$ function evaluations. The efficient computation of higher-order derivatives is an ongoing research topic, especially how to manage the redundancy due to the symmetry of the derivative tensors. Propagating univariate Taylor polynomials in several directions is more efficient in the computation, but has a condition-number problem in the reconstruction of the partial derivatives.
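To make the cost of the first bullet concrete, here is a minimal sketch (my own illustration, not part of the answer) of a central-difference gradient. It uses one pair of function evaluations per variable, so the total cost is $2n$ evaluations for $n$ variables:

```python
import numpy as np

def central_diff_gradient(f, x, h=1e-6):
    """Approximate the gradient of scalar f at x by central difference
    quotients: one +h and one -h evaluation per coordinate, i.e. 2n
    function evaluations in total for n variables."""
    n = len(x)
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

# Illustration: f(x) = x.x has exact gradient 2x.
calls = 0
def f(x):
    global calls
    calls += 1
    return float(np.dot(x, x))

x0 = np.array([1.0, 2.0, 3.0])
g = central_diff_gradient(f, x0)
print(g)      # approximately [2, 4, 6]
print(calls)  # 6 = 2n evaluations for n = 3
```

The cancellation issue mentioned above appears here as the trade-off in choosing `h`: too large and the truncation error dominates, too small and subtracting nearly equal function values destroys significant digits.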

So if the dimension $n$ is large, evaluating the gradient of a scalar function costs around $2$ function evaluations with AD and around $2n$ with difference quotients, which some still find expensive enough that there is a cottage industry of "derivative-free" methods. Frequent evaluation of the second derivative, with its roughly $n^2/2$ distinct entries, is not competitive if it can be avoided, or it is used only for rare (re-)initialization in updating methods like BFGS.
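The constant-factor cost of the reverse mode can be seen in a toy implementation (again my own sketch, not from the answer): one forward evaluation records the computation graph, and one backward sweep then yields all $n$ partial derivatives at once.

```python
class Var:
    """Minimal reverse-mode AD node. The forward pass records, for each
    intermediate value, its parents and the local partial derivatives;
    one backward sweep then accumulates the full gradient, at a cost
    independent of the number of input variables n."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        # Topological order of the recorded graph, then the chain rule
        # applied from the output back to the inputs.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p, _ in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, d in v.parents:
                p.grad += d * v.grad

# Gradient of f(x) = x1*x2 + x2*x3 from one forward + one backward sweep:
x = [Var(1.0), Var(2.0), Var(3.0)]
y = x[0] * x[1] + x[1] * x[2]
y.backward()
print([v.grad for v in x])  # [2.0, 4.0, 2.0]
```

The "caching of all intermediate steps" mentioned in the answer is visible here as the recorded graph: every intermediate `Var` must be kept alive until the backward sweep, which is why reverse-mode AD trades memory for its cheap gradients.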

PS: The fact that gradient computation via the backward sweep is so cheap is the reason it was invented several times independently, among others in neural network training as back-propagation.