In optimisation and elsewhere, it is of interest to approximate the Hessian of a function $f$, given only evaluations of $\nabla f$ at some collection of points $\{ x^i \}_{i = 1}^N$. That is, given data $D_N = \{ ( x^i, \nabla f ( x^i ) ) \}_{i = 1}^N$, one forms a matrix $H$ such that for $x$ near $\{ x^i \}_{i = 1}^N$, $H \approx \nabla^2 f ( x )$.
The approximation scheme I'm most aware of is BFGS, which arises as a quasi-Newton method in optimisation. One aspect of this scheme is that it processes the data $D_N$ sequentially; that is, it requires the interpretation that the pairs $( x^i, \nabla f ( x^i ) )$ arrived in some meaningful order (or it requires making an arbitrary choice of ordering). Moreover, the scheme forms a sequence of approximate Hessians $\{ H^i \}_{i = 1}^N$ such that, roughly speaking, for each $i$ one has $H^i \approx \nabla^2 f ( x^i )$.
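For concreteness, the standard BFGS update of the Hessian approximation (writing $s^i = x^{i+1} - x^i$ and $y^i = \nabla f ( x^{i+1} ) - \nabla f ( x^i )$) is
\begin{align} H^{i+1} = H^i - \frac{H^i s^i ( s^i )^\top H^i}{( s^i )^\top H^i s^i} + \frac{y^i ( y^i )^\top}{( y^i )^\top s^i}, \end{align}
which makes the sequential structure explicit: each $H^{i+1}$ is built from the previous approximation $H^i$ and a single consecutive pair of samples.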
I appreciate that the BFGS scheme is used for the purposes of optimisation, and is not designed to form a global approximation of the Hessian. I understand also that this is basically a good thing; as one is only trying to find a minimiser of $f$, a local approximation is more or less sufficient. In any case, the first point I want to make is that given data $D_N$, which do not naturally admit a sequential structure, BFGS is arguably an unnatural procedure for forming a Hessian approximation.
So, suppose the following:
1. I have the data $D_N$.
2. I want to form an approximation of the Hessian which is valid for $x$ near $\{ x^i \}_{i = 1}^N$.
3. I want my approximation $H = H ( D_N )$ to be invariant under the ordering of the set $D_N$, i.e. the approximation procedure doesn't care which order the data arrived in.
My question is: Is there a standard, efficient procedure for approximating the Hessian of a function from data, which obeys condition (3) above?
Note that BFGS fails condition (3) in general.
My naive attempt at describing such a procedure was to define
\begin{align} \hat{H} = \operatorname{argmin}_H \left\{ \sum_{i \neq j} \left\| \left[ \nabla f ( x^i ) - \nabla f ( x^j ) \right] - H \left[ x^i - x^j \right] \right\|_2^2 \right\} \end{align}
or something of this form, i.e. encourage the matrix $H$ to "interpolate" the gradient differences, and then perhaps pick the simplest such $H$ (simplest w.r.t., e.g., the nuclear or $\ell^1$ norm). But it is not obvious that this can be solved efficiently, especially for high-dimensional problems.
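For what it's worth, the unregularised version of the problem above is a linear least-squares problem in $H$: stacking the pairwise differences $x^i - x^j$ and $\nabla f ( x^i ) - \nabla f ( x^j )$ as rows of matrices $S$ and $Y$, one minimises $\| S H^\top - Y \|_F^2$. A sketch in NumPy (the function name is mine, and the final symmetrisation is a heuristic, not the minimiser over symmetric $H$):

```python
import numpy as np

def hessian_lstsq(X, G):
    """Order-invariant Hessian estimate from gradient samples.

    X : (N, d) array of sample points x^i
    G : (N, d) array of gradients grad f(x^i)

    Minimises sum_{i != j} ||(g^i - g^j) - H (x^i - x^j)||_2^2
    over unconstrained H via a single linear least-squares solve.
    """
    N, d = X.shape
    # All ordered pairs with i != j; the objective is symmetric in
    # (i, j), so using ordered pairs merely doubles every term.
    idx_i, idx_j = np.where(~np.eye(N, dtype=bool))
    S = X[idx_i] - X[idx_j]   # (N*(N-1), d) point differences
    Y = G[idx_i] - G[idx_j]   # (N*(N-1), d) gradient differences
    # min_H ||S H^T - Y||_F^2  =>  H^T solves the least-squares system
    Ht, *_ = np.linalg.lstsq(S, Y, rcond=None)
    H = Ht.T
    # Heuristic symmetrisation: the true Hessian is symmetric, but this
    # is not, in general, the constrained minimiser over symmetric H.
    return 0.5 * (H + H.T)
```

Since the objective sums over unordered pairs, the estimate is invariant under any permutation of the data, and for a quadratic $f$ with samples spanning $\mathbb{R}^d$ it recovers the Hessian exactly. The catch is cost: $H$ has $d^2$ unknowns and there are $O(N^2)$ pairs.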
A related question is the following:
1. I have the data $D_N$, and the ordering of the data is a priori meaningful.
2. I want to form an approximation of the Hessian which is valid for $x$ near $\{ x^i \}_{i = 1}^N$.
3. I want my approximation $H = H ( D_N )$ to be invariant under the direction of time, i.e. suppose the data were instead given as
\begin{align} D'_N &= \{ ( y^i, \nabla f ( y^i ) ) \}_{i = 1}^N \\ \text{where } y^i &= x^{N + 1 - i}. \end{align}
Then, I want that $H (D'_N) = H ( D_N )$.
As before, my question is whether such a procedure exists, whether it can be implemented efficiently, and so on. Again, note that BFGS fails this new condition (3) in general.