My question is: when is the approximation of the Hessian matrix $H=J^TJ$ reasonable?
It is well established that approximating the Hessian with first-order derivatives (the Jacobian), i.e., $H=J^TJ$, is reasonable when we are solving a non-linear least-squares problem (this is the Gauss-Newton method). In other words, that is the case when the cost (energy) function is a sum of squared residuals. This can be derived from Newton's method. See wiki: https://en.wikipedia.org/wiki/Gauss–Newton_algorithm
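For reference, here is a brief sketch of that derivation, writing the cost as $F(\theta)=\frac{1}{2}\|r(\theta)\|^2$ with residual vector $r$ and Jacobian $J=\partial r/\partial\theta$:
$$ \nabla F = J^T r, \qquad \nabla^2 F = J^T J + \sum_i r_i \nabla^2 r_i, $$
and Gauss-Newton simply drops the second term, which is small when the residuals are small or nearly linear in $\theta$, so $\nabla^2 F \approx J^T J$.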
But are there other cases where the approximation $H=J^TJ$ is reasonable? For example, for some general non-linear optimization problems?
I have noticed that some papers (in the field of computer vision) use the Gauss-Newton or Levenberg–Marquardt (L-M) algorithm to solve non-linear, non-least-squares (i.e., general non-linear optimization) problems, which in fact relies on the approximation $H=J^TJ$. But none of them actually explain why this is reasonable.
I have used this strategy in my own research too, and my experiments showed it to be efficient. But I still don't know how to justify the Hessian approximation mathematically. (And I was asked to do so by a reviewer on a recent journal submission.)
So again, are there any hints on how to justify the approximation $H=J^TJ$ for general non-linear optimization problems?
Thank you very much for your kind help!


As you noted, when solving $ \min_\phi \frac{1}{2}||y - f_\phi||^2 $, the use of $H\approx J^TJ$ is reasonable (following from the Taylor expansion of the objective).
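As a quick numerical illustration of that point (a minimal sketch with a made-up model and data, not taken from any particular paper), one can compare the exact Hessian of a toy least-squares cost with $J^TJ$ and see that they agree up to a term that scales with the residuals:

```python
import numpy as np

# Toy nonlinear least squares: cost(phi) = 0.5 * ||y - f(phi)||^2
# with an illustrative model f(phi) = exp(A @ phi).
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
phi_true = np.array([0.3, -0.2, 0.1])
noise = 0.01 * rng.normal(size=20)        # small residuals -> good approximation
y = np.exp(A @ phi_true) + noise

def residual(phi):
    return y - np.exp(A @ phi)

def jacobian(phi):
    # d r_i / d phi_j = -exp(a_i . phi) * a_ij
    return -np.exp(A @ phi)[:, None] * A

def grad(phi):
    # gradient of the cost: J^T r
    return jacobian(phi).T @ residual(phi)

def exact_hessian(phi, eps=1e-6):
    # central finite differences of the gradient
    d = len(phi)
    H = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        H[:, j] = (grad(phi + e) - grad(phi - e)) / (2 * eps)
    return H

phi = phi_true + 0.05                     # evaluate near the optimum
J = jacobian(phi)
print(np.round(exact_hessian(phi), 3))
print(np.round(J.T @ J, 3))               # close when residuals are small
```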
This case is also considered in "Why is the approximation of Hessian $=J^TJ$ reasonable?"
So, you didn't post any examples of computer vision papers using this approximation, but I'm going to take a stab in the dark regarding a closely related result.
The reason is that many problems in computer vision can be written as optimizing the log-likelihood $\mathcal{L}(\theta|X)=\log p(X|\theta)$ of a model with parameters $\theta$ given data $X=(x_1,\ldots,x_n)$. Having observed some $x$, we can evaluate the log-likelihood $\log p(x|\theta)$. E.g., given some pixels and an explanatory model, we can assume Gaussian noise, so that the log-likelihood is the negative squared error (up to constants). So to fit a model, we usually solve something like $$\max_\theta \mathbb{E}_X\left[\mathcal{L}(\theta|X)\right] \approx \max_\theta \frac{1}{n} \sum_i \log p(x_i|\theta), $$ or, equivalently, minimize the negative log-likelihood.
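To make the Gaussian-noise example concrete: if $x = f_\theta + \varepsilon$ with $\varepsilon\sim\mathcal{N}(0,\sigma^2 I)$ and $x\in\mathbb{R}^m$, then $$ \log p(x|\theta) = -\frac{1}{2\sigma^2}\|x - f_\theta\|^2 - \frac{m}{2}\log(2\pi\sigma^2), $$ so maximizing the log-likelihood is exactly minimizing the squared error.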
So we essentially want to optimize $L_\theta(x) = \log p(x|\theta)$ with respect to $\theta$. The gradient (score) and Hessian are given by $$ v_x(\theta)=\nabla_\theta L_\theta(x) \qquad\text{and}\qquad \mathcal{H}_x(\theta) = H[L_\theta(x)], \;\text{ where }\; \mathcal{H}_{ij}=\frac{\partial v_j}{\partial\theta_i} =\frac{\partial^2 L_\theta}{\partial\theta_i\partial\theta_j}, $$ which we can use for second-order optimization. Here $v_x(\theta) \in\mathbb{R}^{d\times 1}$ for $\theta\in\mathbb{R}^{d}$.
The relationship to "information" is the key: under standard regularity conditions, the Fisher information matrix $\mathcal{I}(\theta)$ satisfies the information matrix equality $$ \mathcal{I}(\theta) \;=\; \mathbb{E}_x\!\left[v_x(\theta)\,v_x(\theta)^T\right] \;=\; -\,\mathbb{E}_x\!\left[\mathcal{H}_x(\theta)\right], $$ i.e., the expected outer product of the per-datum gradients (the scores) equals the negative expected Hessian of the log-likelihood. If you stack the per-datum gradients as the rows of a matrix $J$, the empirical version of the middle expression is exactly $\frac{1}{n}J^TJ$, which is what justifies using $J^TJ$ as a (negative-)Hessian approximation for general likelihood-based objectives. Note that the model must be correctly specified, and the equality holds at the true parameter.
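A quick way to sanity-check this equality is a toy, correctly specified model; the sketch below (my own illustrative example) uses a 1-D Gaussian with unknown mean, where the score and Hessian of the log-likelihood are available in closed form:

```python
import numpy as np

# Information matrix equality check for x ~ N(mu, sigma^2) with theta = mu.
# log p(x|mu) = -(x - mu)^2 / (2 sigma^2) + const
# score:   v_x(mu) = (x - mu) / sigma^2
# Hessian: H_x(mu) = -1 / sigma^2
rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 2.0, 200_000
x = rng.normal(mu, sigma, size=n)

scores = (x - mu) / sigma**2              # evaluated at the true parameter
emp_fisher = np.mean(scores**2)           # E[v v^T]  (here 1x1, i.e. (1/n) J^T J)
neg_mean_hess = 1.0 / sigma**2            # -E[H], constant for this model

print(emp_fisher, neg_mean_hess)          # both ~ 0.25 = 1/sigma^2
```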
Its use in optimization for maximum-likelihood model fitting is discussed in, e.g., Mai et al., *On Optimization Algorithms for Maximum Likelihood Estimation*. In practice, (1) we only have a Monte Carlo (empirical) estimate of the expectation, (2) the model is likely mis-specified (meaning that even at the optimum the equality will not hold exactly), and (3) during optimization $\theta$ may be far from the optimum anyway, so the $J^TJ$ (empirical Fisher) approximation should be treated as a well-motivated heuristic rather than an exact identity.
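In that spirit, the update used in such papers looks roughly like a damped Fisher-scoring / Levenberg-Marquardt step built from per-datum gradients. Below is a minimal sketch, assuming hypothetical callables `per_datum_grads(theta)` (returning the $n\times d$ matrix $J$ whose rows are per-datum gradients) and `grad(theta)` (the gradient of the objective being minimized); both are placeholders for your own problem:

```python
import numpy as np

def lm_step(theta, per_datum_grads, grad, lam=1e-3):
    """One damped Gauss-Newton / Fisher-scoring step (illustrative sketch).

    per_datum_grads(theta) -> (n, d) matrix J whose rows are per-datum gradients
    grad(theta)            -> (d,) gradient of the objective being minimized
    lam                    -> Levenberg-Marquardt damping
    """
    J = per_datum_grads(theta)
    g = grad(theta)
    H_approx = J.T @ J + lam * np.eye(J.shape[1])   # J^T J in place of the Hessian
    delta = np.linalg.solve(H_approx, -g)           # solve (J^T J + lam I) delta = -g
    return theta + delta
```

The damping $\lambda$ plays the usual L-M role of interpolating between this Gauss-Newton-like step and plain gradient descent.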