My question is: when is the approximation of the Hessian matrix $H=J^TJ$ reasonable?
It is well established that approximating the Hessian with first-order derivatives (the Jacobian), i.e., $H=J^TJ$, is reasonable when we are solving a non-linear least-squares problem (this is the Gauss-Newton method). In other words, that is the case when the cost (energy) function is a sum of squared residuals. This can be derived from Newton's method. See wiki: https://en.wikipedia.org/wiki/Gauss–Newton_algorithm
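For reference, here is a brief sketch of that derivation, writing the cost as $F(\theta)=\frac{1}{2}\|r(\theta)\|^2$ with residual vector $r$ and Jacobian $J=\partial r/\partial\theta$:
$$ \nabla F = J^T r, \qquad \nabla^2 F = J^T J + \sum_i r_i \nabla^2 r_i, $$
and Gauss-Newton simply drops the second term, which is small when the residuals are small or nearly linear in $\theta$, so $\nabla^2 F \approx J^T J$.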
But are there other cases where the approximation $H=J^TJ$ is reasonable? For example, for some general non-linear optimization problems?
I have noticed that some papers (in the field of computer vision) use the Gauss-Newton or Levenberg–Marquardt (L-M) algorithm to solve non-linear, non-least-squares (i.e., general non-linear optimization) problems, which in fact relies on the approximation $H=J^TJ$. But none of them actually explain why this is reasonable.
I have used this strategy in my own research too, and my experiments showed it to be efficient. But I still don't know how to justify the Hessian approximation mathematically. (And I was asked to do so by a reviewer on a recent journal submission.)
So again, are there any hints on how to justify the approximation $H=J^TJ$ for general non-linear optimization problems?
Thank you very much for your kind help!


As you noted, when solving $ \min_\phi \frac{1}{2}||y - f_\phi||^2 $, the use of $H\approx J^TJ$ is reasonable (following from the Taylor expansion of the objective).
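As a quick numerical illustration of that point (a minimal sketch with a made-up model and data, not taken from any particular paper), one can compare the exact Hessian of a toy least-squares cost with $J^TJ$ and see that they agree up to a term that scales with the residuals:

```python
import numpy as np

# Toy nonlinear least squares: cost(phi) = 0.5 * ||y - f(phi)||^2
# with an illustrative model f(phi) = exp(A @ phi).
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
phi_true = np.array([0.3, -0.2, 0.1])
noise = 0.01 * rng.normal(size=20)        # small residuals -> good approximation
y = np.exp(A @ phi_true) + noise

def residual(phi):
    return y - np.exp(A @ phi)

def jacobian(phi):
    # d r_i / d phi_j = -exp(a_i . phi) * a_ij
    return -np.exp(A @ phi)[:, None] * A

def grad(phi):
    # gradient of the cost: J^T r
    return jacobian(phi).T @ residual(phi)

def exact_hessian(phi, eps=1e-6):
    # central finite differences of the gradient
    d = len(phi)
    H = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        H[:, j] = (grad(phi + e) - grad(phi - e)) / (2 * eps)
    return H

phi = phi_true + 0.05                     # evaluate near the optimum
J = jacobian(phi)
print(np.round(exact_hessian(phi), 3))
print(np.round(J.T @ J, 3))               # close when residuals are small
```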
This case is also considered in "Why is the approximation of Hessian $=J^TJ$ reasonable?"
So, you didn't post any examples of computer vision papers using this approximation, but I'm going to take a stab in the dark regarding a closely related result.
The reason is that many problems in computer vision can be written as optimizing the log-likelihood $\mathcal{L}(\theta|X)=\log p(X|\theta)$ of a model with parameters $\theta$ given data $X=(x_1,\ldots,x_n)$. Having observed some $x$, we can evaluate the log-likelihood $\log p(x|\theta)$. E.g., given some pixels and an explanatory model, we can assume Gaussian noise, so that the log-likelihood is the negative squared error (up to constants). So to fit a model, we usually solve something like $$\max_\theta \mathbb{E}_X\left[\mathcal{L}(\theta|X)\right] \approx \max_\theta \frac{1}{n} \sum_i \log p(x_i|\theta), $$ or, equivalently, minimize the negative log-likelihood.
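To make the Gaussian-noise example concrete: if $x = f_\theta + \varepsilon$ with $\varepsilon\sim\mathcal{N}(0,\sigma^2 I)$ and $x\in\mathbb{R}^m$, then $$ \log p(x|\theta) = -\frac{1}{2\sigma^2}\|x - f_\theta\|^2 - \frac{m}{2}\log(2\pi\sigma^2), $$ so maximizing the log-likelihood is exactly minimizing the squared error.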
So we essentially want to optimize $L_\theta(x) = \log p(x|\theta)$ with respect to $\theta$. The gradient (score) and Hessian are given by $$ v_x(\theta)=\nabla_\theta L_\theta(x) \qquad\text{and}\qquad \mathcal{H}_x(\theta) = H[L_\theta(x)], \;\text{ where }\; \mathcal{H}_{ij}=\frac{\partial v_j}{\partial\theta_i} =\frac{\partial^2 L_\theta}{\partial\theta_i\partial\theta_j}, $$ which we can use for second-order optimization. Here $v_x(\theta) \in\mathbb{R}^{d\times 1}$ for $\theta\in\mathbb{R}^{d}$.
The relationship to "information" is the key: under standard regularity conditions, the Fisher information matrix $\mathcal{I}(\theta)$ satisfies the information matrix equality $$ \mathcal{I}(\theta) \;=\; \mathbb{E}_x\!\left[v_x(\theta)\,v_x(\theta)^T\right] \;=\; -\,\mathbb{E}_x\!\left[\mathcal{H}_x(\theta)\right], $$ i.e., the expected outer product of the per-datum gradients (the scores) equals the negative expected Hessian of the log-likelihood. If you stack the per-datum gradients as the rows of a matrix $J$, the empirical version of the middle expression is exactly $\frac{1}{n}J^TJ$, which is what justifies using $J^TJ$ as a (negative-)Hessian approximation for general likelihood-based objectives. Note that the model must be correctly specified, and the equality holds at the true parameter.
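A quick way to sanity-check this equality is a toy, correctly specified model; the sketch below (my own illustrative example) uses a 1-D Gaussian with unknown mean, where the score and Hessian of the log-likelihood are available in closed form:

```python
import numpy as np

# Information matrix equality check for x ~ N(mu, sigma^2) with theta = mu.
# log p(x|mu) = -(x - mu)^2 / (2 sigma^2) + const
# score:   v_x(mu) = (x - mu) / sigma^2
# Hessian: H_x(mu) = -1 / sigma^2
rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 2.0, 200_000
x = rng.normal(mu, sigma, size=n)

scores = (x - mu) / sigma**2              # evaluated at the true parameter
emp_fisher = np.mean(scores**2)           # E[v v^T]  (here 1x1, i.e. (1/n) J^T J)
neg_mean_hess = 1.0 / sigma**2            # -E[H], constant for this model

print(emp_fisher, neg_mean_hess)          # both ~ 0.25 = 1/sigma^2
```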
Its use in optimization for maximum-likelihood model fitting is discussed in, e.g., Mai et al., *On Optimization Algorithms for Maximum Likelihood Estimation*. In practice, (1) we only have a Monte Carlo (empirical) estimate of the expectation, (2) the model is likely mis-specified (meaning that even at the optimum the equality will not hold exactly), and (3) during optimization $\theta$ may be far from the optimum anyway, so the $J^TJ$ (empirical Fisher) approximation should be treated as a well-motivated heuristic rather than an exact identity.
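In that spirit, the update used in such papers looks roughly like a damped Fisher-scoring / Levenberg-Marquardt step built from per-datum gradients. Below is a minimal sketch, assuming hypothetical callables `per_datum_grads(theta)` (returning the $n\times d$ matrix $J$ whose rows are per-datum gradients) and `grad(theta)` (the gradient of the objective being minimized); both are placeholders for your own problem:

```python
import numpy as np

def lm_step(theta, per_datum_grads, grad, lam=1e-3):
    """One damped Gauss-Newton / Fisher-scoring step (illustrative sketch).

    per_datum_grads(theta) -> (n, d) matrix J whose rows are per-datum gradients
    grad(theta)            -> (d,) gradient of the objective being minimized
    lam                    -> Levenberg-Marquardt damping
    """
    J = per_datum_grads(theta)
    g = grad(theta)
    H_approx = J.T @ J + lam * np.eye(J.shape[1])   # J^T J in place of the Hessian
    delta = np.linalg.solve(H_approx, -g)           # solve (J^T J + lam I) delta = -g
    return theta + delta
```

The damping $\lambda$ plays the usual L-M role of interpolating between this Gauss-Newton-like step and plain gradient descent.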