According to many resources, trust-region (TR) algorithms often suffer from poor scaling. The simplest remedy is to introduce a scaling matrix $D$ in the following way:
\begin{align} \min_d \ & f + g'd + \frac{1}{2}d'Bd \\ \text{subject to} \ & \left\lVert Dd \right\rVert \le \Delta, \end{align}
where $D$ is a diagonal matrix with positive diagonal elements.
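(If it helps intuition: substituting $\hat d = Dd$ turns the scaled subproblem into an ordinary trust-region subproblem in $\hat d$,
\begin{align} \min_{\hat d} \ & f + (D^{-1}g)'\hat d + \frac{1}{2}\hat d'\left(D^{-1}BD^{-1}\right)\hat d \\ \text{subject to} \ & \left\lVert \hat d \right\rVert \le \Delta, \end{align}
so $D$ effectively preconditions the model.)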
My question is: how does one construct such a matrix? Which properties should it fulfill?
J. J. Moré's paper Recent Developments in Algorithms and Software for Trust Region Methods provides guidance on choosing the diagonal scaling matrix.
He shows that the convergence theorems for trust region optimization still hold provided the diagonal scaling matrices $D_k$ satisfy these properties:
(4.6) $\left\lVert D_k^{-1} \right\rVert \le \sigma_2$ for some $\sigma_2\gt0$
(4.12) $\left\lVert x_k-x_m \right\rVert \le \delta_1$, $k=m,\ldots, l$, $\implies \left\lVert D_l-D_m \right\rVert \le \delta_2$ for $\delta_1,\delta_2\gt0$
If $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is the objective function, he then proposes choosing
(4.13) $D_k=\operatorname{diag}(d_{k,i})$, $d_{k,i}=\max\left(d_{k-1,i},\ \left| \partial_{ii}f(x_k) \right|^{1/2}\right)$
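A minimal sketch of update (4.13) in Python. The helper name `update_scaling` is my own, and I assume the caller can supply the diagonal second derivatives $\partial_{ii}f(x_k)$ as an array:

```python
import numpy as np

def update_scaling(d_prev, hess_diag):
    """More's rule (4.13): d_{k,i} = max(d_{k-1,i}, |f_ii(x_k)|^{1/2}).

    d_prev:    diagonal of D_{k-1} (positive entries)
    hess_diag: diagonal second derivatives of f at x_k
    """
    return np.maximum(d_prev, np.sqrt(np.abs(hess_diag)))

# Example: entries never decrease across iterations.
d0 = np.ones(3)                    # initial scaling D_0 = I
h1 = np.array([4.0, 0.25, -9.0])   # Hessian diagonal at x_1
d1 = update_scaling(d0, h1)        # -> [2.0, 1.0, 3.0]
```

Because each entry is a running maximum, the diagonal of $D_k$ stays bounded away from zero once initialized with positive values.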
I gather the reasoning is that $\left| \partial_{ii}f(x_k) \right|^{1/2}$ scales each coordinate by the square root of its local curvature, so the trust region is stretched along directions in which $f$ is flat, while taking the maximum over past iterates makes the diagonal entries nondecreasing, which keeps $\left\lVert D_k^{-1} \right\rVert$ bounded as required by (4.6).
Moré's other paper The Levenberg-Marquardt algorithm: Implementation and theory has numerical results comparing different scaling methods. The methods there are slightly different and specialized to least-squares problems (the diagonal entries come from column norms of the Jacobian rather than from second derivatives), but they would correspond to these strategies:

(Initial) keep the scaling computed at the starting point fixed: $d_{k,i}=d_{0,i}$;
(Continuous) recompute the scaling at every iterate: $d_{k,i}=\left| \partial_{ii}f(x_k) \right|^{1/2}$;
(Adaptive) recompute but never decrease, i.e. the update (4.13).
For the problems he tested, his results show that (Adaptive) performs best and that (Continuous) is worse than either (Initial) or (Adaptive).
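The three strategies differ only in how the diagonal entries are updated between iterations; a sketch, where `curv` is assumed to hold the current values $\left| \partial_{ii}f(x_k) \right|^{1/2}$ (function names are mine, not Moré's):

```python
import numpy as np

def scale_initial(d_prev, curv, d0):
    # (Initial): keep the scaling computed at the starting point.
    return d0

def scale_continuous(d_prev, curv, d0):
    # (Continuous): rescale from scratch at every iterate.
    return curv

def scale_adaptive(d_prev, curv, d0):
    # (Adaptive): rescale, but never let an entry decrease (rule 4.13).
    return np.maximum(d_prev, curv)

# Example: curvature grows in one coordinate, then shrinks there.
d0 = np.array([1.0, 1.0])
curvs = [np.array([2.0, 0.5]), np.array([0.5, 3.0])]
d_cont, d_adap = d0, d0
for c in curvs:
    d_cont = scale_continuous(d_cont, c, d0)
    d_adap = scale_adaptive(d_adap, c, d0)
# d_cont == [0.5, 3.0]; d_adap == [2.0, 3.0]
```

The adaptive rule remembers the largest curvature seen so far, so the trust region never suddenly re-expands in a direction that was previously ill-scaled, which is one plausible explanation for its better behavior in Moré's tests.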