Suppose $f:\mathbb{R}^2\to\mathbb{R}$ is a smooth function, and suppose that perturbing one variable changes the function value much more than perturbing the other, $$ |f(x+\varepsilon,y)-f(x,y)| \gg |f(x,y+\varepsilon)-f(x,y)| \,. $$ This poor scaling makes it hard for numerical solvers to find the roots of $f$, see here for MATLAB. The MATLAB documentation suggests rescaling the function such that
''each coordinate has about the same effect on the objective function''.
Example (from MathWorks). A poorly scaled function is $f(x,y)= 10^6 x^2 + 10^{-6}y^2$. Instead of solving $f(x,y)=0$ directly, we can try to solve $g(x,y)=0$, where $g(x,y)= f\left(\begin{pmatrix} 10^{-3} & 0 \\ 0 & 10^3 \end{pmatrix}\begin{pmatrix}x \\y\end{pmatrix}\right)=f(10^{-3}x,10^3y)$.
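For concreteness, here is that change of variables as a quick Python/NumPy sketch (my own illustration, just restating the example above):

```python
import numpy as np

# MathWorks' poorly scaled example: f(x, y) = 1e6 * x^2 + 1e-6 * y^2
def f(v):
    x, y = v
    return 1e6 * x**2 + 1e-6 * y**2

# Diagonal rescaling matrix D = diag(1e-3, 1e3) from the example
D = np.diag([1e-3, 1e3])

def g(v):
    # g(x, y) = f(1e-3 * x, 1e3 * y) = x^2 + y^2, which is well scaled:
    # each coordinate now has the same effect on the objective.
    return f(D @ np.asarray(v, dtype=float))
```

After the substitution, $g$ is just $x^2+y^2$, so a unit perturbation in either coordinate changes the value by the same amount.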
Question: Is there a more general approach to rescaling a function? In MathWorks' example, the matrix $\begin{pmatrix} 10^{-3} & 0 \\ 0 & 10^3 \end{pmatrix}$ is pretty obvious, but how could one come up with a suitable rescaling when the problem isn't that transparent? Is there a general recipe for equalizing the effect of the input variables on the function value?
While my question is a bit more general, it originates from my problem of finding the roots of \begin{align*} f(x,y) = \begin{pmatrix} \left(\frac{c_1c_3 }{c_1-c_2}\left(\frac{c_2 c_3}{(y-c_6)(c_1-c_2)}\right)^{-\frac{c_2}{c_1}} +c_6 \right) \frac{c_2}{c_1}x^{c_2-c_1}+\frac{x^{1-c_1}}{c_1c_5}-y \\ \left(\frac{c_1c_3 }{c_1-c_2}\left(\frac{c_2 c_3}{(y-c_6)(c_1-c_2)}\right)^{-\frac{c_2}{c_1}} +c_6\right) x^{c_2}+ c_5x - c_7-yx^{c_1} \end{pmatrix}, \end{align*} where the $c_i$ are constants. I asked MATLAB to minimise $\|f\|$, but the problem turned out to be poorly scaled.
Here's an ad-hoc idea for when your function is Lipschitz-smooth.
In a sense, the scaling issue stems from the fact that you've got very different Lipschitz constants in each component of your function. In particular, if we look at the restriction of $f$ to its first component and guesstimate its Lipschitz constant $L_x$, that number may be very different from the constant $L_y$ of the restriction of $f$ to its second component. To alleviate this, we can rescale each input component by the inverse of the respective Lipschitz constant, which should give an improvement. In math, this means we replace $f$ by $(x,y)\mapsto f\left(\frac{x}{L_x},\frac{y}{L_y}\right)$.
Now, how do we get $L_x$ and $L_y$? The key fact in this scheme is that any $\gamma\in[\sup_{z\in\mathbb{R}}|g'(z)|,\infty)$ is a Lipschitz constant of a differentiable $g:\mathbb{R}\to\mathbb{R}$. Using this fact as inspiration, we can guesstimate choices for $L_x$ and $L_y$ in their respective ranges. Note that if our current point is $(x_0,y_0)$, the "$g$" above is either $y\mapsto f(x_0,y)$ for guesstimating $L_y$ or $x\mapsto f(x,y_0)$ for guesstimating $L_x$. The hope is that, at the very least, our estimates of $L_x$ and $L_y$ will work locally. While $\sup_{z\in\mathbb{R}}|g'(z)|$ might not always be computable, a numerical approximation of the directional derivative is easy to obtain (e.g. via a finite-difference scheme). To be safe and make sure we've really got an upper bound, we can inflate the magnitude of the numerical derivative by a safety factor (e.g. $2\times$, if need be).
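A minimal Python sketch of this guesstimation step. The toy function and the base point $(1,1)$ are my own illustrative choices, not the OP's problem:

```python
def local_lipschitz(f, x0, y0, h=1e-3, safety=2.0):
    """Guesstimate Lipschitz constants of the coordinate restrictions
    x -> f(x, y0) and y -> f(x0, y) at the current point (x0, y0)
    via central differences, inflated by a safety factor."""
    Lx = safety * abs(f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
    Ly = safety * abs(f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)
    return Lx, Ly

# Toy poorly scaled function and an arbitrary base point (my choices):
f = lambda x, y: 1e6 * x**2 + 1e-6 * y**2
Lx, Ly = local_lipschitz(f, 1.0, 1.0)   # roughly 4e6 and 4e-6 here
g = lambda x, y: f(x / Lx, y / Ly)      # the rescaled function

# Sanity check: at the base point in the new coordinates, (Lx*1, Ly*1),
# the two partial derivatives of g have roughly equal magnitude (~0.5),
# i.e. the coordinates' first-order effects are equalised locally.
px = (g(Lx + 1.0, Ly) - g(Lx - 1.0, Ly)) / 2.0
py = (g(Lx, Ly + 1e-7) - g(Lx, Ly - 1e-7)) / 2e-7
```

The sanity check makes the point of the scheme concrete: after dividing each input by its guesstimated Lipschitz constant, a unit move in either scaled coordinate changes $g$ by about the same amount near the base point.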
If you're actually going to implement this, the idea above dictates that you'll want to re-estimate $L_x$ and $L_y$ locally at every iteration. That said, you can probably recycle those derivative evaluations in your minimization method.
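One way the per-iteration re-estimation could look (a Python sketch, not the OP's exact solver; the toy objective with a $10^{12}$ spread stands in for $\|f\|^2$, and the step size is an arbitrary choice of mine):

```python
import numpy as np

def rescaled_descent(phi, z0, alpha=0.1, iters=100, h=1e-3, safety=2.0):
    # Sketch: at every iterate, re-estimate the per-coordinate Lipschitz
    # constants L_i ~ safety * |d phi / d z_i| by central differences,
    # then take a gradient step in the rescaled variables. The same
    # finite differences give the gradient, so the evaluations are
    # recycled, as suggested above.
    z = np.asarray(z0, dtype=float)
    for _ in range(iters):
        grad = np.zeros_like(z)
        for i in range(z.size):
            e = np.zeros_like(z)
            e[i] = h
            grad[i] = (phi(z + e) - phi(z - e)) / (2 * h)
        L = safety * np.maximum(np.abs(grad), 1e-12)  # avoid division by 0
        z = z - alpha * grad / L  # each coordinate moves at most alpha/safety
    return z

# Toy stand-in for ||f||^2, badly scaled across the two coordinates:
phi = lambda z: 1e6 * z[0]**2 + 1e-6 * z[1]**2
z = rescaled_descent(phi, [1.0, 1.0])
```

Despite the huge disparity between the coordinates, both are driven toward the minimiser at a comparable rate, which a single global step size could not achieve here.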