Invariance under transformation of optimization methods


I am reading about Newton and quasi-Newton optimization algorithms, and there is a mention of their invariance (or lack thereof) under transformations and how it helps them tackle ill-posed problems. But I can't seem to understand why that would be so, or what kinds of issues with the Hessian are involved. Any insight/intuition about how invariance is relevant, or any references, would be very helpful.

First of all, here's a proof that Newton's method is invariant with respect to a change of basis:

Newton's method minimizes a smooth function $f:\mathbb R^n \to \mathbb R$ using the iteration $$ x^{k+1} = x^k - Hf(x^k)^{-1} \nabla f(x^k), $$ where $Hf(x^k)$ is the Hessian of $f$ at the current iterate $x^k$.
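As a concrete illustration (my own minimal sketch; the quadratic test function is an assumed example, not part of the original question), here is the iteration in NumPy for $f(x) = \tfrac12 x^T Q x - b^T x$, for which Newton's method reaches the minimizer in a single step:

```python
import numpy as np

# Minimal sketch of the Newton iteration x <- x - Hf(x)^{-1} grad f(x).
def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    x = x0.astype(float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))  # solve Hf(x) s = grad f(x)
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# Assumed example: f(x) = 0.5 x^T Q x - b^T x has gradient Q x - b and
# constant Hessian Q, so the exact minimizer solves Q x = b and Newton
# reaches it in one step.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = newton_minimize(lambda x: Q @ x - b, lambda x: Q, np.zeros(2))
```

Solving the linear system with `np.linalg.solve` rather than forming the explicit inverse is the standard numerically sound choice.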

Let's make a change of variables $y = Ax$, where $A$ is an invertible matrix. Minimizing $f$ with respect to $x$ is equivalent to minimizing $g(y) = f(A^{-1}y)$ with respect to $y$.

From the chain rule, $\nabla g(y) = A^{-T} \nabla f(A^{-1} y)$ and $Hg(y) = A^{-T} Hf(A^{-1}y) A^{-1}$. The Newton's method iteration for minimizing $g$ is \begin{align} y^{k+1} &= y^k - Hg(y^k)^{-1} \nabla g(y^k) \\ &= y^k - A Hf(A^{-1}y^k)^{-1} A^T A^{-T} \nabla f(A^{-1}y^k) \\ &= y^k - A Hf(A^{-1}y^k)^{-1} \nabla f(A^{-1}y^k). \end{align} Multiplying both sides by $A^{-1}$ and defining $x^k = A^{-1}y^k$, we find that $$ x^{k+1} = x^k - Hf(x^k)^{-1} \nabla f(x^k). $$ This is exactly the Newton's method iteration for minimizing $f$ with respect to $x$, so the iterates in the two coordinate systems correspond under $y = Ax$.
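The algebra above can also be checked numerically. The sketch below (my own illustration; the quartic test function and the matrix $A$ are assumptions, not from the post) runs the Newton iteration on $f$ and on $g(y) = f(A^{-1}y)$ and confirms that the iterates correspond under $x = A^{-1}y$:

```python
import numpy as np

# Record Newton iterates so we can compare them across coordinate systems.
def newton_iterates(grad, hess, x0, n_steps=6):
    xs = [x0.astype(float)]
    for _ in range(n_steps):
        x = xs[-1]
        xs.append(x - np.linalg.solve(hess(x), grad(x)))
    return xs

# Assumed test function: f(x) = sum(x_i^4) + 0.5 ||x||^2 (Hessian always SPD).
f_grad = lambda x: 4 * x**3 + x
f_hess = lambda x: np.diag(12 * x**2 + 1)

A = np.array([[2.0, 1.0],
              [0.0, 1.0]])              # any invertible change of basis
Ainv = np.linalg.inv(A)

# g(y) = f(A^{-1} y); its gradient and Hessian follow from the chain rule:
# grad g(y) = A^{-T} grad f(A^{-1} y),  Hg(y) = A^{-T} Hf(A^{-1} y) A^{-1}.
g_grad = lambda y: Ainv.T @ f_grad(Ainv @ y)
g_hess = lambda y: Ainv.T @ f_hess(Ainv @ y) @ Ainv

x0 = np.array([5.0, -3.0])
xs = newton_iterates(f_grad, f_hess, x0)
ys = newton_iterates(g_grad, g_hess, A @ x0)

# Every y-iterate, mapped back through A^{-1}, matches the x-iterate.
assert all(np.allclose(Ainv @ y, x) for x, y in zip(xs, ys))
```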

For optimization algorithms such as gradient descent that do not have this invariance, it is often important or helpful to select a matrix $A$ so that the new function $g(y) = f(A^{-1}y)$ is better conditioned than $f$. Concretely, the level curves of $g$ should look like circles rather than highly elongated ellipses; that makes gradient descent converge much faster. For Newton's method, by contrast, no such preconditioning is needed, since the iterates are unaffected by the change of basis.
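To make the contrast concrete, here is a rough sketch (my own example; the specific matrices are assumptions) comparing gradient descent on an ill-conditioned quadratic with the same problem after the rescaling $y = Q^{1/2}x$, which makes the level sets circular:

```python
import numpy as np

# Count gradient-descent steps on f(x) = 0.5 x^T Q x - b^T x until the
# gradient norm drops below tol.
def gd_steps(Q, b, x0, lr, tol=1e-8, max_iter=100000):
    x = x0.astype(float)
    for k in range(max_iter):
        g = Q @ x - b
        if np.linalg.norm(g) < tol:
            return k
        x = x - lr * g
    return max_iter

Q = np.diag([100.0, 1.0])        # condition number 100: elongated ellipses
b = np.array([1.0, 1.0])

# Ill-conditioned: the stable step size is limited by Q's largest eigenvalue,
# so progress along the flat direction is slow.
n_bad = gd_steps(Q, b, np.zeros(2), lr=1.0 / 100)

# Change of variables y = A x with A = Q^{1/2}: then g(y) = f(A^{-1} y)
# has Hessian A^{-T} Q A^{-1} = I, i.e. perfectly circular level sets,
# and its gradient is y - A^{-1} b.
A = np.diag(np.sqrt(np.diag(Q)))
n_good = gd_steps(np.eye(2), np.linalg.inv(A) @ b, np.zeros(2), lr=1.0)

# n_good is far smaller than n_bad.
```

This rescaling is exactly the preconditioning step that Newton's method makes unnecessary: the Hessian solve performs the equivalent correction at every iterate automatically.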