Why is affine invariance a good property for Newton's descent methods?

203 Views Asked by At

I know about affine invariance property for Newton's descent method, but i don't have any idea about it's importance. Can you tell me about it?

Here is the definition of affine invariance of Newton's method:

Assume $f : \Bbb{R}^n \to \Bbb{R}$ is twice differentiable and $A \in \Bbb{R}^{n \times n}$ is nonsingular. Let $g(y) := f(Ay)$. Then, Newton step for $g$ at the point $y$ is given by $$y^+ = y - \left(\nabla^2 g(y)\right)^{-1}\nabla g(y)$$ For the affine transformation $x = Ay$, it turns out that the Newton step for $f$ at the point $x$ is $X^+ = Ay^+$. This means that the progress of Newton's method is independent of linear scaling. This property is not true for gradient descent.

Thanks!

1

There are 1 best solutions below

3
On BEST ANSWER

To fix the ideas let $\Omega$ be an open set of $\mathbb{R}^{n}$. Let $f:\Omega \to \mathbb{R}$ be a continuously differentiable function. Set $p\in \Omega$.

Newton's descent method for finding a critical point of function $\Omega \ni x \mapsto f(x) \in\mathbb{R}$ is a method that generates a sequence $x_1, x_2, x_3,\ldots, x_{n-1}, x_{n},\ldots $ from a starting point $x_{0}=p$, using the following rule $$ x_{n}=x_{n-1} -\Big((\nabla (\nabla f))(x_{n-1})\Big)^{-1}\cdot \nabla f(x_{n-1}) \qquad \qquad (1) $$ That is, Newton's descent method is nothing more than Newton's method for finding the root of $(\nabla f)(x)=0$. If we replace the function $\Omega \ni x \mapsto \nabla f(x) \in\mathbb{R}^n$ by composition $\Omega \ni x \mapsto (A\circ \nabla f)(x)= A\cdot \nabla f( x ) \in \mathbb{R}^n$ in $(1)$ the Newton's method would, in principle, generate another sequence $y_1,y_2,y_3,\ldots, y_{n-1},y_{n},\ldots $ from a starting point $y_{0}=p$ by the following recursive procedure $$ y_{n}=y_{n-1} -\Big(\nabla (A \circ \nabla f)(y_{n-1})\Big)^{-1}\cdot \nabla (A\circ f)(y_{n-1}) \qquad \qquad (2) $$

Note that if $A$ is non-singular then both $A\cdot (\nabla f)$ and $\nabla f$ have the same root.

DEFINITION: A method that generates a sequence $x_1,x_2,x_3,\ldots, x_{n-1},x_{n} \ldots $ to find a critical point of a function $f$ ( or a zero of $\nabla f(x)=0$)is invariant by an affine application $\mathbb{R}^{n} \ni x \mapsto Ax\in \mathbb{R}^{n}$ if the method to find a root of $A \cdot (\nabla f)(x)=0$ generates a sequence $y_1, y_2,y_3,\ldots, y_{n-1},y_{n},\ldots $ that is identical to the sequence $x_1,x_2,x_3,\ldots, x_{n-1},x_{n},\ldots $ for finding a root function $A\nabla f(x)=0$.

Here the symbol $\cdot$ means product of matrices. And the composition of two matrices $M$ and $N$ has the following property $M\circ N=M\cdot N$. Remember that $$ \begin{array}{rclclr} (\nabla A)(x) &=& A&& & \qquad (3.a) \\ \nabla (A\circ f)(x)&=&(\nabla A)(f(x))\cdot (\nabla f)(x) &=& A\cdot (\nabla f)(x) & \qquad (3.b) \\ \nabla^{2}(A\circ f)(x)&=&\nabla\Big(\nabla (A\circ f) \Big)(x) &&& \qquad \\ &=&\nabla\Big(A\cdot (\nabla f) \Big)(x) &&& \qquad \\ &=&\nabla\Big(A\circ (\nabla f) \Big)(x) &&& \qquad \\ &=&\nabla A(\nabla f(x)) \cdot \big( \nabla (\nabla f)\big)(x) &&& \qquad \\ &=&A \cdot (\nabla^2 f)(x) &&& \qquad (3.c) \\ \Big( A \cdot (\nabla^2 f)(x) \Big)^{-1}&=&\Big( (\nabla^2 f)(x)\Big)^{-1}A^{-1} &&& \qquad (3.d) \end{array} $$ By $(3.a), (3.b), (3.c), (3.d)$ we have \begin{align} y_{1}= y_{0}-\nabla^{2}(A\circ f)(y_{0})\cdot (A\circ f)(y_{0}) =& \; y_{0}-\Big( (\nabla^2 f)(y_{0})\Big)^{-1}A^{-1}\cdot A\cdot f(y_{0}) \\ =& \; y_{0}-\Big( (\nabla^2 f)(y_{0})\Big)^{-1}f(y_{0}) \\ =& \; x_{0}-\Big( (\nabla^2 f)(x_{0})\Big)^{-1}f(x_{0}) \\ =&\;x_{1} \end{align} In general, for all $n\in \mathbb{N}$, if $y_{n-1}=x_{n-1}$, we have \begin{align} y_{n}= y_{n-1}-\nabla^{2}(A\circ f)(y_{n-1})\cdot (A\circ f)(y_{n-1}) =& \; y_{n-1}-\Big( (\nabla^2 f)(y_{n-1})\Big)^{-1}A^{-1}\cdot A\cdot f(y_{n-1}) \\ =& \; y_{n-1}-\Big( (\nabla^2 f)(y_{n-1})\Big)^{-1}f(y_{n-1}) \\ =& \; x_{n-1}-\Big( (\nabla^2 f)(x_{n-1})\Big)^{-1}f(x_{n-1}) \\ =&\;x_{n} \end{align}

But why is this invariance important? Because it simplifies the proof of the convergence theorems for the sequence $$x_1, x_2, x_3,\ldots, x_{n-1}, x_{n},\ldots $$ generated above by Newton's method. For example, several versions of Kantorovich's theorem attesting to the convergence of the sequence generated by Newton's method is simplified by affine invariance. See for example this divulgation paper, in this web page, about the Newton-Kantorovich theorem. For a classic paper on affine invariance propertie and theorems for convergence of Newton's method you can also see this.