Gradient and Hessian for linear change of coordinates

1.7k Views Asked by At

I am studying the affine invariance of the Newton step for unconstrained optimization and I came across with the following:

Suppose $\textbf{T}\in \Re ^{n\times n}$ is nonsingular and define $\tilde{f}(\textbf{y})=f(\textbf{T}\textbf{y})$. Then we have the following statements: $$\nabla \tilde{f}(\textbf{y})=\textbf{T}^T\nabla f(\textbf{x})$$ and $$\nabla ^2 \tilde{f}(\textbf{y})=\textbf{T}^T\nabla ^2 f(\textbf{x})\textbf{T}$$ where $\textbf{x}=\textbf{Ty}$

I don't understand why the above are correct. Is there any property I have to apply?

2

There are 2 best solutions below

0
On BEST ANSWER

It helps to look at the Taylor expansions of $f$ and $\tilde{f}$:

$$f(z+\Delta z) \approx f(z) + \langle \nabla f(z), \Delta z \rangle + \tfrac{1}{2}\langle \nabla^2 f(z) \Delta z, \Delta z \rangle$$ $$\tilde{f}(y+\Delta y) \approx \tilde{f}(y) + \langle \nabla \tilde{f}(y), \Delta y \rangle + \tfrac{1}{2}\langle \nabla^2 \tilde{f}(y) \Delta y, \Delta y \rangle$$ Substitute $z=T y$ into the first of these:

$$\begin{aligned}f(T(y+\Delta y)) &\approx f(Ty) + \langle \nabla f(Ty), T \Delta y \rangle + \tfrac{1}{2}\langle \nabla^2 f(Ty) T \Delta y, T \Delta y \rangle\\ &= f(Ty) + \langle T^T \nabla f(Ty), \Delta y \rangle + \tfrac{1}{2}\langle T^T \nabla^2 f(Ty) T \Delta y, \Delta y \rangle \end{aligned}$$ The second line utilizes a known property of inner products and linear operators: that $\langle x, A y\rangle = \langle A^T x, y \rangle$.

From here we can match up terms and see that $$\tilde{f}(y) = f(Ty), \quad \nabla \tilde{f}(y) = T^T \nabla f(Ty), \quad \nabla^2 \tilde{f}(y) = T^T \nabla^2 f(Ty) T$$

Notice that we did not rely on $T$ being nonsingular. That's because it doesn't have to be; in fact, it doesn't even have to be square. But note also that this is not a rigorous proof, it's just something to help you see it more intuitively.

0
On

Background and notation: First we need some background on taking derivatives in a multivariable setting, using matrix notation.

Let $h:\mathbb R^n \to \mathbb R^m$ be differentiable at $x$. Then $h'(x)$ is an $m \times n$ matrix. (Intuitively, $h(x+\Delta x) \approx h(x) + h'(x) \Delta x$, when $\Delta x$ is a small $n \times 1$ column vector.)

Suppose $g:\mathbb R^m \to \mathbb R^n$ is differentiable at $x$, and $f:\mathbb R^n \to \mathbb R^p$ is differentiable at $g(x)$, and $h = f \circ g$. The multivariable chain rules tells us that $h'(x) = f'(g(x)) g'(x)$.

If $f(x) = Mx$, where $M$ is a matrix, then $f'(x) = M$.

If $f(x) = M g(x)$, where $M$ is a matrix and $g$ is differentiable, then $f'(x) = M g'(x)$.

Suppose $f:\mathbb R^n \to \mathbb R$ is differentiable at $x$. Note that $f'(x)$ is a $1 \times n$ row vector, and $\nabla f(x) = f'(x)^T$ is an $n \times 1 $ column vector.

Suppose $f: \mathbb R^n \to \mathbb R$ is differentiable and let $h(x) = \nabla f(x)$ for all $x$. Then $h'(x)$ is the Hessian of $f$ at $x$. I'll denote the Hessian of $f$ at $x$ by $\nabla^2 f(x)$.

Answer to question: The formulas you asked about follow from the multivariable chain rule. Since $\tilde f(y) = f(Ty)$, the multivariable chain rule tells us that $\tilde f'(y) = f'(Ty) T $. Now take the transpose of both sides to get your formula for $\nabla \tilde f(y) $:

\begin{equation} \nabla \tilde f(y) = \tilde f'(y)^T = T^T \nabla f(Ty). \end{equation}

Now let $h(y) = \nabla \tilde f(y) = T^T \nabla f(Ty)$. The Hessian of $\tilde f$ at $y$ is \begin{equation} h'(y) = T^T \nabla^2 f(Ty) T. \end{equation}