If $\bar f(y) = f(Ty)$, how is $\nabla \bar f(y) = T^T \nabla f(Ty)$?


From Convex Optimization by Boyd and Vandenberghe:

Let $T \in \Bbb R^{n \times n}$ be nonsingular. Let $f: \Bbb R^n \rightarrow \Bbb R$ be convex and twice continuously differentiable.

Define $\bar f(y) = f(Ty)$.

Then $\nabla \bar f(y) = T^T \nabla f(Ty)$, where $T^T$ denotes the transpose of $T$.

I must be missing something obvious with gradients, but can someone explain how the last part is true?

$\nabla \bar f(y) = (\nabla f(Ty))(\nabla(Ty)) = T^T\nabla f(Ty)$.

How does $(\nabla f(Ty))(\nabla(Ty)) = T^T\nabla f(Ty)$?

There are 2 solutions below.

Best answer:

This is really the chain rule, which states $d(\varphi \circ \psi)_y = d \varphi_{\psi(y)} \circ d\psi_y$. So in your case, we have \begin{align} \bar{f} = f \circ T. \end{align} Hence, \begin{align} d\bar{f}_{y} = df_{T(y)} \circ dT_y. \end{align} By assumption, $T$ is a matrix (or you can think of it as a linear transformation), hence $dT_y = T$. This yields \begin{align} d\bar{f}_{y} = df_{T(y)} \circ T. \end{align} This is an equation concerning linear transformations. But if we write it as a matrix equation, it becomes \begin{align} \nabla\bar{f}(y) = \nabla f(T(y)) \cdot T, \end{align} because composition of linear maps becomes a product of the respective matrix representations.

However, the authors are using a different convention for the gradient: they treat it as an $n \times 1$ column vector, rather than the $1 \times n$ row vector I used above. This is purely a matter of convention, so transposing the equation above yields the one you have: \begin{align} \nabla\bar{f}(y) = T^T \cdot \nabla f(T(y)). \end{align}
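As a quick sanity check (my own example, not from the book), take the linear function $f(x) = c^T x$ for a fixed $c \in \Bbb R^n$, so that $\nabla f(x) = c$ everywhere. Then \begin{align} \bar f(y) = c^T T y = (T^T c)^T y \quad\Longrightarrow\quad \nabla \bar f(y) = T^T c = T^T \nabla f(Ty), \end{align} exactly as the formula predicts.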


Alternatively, you can take the long approach and verify that each component of both sides is the same; i.e. compute the $i^{th}$ partial derivative at $y$, $\partial_i \bar{f}(y)$, using the chain rule. Regardless of whether you do the component-by-component method or any other method, you have to use the chain rule somewhere.
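If you want a numerical sanity check of the identity, you can compare both sides with central finite differences (a sketch; the smooth test function `f` and the random matrix `T` below are my own arbitrary choices, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
T = rng.standard_normal((n, n))  # random matrix, almost surely nonsingular

def f(x):
    # arbitrary smooth test function
    return np.sin(x).sum() + 0.5 * x @ x

def fbar(y):
    # fbar(y) = f(Ty)
    return f(T @ y)

def num_grad(g, y, h=1e-6):
    # central-difference approximation of the gradient of g at y
    e = np.eye(len(y))
    return np.array([(g(y + h * e[i]) - g(y - h * e[i])) / (2 * h)
                     for i in range(len(y))])

y = rng.standard_normal(n)
lhs = num_grad(fbar, y)            # grad of fbar at y
rhs = T.T @ num_grad(f, T @ y)     # T^T grad f(Ty)
assert np.allclose(lhs, rhs, atol=1e-5)
```

The two sides agree to within the discretization error of the finite-difference scheme, which is what the chain-rule identity predicts.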

Another answer:

Let $z = Ty$, i.e. $z_i = T_{i,j}y_j$. Then $\partial_{y_k} z_i = T_{i,k}$.

$\partial_{y_k} f(Ty)=\partial_{z_i} f(z)\,\partial_{y_k} z_i=\partial_{z_i} f(z)\,T_{i,k}=T^T_{k,i}\,\partial_{z_i}f(z)=[T^T\nabla f(Ty)]_k$

Summation over repeated indices is implied. If the Hessian is needed, as the twice-differentiability assumption suggests, one can iterate this computation; then two $T$ matrices come out.
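Iterating the index computation above (a sketch with the same summation convention, writing $z = Ty$): \begin{align} \partial_{y_l}\partial_{y_k} \bar f(y) = \partial_{y_l}\!\left(T_{i,k}\,\partial_{z_i} f(z)\right) = T_{i,k}\,\partial_{z_j}\partial_{z_i} f(z)\,T_{j,l}, \end{align} which in matrix form is $\nabla^2 \bar f(y) = T^T\,\nabla^2 f(Ty)\,T$ — the two $T$ matrices mentioned above.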

NB: this transformation law is why the gradient is said to transform as a (covariant) tensor.