From Convex Optimization by Boyd and Vandenberghe:
Let $T \in \Bbb R^{n \times n}$ be nonsingular. Let $f: \Bbb R^n \rightarrow \Bbb R$ be convex and twice continuously differentiable.
Define $\bar f(y) = f(Ty)$.
Then $\nabla \bar f(y) = T^T \nabla f(Ty)$, where $T^T$ denotes the transpose of $T$.
I must be missing something obvious about gradients, but can someone explain how the last part is true? My attempt:
$\nabla \bar f(y) = (\nabla f(Ty))(\nabla(Ty)) = T^T\nabla f(Ty)$.
How does $(\nabla f(Ty))(\nabla(Ty)) = T^T\nabla f(Ty)$?
This is really the chain rule, which states $d(\varphi \circ \psi)_y = d \varphi_{\psi(y)} \circ d\psi_y$. So in your case, we have \begin{align} \bar{f} = f \circ T. \end{align} Hence, \begin{align} d\bar{f}_{y} = df_{T(y)} \circ dT_y. \end{align}

By assumption, $T$ is a matrix (or you can think of it as a linear transformation), hence $dT_y = T$. This yields \begin{align} d\bar{f}_{y} = df_{T(y)} \circ T. \end{align}

This is an equation concerning linear transformations. But if we write it as a matrix equation, it becomes \begin{align} \nabla\bar{f}(y) = \nabla f(T(y)) \cdot T, \end{align} because composition of linear maps corresponds to the product of the respective matrix representations.
However, the authors are using a different convention for the gradient: they treat it as an $n \times 1$ column vector, rather than the $1 \times n$ row vector I have used above. This is purely a matter of convention, so taking the transpose of the equation above yields the equation you have: \begin{align} \nabla\bar{f}(y) = T^T \cdot \nabla f(T(y)). \end{align}
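If it helps to see the identity hold concretely, here is a small numerical sanity check. It picks an arbitrary convex $f$ (the choice $f(x) = \|x\|^2 + \sum_i e^{x_i}$ is just an illustrative assumption, not from the book), compares a finite-difference gradient of $\bar f$ against $T^T \nabla f(Ty)$, and the two agree up to finite-difference error:

```python
import numpy as np

# Check grad f_bar(y) = T^T grad f(Ty) numerically, for an arbitrary
# convex test function f (a hypothetical choice for illustration):
#   f(x) = ||x||^2 + sum(exp(x))
rng = np.random.default_rng(0)
n = 4
T = rng.normal(size=(n, n))       # a generic (almost surely nonsingular) matrix

def f(x):
    return x @ x + np.sum(np.exp(x))

def grad_f(x):
    return 2 * x + np.exp(x)      # analytic gradient of f

def f_bar(y):
    return f(T @ y)

def numeric_grad(g, y, h=1e-6):
    """Central-difference approximation of the gradient of g at y."""
    e = np.eye(len(y))
    return np.array([(g(y + h * e[i]) - g(y - h * e[i])) / (2 * h)
                     for i in range(len(y))])

y = rng.normal(size=n)
lhs = numeric_grad(f_bar, y)      # grad f_bar(y), computed numerically
rhs = T.T @ grad_f(T @ y)         # T^T grad f(Ty), computed analytically
print(np.max(np.abs(lhs - rhs)))  # small, limited by finite-difference error
```

Swapping in any other smooth $f$ and any nonsingular $T$ gives the same agreement, since the identity is just the chain rule.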
Alternatively, you can take the long approach of verifying that the components of both sides agree; i.e. compute the $i^{th}$ partial derivative at $y$, $\partial_i \bar{f}(y)$, using the chain rule. Regardless of whether you do the component-by-component method or any other method, you have to make use of the chain rule somewhere.
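For completeness, the component-by-component computation looks like this: \begin{align} \partial_i \bar{f}(y) &= \partial_i \, f(Ty) \\ &= \sum_{j=1}^{n} \partial_j f(Ty) \, \frac{\partial (Ty)_j}{\partial y_i} \\ &= \sum_{j=1}^{n} \partial_j f(Ty) \, T_{ji} \\ &= \big(T^T \nabla f(Ty)\big)_i, \end{align} where the second line is the multivariable chain rule and the third line uses $(Ty)_j = \sum_i T_{ji} y_i$. Since this holds for every $i$, the two vectors are equal.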