I have trouble understanding why the following claim is true:
Let $h(x) = g(Ax+b)$, $A$ some square matrix, then $\nabla h(x) = A^T \nabla g(Ax+b)$
Attempt:
Let $F(x) = Ax+b$, then $$\nabla h(x) = \begin{bmatrix} \dfrac{\partial h(x)}{\partial x_1} \\ \vdots \\ \dfrac{\partial h(x)}{\partial x_n} \end{bmatrix}$$
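Suppose $a_k$ is the $k$th column of $A$, so that $\dfrac{\partial F(x)}{\partial x_k} = a_k$.
Then, treating $\dfrac{\partial g(F(x))}{\partial F(x)}$ as a column vector, $$\dfrac{\partial h(x)}{\partial x_k} = \left(\dfrac{\partial F(x)}{\partial x_k}\right)^T\dfrac{\partial g(F(x))}{\partial F(x)} = a_k^T\,\dfrac{\partial g(F(x))}{\partial F(x)}$$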
Then collecting all components $$\nabla h(x) = \begin{bmatrix} \dfrac{\partial h(x)}{\partial x_1} \\ \vdots \\ \dfrac{\partial h(x)}{\partial x_n} \end{bmatrix} = A^T\dfrac{\partial g(F(x))}{\partial F(x)}$$
But is $\dfrac{\partial g(F(x))}{\partial F(x)}$ the gradient of $g$ evaluated at $F(x)$, i.e. $\nabla g(Ax+b)$, or not? How can I see that?
Consider the more general situation in which $x$ and $b$ are matrices and $A$ is rectangular. For convenience, define the variable $w = Ax+b$. Then we have $h(x)=g(w)$.
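I find it helpful to express the differentials in terms of the Frobenius product: $$\eqalign{ dh &= \frac{\partial h}{\partial x}:dx \cr }$$ and, because $b$ is constant so that $dw = A\,dx$, $$\eqalign{ dg &= \frac{\partial g}{\partial w}:dw \cr &= \frac{\partial g}{\partial w}:A\,dx \cr &= A^T\frac{\partial g}{\partial w}:dx \cr }$$ Since $h(x)=g(w)$, we must have $dh=dg$ for arbitrary $dx$, and it follows that $$\frac{\partial h}{\partial x}=A^T\frac{\partial g}{\partial w}$$ Here $\dfrac{\partial g}{\partial w}$ is the gradient of $g$ with respect to its own argument, evaluated at $w=Ax+b$; in the notation of the question this is $\nabla g(Ax+b)$.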
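As a sanity check, the result can be compared against finite differences. The following is only a minimal sketch in the vector case: the particular $g$ (and its hand-computed gradient `grad_g`), the sizes, and the helper `numerical_grad` are assumptions chosen for illustration, not part of the original problem.

```python
import numpy as np

def g(w):
    # Illustrative scalar function (an assumption): g(w) = sum(sin(w)) + 0.5*||w||^2
    return np.sum(np.sin(w)) + 0.5 * np.dot(w, w)

def grad_g(w):
    # Gradient of g with respect to its own argument w
    return np.cos(w) + w

def h(x, A, b):
    # h(x) = g(Ax + b)
    return g(A @ x + b)

def grad_h(x, A, b):
    # Claimed formula: grad h(x) = A^T grad g(Ax + b)
    return A.T @ grad_g(A @ x + b)

def numerical_grad(f, x, eps=1e-6):
    # Central finite differences, one coordinate at a time
    grad = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = eps
        grad[k] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # rectangular on purpose: maps R^4 to R^3
b = rng.standard_normal(3)
x = rng.standard_normal(4)

analytic = grad_h(x, A, b)
numeric = numerical_grad(lambda z: h(z, A, b), x)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

Note that `grad_g` is evaluated at `A @ x + b` and only then multiplied by `A.T`, which is exactly the distinction the question asks about.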
NB: Rules for rearranging a Frobenius product follow from its equivalence to the trace, e.g. ${\rm tr}(A^TBC)=A:BC$, and the cyclic property of the trace. For example $$A:BC = B^TA:C$$
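The rearrangement rule can also be checked numerically. This is a small sketch with randomly generated matrices; the helper name `frob` and the matrix shapes are assumptions made for the example.

```python
import numpy as np

def frob(X, Y):
    # Frobenius product X:Y = tr(X^T Y), i.e. the sum of elementwise products
    return np.sum(X * Y)

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))

lhs = frob(A, B @ C)                 # A : BC
rhs = frob(B.T @ A, C)               # B^T A : C
trace_form = np.trace(A.T @ B @ C)   # tr(A^T B C)

print(np.isclose(lhs, rhs), np.isclose(lhs, trace_form))
```

The shapes are chosen so that $BC$ has the same shape as $A$ and $B^TA$ has the same shape as $C$, which both Frobenius products require.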