Let $x_{1}, \dots, x_{N}$ be a sequence of vectors in $\mathbb{R}^{n}$ and $A$ be an $n \times n$ matrix. Let $f: \mathbb{R} \to \mathbb{R}$ be a smooth (as smooth as you want) function. We define $$ h_{i} = f(Ax_{i}) $$ for $i = 1, \dots, N$, where $f$ is applied to the vector $Ax_{i}$ element-wise.
Suppose $G$ is a scalar function of the vectors $h_{1}, \dots, h_{N}$ (i.e. it maps to $\mathbb{R}$), and we want to find $\nabla_{A} G$, i.e. the gradient of $G$ with respect to the matrix $A$, so an $n \times n$ matrix with the $(i, j)$ entry being $\frac{\partial G}{\partial A_{ij}}$. Suppose we already know all the gradients $\nabla_{h_{i}} G$, and want to use the chain rule along with these to find $\nabla_{A} G$. How would we go about doing that?
I know that the chain rule says that for generic vectors $x, y$ and a function $F$ mapping to $\mathbb{R}$, $$ \nabla_{x} F = \left(\frac{dy}{dx}\right)^{T} \nabla_{y} F $$ where $\frac{dy}{dx}$ is the Jacobian of $y$ w.r.t. $x$. But how do I apply this to the setup above?
I suggest using tensor contraction, as mentioned in Chain rule with differentiation by vectors and matrices?
For example, let $X=\nabla_x F$, $Y=\nabla_y F$, and $J=\frac{dy}{dx}$. Then $X_i=Y_j J^j_i,$ where $a_ib^i$ means $\sum_i a_ib^i$ by the Einstein summation convention. So if you write the gradient as a row vector, we have $\nabla_x F=\nabla_y F\left(\frac{dy}{dx}\right),$ with no transpose needed.
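A quick numerical sanity check of the row-vector convention, using a toy map $y=(x_0 x_1,\; x_1+x_2)$ and $F(y)=y_0^2+3y_1$ (both choices are mine, just for illustration): the product $\nabla_y F \cdot J$ should match finite differences of $F\circ y$ with no transpose.

```python
import numpy as np

# Toy example: y = (x0*x1, x1 + x2), F(y) = y0^2 + 3*y1
def y_of_x(x):
    return np.array([x[0] * x[1], x[1] + x[2]])

def F(x):
    y = y_of_x(x)
    return y[0] ** 2 + 3.0 * y[1]

x = np.array([1.0, 2.0, -0.5])
y = y_of_x(x)

# Jacobian J = dy/dx, rows are dy_j/dx
J = np.array([[x[1], x[0], 0.0],
              [0.0,  1.0,  1.0]])
grad_y = np.array([2.0 * y[0], 3.0])   # ∇_y F as a row vector

grad_x = grad_y @ J                    # row-vector chain rule, no transpose

# Central finite differences of F ∘ y
eps = 1e-6
num = np.array([(F(x + eps * e) - F(x - eps * e)) / (2 * eps)
                for e in np.eye(3)])
print(np.max(np.abs(grad_x - num)))
```

The same check with column-vector gradients would need `J.T @ grad_y`, which is exactly the transpose in the question's statement of the chain rule.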
Back to the question. Formally we have $\nabla_A G=\nabla_{h} G\frac{\partial h}{\partial Ax}\frac{\partial Ax}{\partial A}$ by the chain rule, where $h=(h_1,\cdots,h_N)$ and $x=(x_1,\cdots,x_N)$ are both $n\times N$ matrices (the vectors stacked as columns). Write it as $P=QHX.$
Then $P_{ij}=Q_{kl}H^{kl}_{rt}X^{rt}_{ij},$ $P_{ij}=\frac{\partial G}{\partial {A_{ij}}},$ $Q_{kl}=\frac{\partial G}{\partial h_{kl}}=(\nabla_{h_l}G)_k,$ $H^{kl}_{rt}=\frac{\partial h_{kl}}{\partial (Ax)_{rt}}=\frac{\partial f((Ax_l)_k)}{\partial (Ax_t)_r}=\delta_{r}^k\delta_t^l f'((Ax_l)_k),$ where $\delta_i^j=\begin{cases}1,&i=j\\0,&i\neq j\end{cases}$ is the Kronecker delta, and $X^{rt}_{ij}=\frac{\partial (Ax_t)_r}{\partial A_{ij}}=\delta^r_i (x_t)_j$.
So you can write $\nabla_A G=\nabla_{h} G\frac{\partial h}{\partial Ax}\frac{\partial Ax}{\partial A}$ compactly, with the specific contractions given above; notice that these factors are tensors, not ordinary matrices. Carrying out the contraction (the Kronecker deltas kill the sums over $r$ and $t$) gives the explicit form $$\nabla_A G=\sum_{l=1}^{N}\bigl(f'(Ax_l)\odot\nabla_{h_l}G\bigr)\,x_l^{T},$$ where $\odot$ denotes the element-wise product.
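A minimal NumPy sketch checking the contracted gradient $\nabla_A G=\sum_l\bigl(f'(Ax_l)\odot\nabla_{h_l}G\bigr)x_l^T$ against finite differences. The choices $f=\tanh$ and $G=\sum_l\|h_l\|^2$ (so $\nabla_{h_l}G=2h_l$) are my own, purely for illustration; the sum over $l$ becomes a single matrix product when the $x_l$ are stacked as columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 7
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, N))         # columns are x_1, ..., x_N

f = np.tanh
fprime = lambda z: 1.0 - np.tanh(z) ** 2

# Illustrative scalar function: G = sum_l ||h_l||^2, so grad_{h_l} G = 2 h_l
def G(A):
    return np.sum(f(A @ X) ** 2)

H = f(A @ X)                            # column l is h_l
grad_h = 2.0 * H                        # column l is grad_{h_l} G

# grad_A G = sum_l (f'(A x_l) ⊙ grad_{h_l} G) x_l^T, as one matrix product
grad_A = (fprime(A @ X) * grad_h) @ X.T

# Central finite-difference check, entry by entry
eps = 1e-6
num = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        num[i, j] = (G(A + E) - G(A - E)) / (2 * eps)

print(np.max(np.abs(grad_A - num)))     # should be tiny
```

Swapping in a different $G$ only changes the `grad_h` columns, which is exactly the point of the question: the $\nabla_{h_l}G$ are assumed known, and everything else is fixed by $f$ and the $x_l$.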