Consider vectors $x$ and matrix $A$, the vector $z = Ax$ and the scalar $E = z^Tz$. I want to compute the derivative $\frac{dE}{dA}$.
This is not too troublesome if we look at just one component of $E$:
\begin{align} \frac{\partial E}{\partial A_{ij}} &= \frac{\partial E}{\partial z_i} \frac{\partial z_i}{\partial A_{ij}} \\ &= 2z_i \frac{A_i^T x}{\partial A_{ij}} \qquad \text{(where $A_i$ is the $i$th row of $A$)} \\ &= 2z_i x_j \end{align}
Therefore $\frac{dE}{dA} = 2zx^T$.
How can we derive this result without using the technique of looking at a single entry? This plays out in the following way:
\begin{align} \frac{\partial E}{\partial A} &= \frac{\partial E}{\partial z} \frac{\partial z}{\partial A} \end{align} However, here is where we get stuck, because $\frac{\partial z}{\partial A}$ is a three dimensional tensor. In vectorized form (where the function $\text{vec}$ stacks the columns of a matrix into a big vector), we can find $\frac{\partial z}{\partial \text{vec} (A)}$:
\begin{align} dz &= (dA)x \\ &= \text{vec} (dA x) \\ &= \text{vec} (I dA x) \\ &= x^T \otimes I d(\text{vec} A) \qquad \text{($\otimes$ is the kronecker product)}\\ \frac{\partial z}{\partial \text{vec} (A)} &= x^T \otimes I \end{align}
And then the derivative can be computed: \begin{align} \frac{\partial E}{\partial z} \frac{\partial z}{\partial \text{vec} A} &= 2z^T (x^T \otimes I) \\ \frac{\partial E}{\partial z} \frac{\partial z}{\partial A} &= 2 \text{unvec}(z^T (x^T \otimes I)) \end{align}
However, this is problematic because $(x^T \otimes I)$ is a large matrix which is harder to deal with computationally, in addition to being very inelegant compared to the alternative answer.
So my question is, why does the derivation without using indices seem so inelegant compared to the derivation which does, and even gives a different answer? Moreover, is there a way to obtain such an answer without having to resort to looking at a single element?
To add on to that, here is another mystery: In both derivations, I used the numerator layout convention for derivatives, (otherwise in denominator layout the chain rule would be $\frac{\partial E}{\partial A} = \frac{\partial z}{\partial A} \frac{\partial E}{\partial z}$). However, when we force the first solution into a vector format at the end, we see that $\frac{\partial E}{\partial A}$ is the column vector $z$ -- where as according to numerator layout, it should be a row vector. What is the cause of this weirdness?
I'm aware that if we expand $z$ first and attempt to compute
$$\frac{dE}{dA} = \frac{d}{dA} (x^T A^T A x)$$
then it is possible to arrive at an answer like $2zx^T$. However, I am looking for a solution which does not do this expansion of $z$, and instead uses only the chain rule. The reason this is important is because an implementation of automatic differentiation (aka backpropagation) cannot perform this clever reasoning -- it only knows how to use the chain rule and how to take derivatives of each primitive function such as matrix multiplication ($z = Ax$).
By using differentials, instead of the chain rule, you can avoid dealing with the 3rd order tensor.
It will also be convenient to express the objective function using the trace/Frobenius product, i.e. $$\,A:BC={\rm tr}(A^TBC)$$ Like the Hadamard product, the matrices on each side of a Frobenius product must have the same shape. Which addresses your question about the layout convention that is used for the gradient.
Now we can jot down the function, differential, and gradient $$\eqalign{ E &= z:z \cr dE &= 2z:dz = 2z:dA\,x = 2zx^T:dA \cr \frac{\partial E}{\partial A} &= 2zx^T \cr }$$