From what I could understand from some of the answers here and from a few PDFs on matrix differentiation, the general rule for differentiating a scalar function with respect to a matrix is:
Let $g(X)=U$.
$$\left[\frac{d}{d X}f(g(X))\right]_{ij}=\frac{\partial}{\partial X_{ij}}f(g(X)) = \sum_{k}\sum_l \frac{\partial}{\partial U_{kl}}f(U)\frac{\partial}{\partial X_{ij}}U_{kl}=Tr\left(\left(\frac{\partial}{\partial U}f(U)\right)^\intercal \frac{\partial}{\partial X_{ij}}U\right)$$
However, differential notation is more commonly used. The differential formula I've seen is: if $$df=Tr\left(A^\intercal dX \right)$$ then $$\frac{d}{d X}f(g(X))=A$$
How does one reconcile the two notations?
In the first case, you've simply written $$\eqalign{ \frac{\partial f}{\partial X} &= \frac{\partial f}{\partial U}:\frac{\partial U}{\partial X} \cr }$$ In the second case, you've stated the definition of the differential in terms of the gradient $$\eqalign{ df &= A:dX \cr &= \Big(\frac{\partial f}{\partial X}\Big):dX \cr &= \Big(\frac{\partial f}{\partial U}:\frac{\partial U}{\partial X}\Big):dX \cr }$$
I'm not sure what needs to be reconciled; the two cases are consistent with one another.
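To see the agreement concretely, here is a small worked sketch. The example is my own choice, not something from the question: take $U=g(X)=X^TX$ and $f(U)=\tfrac12\,U:U$, so that $\frac{\partial f}{\partial U}=U$ and $U$ is symmetric.

Component (chain rule) route:
$$\eqalign{
\frac{\partial U_{kl}}{\partial X_{ij}} &= \delta_{jk}X_{il} + X_{ik}\delta_{jl} \cr
\frac{\partial f}{\partial X_{ij}} &= \sum_k\sum_l U_{kl}\big(\delta_{jk}X_{il} + X_{ik}\delta_{jl}\big) \cr
&= (XU^T)_{ij} + (XU)_{ij} \cr
&= 2\,(XU)_{ij} \cr
}$$

Differential route:
$$\eqalign{
dU &= dX^TX + X^TdX \cr
df &= U:dU \cr
&= U:\big(dX^TX\big) + U:\big(X^TdX\big) \cr
&= \big(XU^T\big):dX + \big(XU\big):dX \cr
&= 2XU:dX \cr
}$$
Reading off the factor in front of $dX$ gives $A=2XU=2XX^TX$, which matches the component result entry by entry.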
Note, however, that $\frac{\partial U}{\partial X}$ is a fourth-order tensor, which is tricky to work with.
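If it helps, the consistency of the two routes can also be checked numerically. The following is a minimal NumPy sketch under assumptions of my own: the functions $g(X)=X^TX$ and $f(U)=\tfrac12\sum_{kl}U_{kl}^2$ are illustrative choices (not from the question), and the names `dUdX`, `grad_chain`, and `grad_diff` are hypothetical. It builds the fourth-order tensor explicitly, contracts it with $\frac{\partial f}{\partial U}$, and compares the result with the gradient $2XX^TX$ read off from the differential.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))  # arbitrary test matrix (assumption)

def g(X):
    # Illustrative inner function U = X^T X (assumption, not from the question)
    return X.T @ X

def f(U):
    # Illustrative scalar function f(U) = 0.5 * sum(U_kl^2), so df/dU = U
    return 0.5 * np.sum(U**2)

U = g(X)
dfdU = U  # gradient of f with respect to U

# Fourth-order tensor dUdX[k, l, i, j] = dU_kl / dX_ij
#   = delta_jk * X_il + X_ik * delta_jl
I = np.eye(X.shape[1])
dUdX = np.einsum('jk,il->klij', I, X) + np.einsum('ik,jl->klij', X, I)

# Chain rule route: sum over k, l of (df/dU_kl) * (dU_kl/dX_ij)
grad_chain = np.einsum('kl,klij->ij', dfdU, dUdX)

# Differential route: df = (2 X U) : dX, so the gradient is A = 2 X U
grad_diff = 2 * X @ U

# Independent check: central finite difference along a random direction dX
dX = rng.standard_normal(X.shape)
eps = 1e-6
fd = (f(g(X + eps * dX)) - f(g(X - eps * dX))) / (2 * eps)
directional = np.sum(grad_diff * dX)  # A : dX = Tr(A^T dX)

print(np.allclose(grad_chain, grad_diff))
print(np.isclose(fd, directional, rtol=1e-6))
```

The explicit tensor has shape `(3, 3, 4, 3)` here, which illustrates why the differential route is usually preferred: it reaches the same gradient without ever forming that object.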
*[Instead of the functional notation ${\,\rm Tr}\big(A^T\,dX\big)\,$ I've used the product notation $\big(A:dX\big)$.]*