I have one function $f: R^{m \times m} \to R$ and another function $g: R \to R^{m \times m}$. I would like to calculate the derivative $\frac{\partial h}{\partial x}$ of the composite function $h(x) = f(g(x))$. The expression I find is really messy and hard to manipulate. I wonder if there is a more 'elegant', algebra-friendly way to do this.
The problem is that $\frac{\partial f}{\partial g}$ is an $(m \times m) \times 1$ tensor and $\frac{\partial g}{\partial x}$ is a $1 \times (m \times m)$ tensor, so naively treating them as $m \times m$ matrices and applying the chain rule with matrix multiplication fails miserably.
$$ \frac{\partial h}{\partial x} \ne \frac{\partial f}{\partial g} \frac{\partial g}{\partial x}, $$ at least not with 2D matrix multiplication.
Indeed, let $$ \frac{\partial f(y)}{\partial y} = \begin{pmatrix} \frac{\partial f}{\partial y_{11}} & \dots & \frac{\partial f}{\partial y_{1m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial y_{m1}} & \dots & \frac{\partial f}{\partial y_{mm}} \end{pmatrix} $$ and
$$ \frac{\partial g(x)}{\partial x} = \begin{pmatrix} \frac{\partial g_{11}}{\partial x} & \dots & \frac{\partial g_{1m}}{\partial x} \\ \vdots & \ddots & \vdots \\ \frac{\partial g_{m1}}{\partial x} & \dots & \frac{\partial g_{mm}}{\partial x} \end{pmatrix} $$
What we want is: $$ \frac{\partial h}{\partial x} = \sum_{i=1}^m \sum_{j=1}^m \frac{\partial f}{\partial y_{ij}} \frac{\partial g_{ij}}{\partial x} $$
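This double-sum chain rule can be verified numerically. The particular choices $f(Y) = \operatorname{tr}(Y^T Y)$ and $g(x) = xA$ below are mine, purely for illustration (they make $\frac{\partial f}{\partial y_{ij}}$ and $\frac{\partial g_{ij}}{\partial x}$ easy to write down):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
A = rng.standard_normal((m, m))

# Illustrative choices: f(Y) = tr(Y^T Y), g(x) = x*A,
# so df/dy_ij = 2*y_ij and dg_ij/dx = A_ij.
def f(Y): return np.trace(Y.T @ Y)
def g(x): return x * A

x0 = 0.7
df_dY = 2 * g(x0)          # m x m matrix of partials df/dy_ij
dg_dx = A                  # m x m matrix of partials dg_ij/dx

# Chain rule as the double sum over all m*m entries
dh_dx = np.sum(df_dY * dg_dx)

# Finite-difference check of h(x) = f(g(x))
eps = 1e-6
fd = (f(g(x0 + eps)) - f(g(x0 - eps))) / (2 * eps)
assert abs(dh_dx - fd) < 1e-5
```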
But it is really difficult for me to do any manipulation if I write things out like this. My question is whether I can write things out more 'analytically'.
Below is a real example, where I want to find $x$ such that $\frac{\partial h}{\partial x}=0$. With the relationship broken up into several equations and the matrix algebra taken apart entrywise, this is really hard.
Let $$ l(\Sigma^{-1})= - \frac{1}{2} y^T \Sigma^{-1} y - \frac{1}{2} \ln |\det \Sigma|, $$ where $\Sigma$ is an $m \times m$ positive definite symmetric matrix and $y$ is an $m \times 1$ vector. (This is the log-likelihood of a multivariate Gaussian distribution up to a constant, but that is not important here.)
$l(\Sigma^{-1})$ can be thought of as an $R^{m \times m} \to R$ function.
Let $$ \Sigma(x) = K + x \mathbf{I}, $$ where $K$ is an $m \times m$ positive definite symmetric matrix and $\mathbf{I}$ is the $m \times m$ identity matrix.
$\Sigma(x)$ can be thought of as an $R \to R^{m \times m}$ function.
It is easy to show that: $$ \frac{\partial l}{\partial \Sigma^{-1}} = - \frac{1}{2} y y^T + \frac{1}{2} \Sigma $$
and $$ \frac{\partial \Sigma^{-1}}{\partial x}= -\Sigma^{-2} $$
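As a sanity check, the second derivative claim is easy to confirm with finite differences (the specific $K$ below is a random positive definite matrix of my choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4
B = rng.standard_normal((m, m))
K = B @ B.T + m * np.eye(m)   # arbitrary positive definite symmetric K
x0 = 0.5

Sigma_inv = np.linalg.inv(K + x0 * np.eye(m))
analytic = -Sigma_inv @ Sigma_inv          # -Sigma^{-2}

# Central finite difference of x -> Sigma(x)^{-1}
eps = 1e-6
fd = (np.linalg.inv(K + (x0 + eps) * np.eye(m))
      - np.linalg.inv(K + (x0 - eps) * np.eye(m))) / (2 * eps)
assert np.allclose(analytic, fd, atol=1e-5)
```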
To find $\frac{\partial l}{\partial x}$, we can do the following:
Let $\gamma_{ij}$ be the entry in the $i$th row and $j$th column of $\Sigma^{-1}$.
It is really: $$ \begin{pmatrix} \frac{\partial l}{\partial \gamma_{11}} & \dots & \frac{\partial l}{\partial \gamma_{1m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial l}{\partial \gamma_{m1}} & \dots & \frac{\partial l}{\partial \gamma_{mm}} \end{pmatrix} = \frac{\partial l}{\partial \Sigma^{-1}} = - \frac{1}{2} y y^T + \frac{1}{2} \Sigma $$ and $$ \begin{pmatrix} \frac{\partial \gamma_{11}}{\partial x} & \dots & \frac{\partial \gamma_{1m}}{\partial x} \\ \vdots & \ddots & \vdots \\ \frac{\partial \gamma_{m1}}{\partial x} & \dots & \frac{\partial \gamma_{mm}}{\partial x} \end{pmatrix} = \frac{\partial \Sigma^{-1}}{\partial x}= -\Sigma^{-2} $$
What we are looking for is really: $$ \frac{\partial l}{\partial x} = \sum_{i=1}^m \sum_{j=1}^m \frac{\partial l}{\partial \gamma_{ij}} \frac{\partial \gamma_{ij}}{\partial x} $$
It takes me three equations to express $\frac{\partial l}{\partial x}$, and it is extremely messy. My problem is that there seems to be no way to write it out more 'analytically' so that I can solve $$ \frac{\partial l}{\partial x} = 0 $$ for $x$ analytically.
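For what it's worth, the three-equation recipe does give the right number; here is a finite-difference check of the double sum (the specific $K$ and $y$ are random, of my choosing):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4
B = rng.standard_normal((m, m))
K = B @ B.T + m * np.eye(m)      # arbitrary positive definite symmetric K
y = rng.standard_normal((m, 1))
x0 = 0.3

def l(x):
    S = K + x * np.eye(m)
    return float(-0.5 * y.T @ np.linalg.inv(S) @ y
                 - 0.5 * np.log(np.linalg.det(S)))

S = K + x0 * np.eye(m)
S_inv = np.linalg.inv(S)
dl_dSinv = -0.5 * (y @ y.T) + 0.5 * S    # matrix of dl/dgamma_ij
dSinv_dx = -S_inv @ S_inv                # matrix of dgamma_ij/dx

# The double sum over i, j is an elementwise product, then a total sum
dl_dx = np.sum(dl_dSinv * dSinv_dx)

eps = 1e-6
fd = (l(x0 + eps) - l(x0 - eps)) / (2 * eps)
assert abs(dl_dx - fd) < 1e-5
```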
Using differentials is less error-prone (for me) than the chain rule because (algebraically) differentials act like normal vectors and matrices.
Applying the differential technique to your example $\big($for ease of typing let $M=\Sigma\big)$ $$\eqalign{ 2l &= -y^TM^{-1}y - \log\det(M) \cr 2dl &= -y^T\,dM^{-1}\,y - M^{-1}:dM \cr &= +y^T\big(M^{-1}\,dM\,M^{-1}\big)y - M^{-1}:dM \cr &= \Big(M^{-1}yy^TM^{-1} - M^{-1}\Big):dM \cr &= \Big(M^{-1}yy^TM^{-1} - M^{-1}\Big):I\,dx \cr &= {\rm Tr}\Big(M^{-1}yy^TM^{-1} - M^{-1}\Big)\,dx \cr \frac{dl}{dx} &= \tfrac{1}{2}{\rm Tr}\Big(M^{-1}yy^TM^{-1} - M^{-1}\Big) \cr &= \tfrac{1}{2}\Big(y^TM^{-2}y-{\rm Tr}(M^{-1})\Big) \cr }$$ where a colon was used to denote the matrix inner product, i.e. $\,A:B={\rm Tr}(A^TB).$
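The final expression can be confirmed numerically with a finite-difference check (the specific $K$, $y$, and $x$ below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 4
B = rng.standard_normal((m, m))
K = B @ B.T + m * np.eye(m)      # arbitrary positive definite symmetric K
y = rng.standard_normal((m, 1))
x0 = 1.2

def l(x):
    M = K + x * np.eye(m)
    return float(-0.5 * y.T @ np.linalg.inv(M) @ y
                 - 0.5 * np.log(np.linalg.det(M)))

M_inv = np.linalg.inv(K + x0 * np.eye(m))

# dl/dx = (1/2) * (y^T M^{-2} y - Tr(M^{-1}))
closed_form = 0.5 * (float(y.T @ M_inv @ M_inv @ y) - np.trace(M_inv))

eps = 1e-6
fd = (l(x0 + eps) - l(x0 - eps)) / (2 * eps)
assert abs(closed_form - fd) < 1e-5
```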
Due to the cyclic property of the trace, there are many ways to shuffle the terms in such a product $$\eqalign{ A:BC &= A^T:(BC)^T = BC:A\cr&= B^TA:C = AC^T:B \cr I:B &= {\rm Tr}(I^TB) = {\rm Tr}(B) \cr }$$ A scalar value is equal to its own trace, therefore $$\eqalign{ x^TAx &= {\rm Tr}(x^TAx) = x:Ax = xx^T:A }$$ One last differential relationship comes from the defining property of the matrix inverse. $$\eqalign{ I &= M^{-1}M \cr dI &= M^{-1}dM + dM^{-1}M \,= 0 \cr dM^{-1} &= -M^{-1}\,dM\,M^{-1} \cr }$$
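All of these identities are easy to confirm numerically; the matrices below are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 4
A, B, C = (rng.standard_normal((m, m)) for _ in range(3))

def inner(X, Y):
    # The colon product X:Y = Tr(X^T Y)
    return np.trace(X.T @ Y)

# Shuffling rules for the colon product
assert np.isclose(inner(A, B @ C), inner(B.T @ A, C))
assert np.isclose(inner(A, B @ C), inner(A @ C.T, B))
assert np.isclose(inner(np.eye(m), B), np.trace(B))

# Differential of the inverse: dM^{-1} ~= -M^{-1} dM M^{-1} for small dM
M = A @ A.T + m * np.eye(m)              # well-conditioned invertible M
dM = 1e-6 * rng.standard_normal((m, m))
lhs = np.linalg.inv(M + dM) - np.linalg.inv(M)
rhs = -np.linalg.inv(M) @ dM @ np.linalg.inv(M)
assert np.allclose(lhs, rhs, atol=1e-9)
```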