I have one function $f: R^{m \times m} \to R$ and another function $g: R \to R^{m \times m}$. I would like to calculate the derivative $\frac{\partial h}{\partial x}$ of the composite function $h(x) = f(g(x))$. The expression I find is really messy and hard to manipulate. I wonder if there is a more 'elegant', algebra-friendly way to do this.
The problem is that $\frac{\partial f}{\partial g}$ is an $(m \times m) \times 1$ tensor and $\frac{\partial g}{\partial x}$ is a $1 \times (m \times m)$ tensor, so naively treating them as $m \times m$ matrices and applying the chain rule with matrix multiplication fails miserably.
$$ \frac{\partial h}{\partial x} \ne \frac{\partial f}{\partial g} \frac{\partial g}{\partial x}, $$ at least not with 2D matrix multiplication.
Indeed, let $$ \frac{\partial f(y)}{\partial y} = \begin{pmatrix} \frac{\partial f}{\partial y_{11}} & \dots & \frac{\partial f}{\partial y_{1m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial y_{m1}} & \dots & \frac{\partial f}{\partial y_{mm}} \end{pmatrix} $$ and
$$ \frac{\partial g(x)}{\partial x} = \begin{pmatrix} \frac{\partial g_{11}}{\partial x} & \dots & \frac{\partial g_{1m}}{\partial x} \\ \vdots & \ddots & \vdots \\ \frac{\partial g_{m1}}{\partial x} & \dots & \frac{\partial g_{mm}}{\partial x} \end{pmatrix} $$
What we want is: $$ \frac{\partial h}{\partial x} = \sum_{i=1}^m \sum_{j=1}^m \frac{\partial f}{\partial y_{ij}} \frac{\partial g_{ij}}{\partial x} $$
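This double-sum chain rule can be verified numerically. The particular choices $f(Y) = \operatorname{tr}(Y^T Y)$ and $g(x) = xA$ below are mine, purely for illustration (they make $\frac{\partial f}{\partial y_{ij}}$ and $\frac{\partial g_{ij}}{\partial x}$ easy to write down):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
A = rng.standard_normal((m, m))

# Illustrative choices: f(Y) = tr(Y^T Y), g(x) = x*A,
# so df/dy_ij = 2*y_ij and dg_ij/dx = A_ij.
def f(Y): return np.trace(Y.T @ Y)
def g(x): return x * A

x0 = 0.7
df_dY = 2 * g(x0)          # m x m matrix of partials df/dy_ij
dg_dx = A                  # m x m matrix of partials dg_ij/dx

# Chain rule as the double sum over all m*m entries
dh_dx = np.sum(df_dY * dg_dx)

# Finite-difference check of h(x) = f(g(x))
eps = 1e-6
fd = (f(g(x0 + eps)) - f(g(x0 - eps))) / (2 * eps)
assert abs(dh_dx - fd) < 1e-5
```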
But it is really difficult for me to do any manipulation if I write things out like this. My question is whether I can write things out more 'analytically'.
Below is a real example, where I want to find $x$ such that $\frac{\partial h}{\partial x}=0$. With the relationship broken up into several equations and the matrix algebra taken apart entrywise, this is really hard.
Let $$ l(\Sigma^{-1})= - \frac{1}{2} y^T \Sigma^{-1} y - \frac{1}{2} \ln |\det \Sigma|, $$ where $\Sigma$ is an $m \times m$ positive definite symmetric matrix and $y$ is an $m \times 1$ vector. (This is the log-likelihood of a multivariate Gaussian distribution up to a constant, but that is not important here.)
$l(\Sigma^{-1})$ can be thought of as an $R^{m \times m} \to R$ function.
Let $$ \Sigma(x) = K + x \mathbf{I}, $$ where $K$ is an $m \times m$ positive definite symmetric matrix and $\mathbf{I}$ is the $m \times m$ identity matrix.
$\Sigma(x)$ can be thought of as an $R \to R^{m \times m}$ function.
It is easy to show that: $$ \frac{\partial l}{\partial \Sigma^{-1}} = - \frac{1}{2} y y^T + \frac{1}{2} \Sigma $$
and $$ \frac{\partial \Sigma^{-1}}{\partial x}= -\Sigma^{-2} $$
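As a sanity check, the second derivative claim is easy to confirm with finite differences (the specific $K$ below is a random positive definite matrix of my choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4
B = rng.standard_normal((m, m))
K = B @ B.T + m * np.eye(m)   # arbitrary positive definite symmetric K
x0 = 0.5

Sigma_inv = np.linalg.inv(K + x0 * np.eye(m))
analytic = -Sigma_inv @ Sigma_inv          # -Sigma^{-2}

# Central finite difference of x -> Sigma(x)^{-1}
eps = 1e-6
fd = (np.linalg.inv(K + (x0 + eps) * np.eye(m))
      - np.linalg.inv(K + (x0 - eps) * np.eye(m))) / (2 * eps)
assert np.allclose(analytic, fd, atol=1e-5)
```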
To find $\frac{\partial l}{\partial x}$, we can do the following:
Let $\gamma_{ij}$ be the entry in the $i$th row and $j$th column of $\Sigma^{-1}$.
It is really: $$ \begin{pmatrix} \frac{\partial l}{\partial \gamma_{11}} & \dots & \frac{\partial l}{\partial \gamma_{1m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial l}{\partial \gamma_{m1}} & \dots & \frac{\partial l}{\partial \gamma_{mm}} \end{pmatrix} = \frac{\partial l}{\partial \Sigma^{-1}} = - \frac{1}{2} y y^T + \frac{1}{2} \Sigma $$ and $$ \begin{pmatrix} \frac{\partial \gamma_{11}}{\partial x} & \dots & \frac{\partial \gamma_{1m}}{\partial x} \\ \vdots & \ddots & \vdots \\ \frac{\partial \gamma_{m1}}{\partial x} & \dots & \frac{\partial \gamma_{mm}}{\partial x} \end{pmatrix} = \frac{\partial \Sigma^{-1}}{\partial x}= -\Sigma^{-2} $$
What we are looking for is really: $$ \frac{\partial l}{\partial x} = \sum_{i=1}^m \sum_{j=1}^m \frac{\partial l}{\partial \gamma_{ij}} \frac{\partial \gamma_{ij}}{\partial x} $$
It takes me three equations to express $\frac{\partial l}{\partial x}$, and it is extremely messy. My problem is that there seems to be no way to write it out more 'analytically' so that I can solve $$ \frac{\partial l}{\partial x} = 0 $$ for $x$ analytically.
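For what it's worth, the three-equation recipe does give the right number; here is a finite-difference check of the double sum (the specific $K$ and $y$ are random, of my choosing):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 4
B = rng.standard_normal((m, m))
K = B @ B.T + m * np.eye(m)      # arbitrary positive definite symmetric K
y = rng.standard_normal((m, 1))
x0 = 0.3

def l(x):
    S = K + x * np.eye(m)
    return float(-0.5 * y.T @ np.linalg.inv(S) @ y
                 - 0.5 * np.log(np.linalg.det(S)))

S = K + x0 * np.eye(m)
S_inv = np.linalg.inv(S)
dl_dSinv = -0.5 * (y @ y.T) + 0.5 * S    # matrix of dl/dgamma_ij
dSinv_dx = -S_inv @ S_inv                # matrix of dgamma_ij/dx

# The double sum over i, j is an elementwise product, then a total sum
dl_dx = np.sum(dl_dSinv * dSinv_dx)

eps = 1e-6
fd = (l(x0 + eps) - l(x0 - eps)) / (2 * eps)
assert abs(dl_dx - fd) < 1e-5
```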
Using differentials is less error-prone (for me) than the chain rule because (algebraically) differentials act like normal vectors and matrices.
Applying the differential technique to your example $\big($for ease of typing let $M=\Sigma\big)$ $$\eqalign{ 2l &= -y^TM^{-1}y - \log\det(M) \cr 2dl &= -y^T\,dM^{-1}\,y - M^{-1}:dM \cr &= +y^T\big(M^{-1}\,dM\,M^{-1}\big)y - M^{-1}:dM \cr &= \Big(M^{-1}yy^TM^{-1} - M^{-1}\Big):dM \cr &= \Big(M^{-1}yy^TM^{-1} - M^{-1}\Big):I\,dx \cr &= {\rm Tr}\Big(M^{-1}yy^TM^{-1} - M^{-1}\Big)\,dx \cr \frac{dl}{dx} &= \tfrac{1}{2}{\rm Tr}\Big(M^{-1}yy^TM^{-1} - M^{-1}\Big) \cr &= \tfrac{1}{2}\Big(y^TM^{-2}y-{\rm Tr}(M^{-1})\Big) \cr }$$ where a colon was used to denote the matrix inner product, i.e. $\,A:B={\rm Tr}(A^TB).$
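The final expression can be confirmed numerically with a finite-difference check (the specific $K$, $y$, and $x$ below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 4
B = rng.standard_normal((m, m))
K = B @ B.T + m * np.eye(m)      # arbitrary positive definite symmetric K
y = rng.standard_normal((m, 1))
x0 = 1.2

def l(x):
    M = K + x * np.eye(m)
    return float(-0.5 * y.T @ np.linalg.inv(M) @ y
                 - 0.5 * np.log(np.linalg.det(M)))

M_inv = np.linalg.inv(K + x0 * np.eye(m))

# dl/dx = (1/2) * (y^T M^{-2} y - Tr(M^{-1}))
closed_form = 0.5 * (float(y.T @ M_inv @ M_inv @ y) - np.trace(M_inv))

eps = 1e-6
fd = (l(x0 + eps) - l(x0 - eps)) / (2 * eps)
assert abs(closed_form - fd) < 1e-5
```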
Due to the cyclic property of the trace, there are many ways to shuffle the terms in such a product $$\eqalign{ A:BC &= A^T:(BC)^T = BC:A\cr&= B^TA:C = AC^T:B \cr I:B &= {\rm Tr}(I^TB) = {\rm Tr}(B) \cr }$$ A scalar value is equal to its own trace, therefore $$\eqalign{ x^TAx &= {\rm Tr}(x^TAx) = x:Ax = xx^T:A }$$ One last differential relationship comes from the defining property of the matrix inverse. $$\eqalign{ I &= M^{-1}M \cr dI &= M^{-1}dM + dM^{-1}M \,= 0 \cr dM^{-1} &= -M^{-1}\,dM\,M^{-1} \cr }$$
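All of these identities are easy to confirm numerically; the matrices below are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 4
A, B, C = (rng.standard_normal((m, m)) for _ in range(3))

def inner(X, Y):
    # The colon product X:Y = Tr(X^T Y)
    return np.trace(X.T @ Y)

# Shuffling rules for the colon product
assert np.isclose(inner(A, B @ C), inner(B.T @ A, C))
assert np.isclose(inner(A, B @ C), inner(A @ C.T, B))
assert np.isclose(inner(np.eye(m), B), np.trace(B))

# Differential of the inverse: dM^{-1} ~= -M^{-1} dM M^{-1} for small dM
M = A @ A.T + m * np.eye(m)              # well-conditioned invertible M
dM = 1e-6 * rng.standard_normal((m, m))
lhs = np.linalg.inv(M + dM) - np.linalg.inv(M)
rhs = -np.linalg.inv(M) @ dM @ np.linalg.inv(M)
assert np.allclose(lhs, rhs, atol=1e-9)
```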