Chain rule in matrix calculus


I ran into some problems with the following task.

Given the following vector function: $ \mathbf{u}(\mathbf{x}) = A_{[n, m]} \cdot f(B_{[m, n]} \cdot \mathbf{x})$, $~ \mathbf{u}: \mathbb{R}^{n} \to \mathbb{R}^n$, where $A_{[n, m]}$ is a matrix with $n$ rows and $m$ columns, and $f(\mathbf{v}) = \left\lbrace f(v_1), \dots, f(v_m) \right\rbrace$ is applied elementwise. Taking its derivative with respect to $\mathbf{x}$ gives: $$ \frac{d\mathbf{u}}{d\mathbf{x}} = \mathbf{u}'_{\mathbf{x}} = B_{[m, n]} \cdot A_{[n, m]} \cdot f'(B_{[m, n]} \cdot \mathbf{x}), ~ \mathbf{u}'_{\mathbf{x}}: \mathbb{R}^{n} \to \mathbb{R}^{[m, n]}, $$ or, alternatively: $$ \frac{d\mathbf{u}}{d\mathbf{x}} = \mathbf{u}'_{\mathbf{x}} = A_{[n, m]} \cdot f'(B_{[m, n]} \cdot \mathbf{x}) \cdot (B_{[m, n]})^{\top}, ~ \mathbf{u}'_{\mathbf{x}}: \mathbb{R}^{n} \to \mathbb{R}^{[n, m]}. $$

So, there are three questions:

  1. How can I find the derivative of a more complicated function, such as: $$ \mathbf{u}(\mathbf{x}) = A_{[n, l]} \cdot f(B_{[l, m]} \cdot g(C_{[m, n]} \cdot \mathbf{x}) ), \quad \text{let} ~ \mathbf{v}(\mathbf{x}) = B_{[l, m]} \cdot g(C_{[m, n]} \cdot \mathbf{x}). $$ The following variants seem to work for this function: $$ \frac{d\mathbf{u}}{d\mathbf{x}} = \mathbf{u}'_{\mathbf{x}} = C_{[m,n]} \cdot A_{[n, l]} \cdot f'(\mathbf{v}) \cdot ( B_{[l, m]} \cdot g'(C_{[m, n]} \cdot \mathbf{x}) )^{\top}, ~ \mathbf{u}'_{\mathbf{x}} : \mathbb{R}^{n} \to \mathbb{R}^{[m, l]}, $$ or $$ \frac{d\mathbf{u}}{d\mathbf{x}} = \mathbf{u}'_{\mathbf{x}} = B_{[l, m]} \cdot g'(C_{[m, n]} \cdot \mathbf{x}) \cdot A_{[n, l]} \cdot f'(\mathbf{v}) \cdot ( C_{[m, n]} )^{\top}, ~ \mathbf{u}'_{\mathbf{x}} : \mathbb{R}^{n} \to \mathbb{R}^{[l, m]}. $$ Is there a general formula for the derivative of more deeply nested functions?

  2. Using the same idea, how can I calculate second and higher derivatives of these functions?

  3. How can I calculate combinations involving derivatives, such as $f(x, u, u')$? I think this question could be solved using norms of the objects, treating them as scalars in estimates, but I don't know whether this is a good idea or how accurate such an approximation would be.

Update:

For one of these questions I found a solution using a slightly different approach, which is nevertheless correct for this kind of problem. I found that: $$ u_k = \sum_{i=1}^{m} a_{ki} \cdot f\left( \sum_{j=1}^{n} b_{ij} x_j \right) $$ so: $$ \frac{du_k}{dx_k} = \sum_{i=1}^{m} a_{ki} b_{ik} \cdot f'\left( \sum_{j=1}^{n} b_{ij} x_j \right) $$ which implies an approach like in the comment below: $$ \left( \frac{d\mathbf{u}}{d\mathbf{x}} \right)_k = a_{k, *} \cdot \left( f'\left( B_{[m, n]} \cdot \mathbf{x} \right) \odot b_{*, k} \right), \quad k = 1, \dots, n, $$ where $a_{k, *}$ is the $k$-th row of matrix $A$; $b_{*, k}$ is the $k$-th column of matrix $B$; and $A \odot B$ is the elementwise product of matrices of the same size. This formula gives a vector of derivatives at a given point $\mathbf{x}$ with the same dimension as the initial vector.
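The componentwise formula above can be checked numerically. This is a minimal sketch, assuming $f = \tanh$ and small illustrative sizes for $n$ and $m$ (both choices are assumptions, not part of the original problem); it compares the formula against central finite differences of the diagonal Jacobian entries.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
A = rng.standard_normal((n, m))   # A has n rows, m columns
B = rng.standard_normal((m, n))   # B has m rows, n columns
x = rng.standard_normal(n)

f = np.tanh                        # assumed elementwise nonlinearity
fp = lambda v: 1.0 - np.tanh(v) ** 2   # its derivative f'

u = lambda x: A @ f(B @ x)

# Componentwise formula: du_k/dx_k = a_{k,*} . (f'(Bx) ⊙ b_{*,k})
diag_formula = np.array([A[k, :] @ (fp(B @ x) * B[:, k]) for k in range(n)])

# Central finite differences of the same diagonal entries, for comparison
eps = 1e-6
E = np.eye(n)
diag_fd = np.array([(u(x + eps * E[k])[k] - u(x - eps * E[k])[k]) / (2 * eps)
                    for k in range(n)])

print(np.max(np.abs(diag_formula - diag_fd)))  # close to zero
```

The elementwise product `fp(B @ x) * B[:, k]` plays the role of $\odot$ in the formula.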

Answer

$ \def\BR#1{\left[#1\right]} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\diag#1{\op{diag}\LR{#1}} \def\Diag#1{\op{Diag}\LR{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\frob#1{\left\| #1 \right\|_F} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} $In Matrix Calculus it is usually easier to use differentials instead of the Chain Rule. As an example of the technique, let's work through your first problem.

For typing convenience, define the vector variables $$\eqalign{ y &= Bx &\qiq dy = B\:dx \\ f &= f(y) &\qiq F' = \Diag{f'} \\ f' &= f'(y) &\qiq df = f'\odot dy = F'\,dy \\ }$$ where $\odot$ is the elementwise/Hadamard product and the Diag() function creates a diagonal matrix from its vector argument.

Using the above notation, rewrite the $u$ vector, then calculate its differential and gradient $$\eqalign{ u &= Af \\ du &= A\:df = AF'\,dy = AF'B\:dx \\ \grad{u}{x} &= AF'B = A\cdot\Diag{f'(Bx)}\cdot B \\ }$$ The idea is to manipulate the differential expression (via backsubstitution of more elementary differentials) into a form wherein the gradient can be readily identified $$du = G\:dx \qiq \grad{u}{x} = G$$ Since $G$ is a matrix, if you wish to differentiate a second time (to obtain the Hessian) you will encounter a problem: the gradient of a matrix with respect to a vector is neither a matrix nor a vector, but a third-order tensor. Similarly, matrix-by-matrix differentiation generates fourth-order tensors. And so on.
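To make the result concrete, here is a small numerical sketch of the gradient $\grad{u}{x} = A\,\Diag{f'(Bx)}\,B$, assuming $f = \tanh$ and illustrative sizes (both assumptions); it compares the closed form against a finite-difference Jacobian.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3
A = rng.standard_normal((n, m))
B = rng.standard_normal((m, n))
x = rng.standard_normal(n)

f = np.tanh                           # assumed nonlinearity
fp = lambda v: 1.0 - np.tanh(v) ** 2  # f'

u = lambda x: A @ f(B @ x)

# From the differential:  du = A F' B dx  =>  du/dx = A Diag(f'(Bx)) B
G = A @ np.diag(fp(B @ x)) @ B

# Finite-difference Jacobian for comparison (column i = du/dx_i)
eps = 1e-6
J = np.column_stack([(u(x + eps * e) - u(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])

print(np.max(np.abs(G - J)))  # close to zero
```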

The solution is either to abandon matrix-vector notation and use index notation (i.e. componentwise differentiation), or to use vectorization to flatten matrix variables into vectors.

Either approach will work, but index notation is much more flexible.
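As a sketch of the index-notation route for second derivatives: for $u = Af(Bx)$, componentwise differentiation gives the third-order tensor $$\frac{\p^2 u_a}{\p x_i\,\p x_j} = \sum_b A_{ab}\, f''\!\LR{(Bx)_b}\, B_{bi} B_{bj},$$ which `np.einsum` expresses directly. The choice $f = \tanh$ and the sizes below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 3
A = rng.standard_normal((n, m))
B = rng.standard_normal((m, n))
x = rng.standard_normal(n)

t = np.tanh(B @ x)
fpp = -2.0 * t * (1.0 - t ** 2)   # f'' for the assumed f = tanh

# Hessian tensor H[a, i, j] = sum_b A[a,b] * f''((Bx)_b) * B[b,i] * B[b,j]
H = np.einsum('ab,b,bi,bj->aij', A, fpp, B, B)
print(H.shape)   # (n, n, n): a third-order tensor, as noted above
```

The result is symmetric in its last two indices, as a Hessian should be.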

Update

For nested function problems, consider the following scenario...

After deriving the above gradient wrt $x,\,$ someone tells you that $x$ is not the independent variable, but that $x$ is actually a function of a more primitive variable, i.e. $$\eqalign{ x &= h = h(Cz) &\qiq u = A\,f\LR{B\:h(Cz)} \\ H' &= \Diag{h'} \\ }$$ To calculate $\,\large\grad{u}{z}\:$ you would proceed as follows.

First calculate the differential of $x$ $$\eqalign{ dx &= h'\odot\LR{C\,dz} \\ &= H'C\,dz \\ }$$ Then substitute this into the previously derived expression for $du$ $$\eqalign{ du &= AF'B\;\c{dx} \\ &= AF'B\;\c{H'C\,dz} \\ \grad{u}{z} &= AF'B\,H'C \\ }$$
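The nested result $\grad{u}{z} = AF'B\,H'C$ can also be sketched numerically, assuming $f = h = \tanh$ and illustrative sizes (all assumptions); again a finite-difference Jacobian serves as the check.

```python
import numpy as np

rng = np.random.default_rng(3)
n, l, p = 3, 4, 5                 # x in R^n, y = Bx in R^l, z in R^p (sizes assumed)
A = rng.standard_normal((n, l))
B = rng.standard_normal((l, n))
C = rng.standard_normal((n, p))
z = rng.standard_normal(p)

dtanh = lambda v: 1.0 - np.tanh(v) ** 2   # derivative of the assumed tanh

u = lambda z: A @ np.tanh(B @ np.tanh(C @ z))

Hp = np.diag(dtanh(C @ z))                  # H' = Diag(h'(Cz))
Fp = np.diag(dtanh(B @ np.tanh(C @ z)))     # F' = Diag(f'(B h(Cz)))
G = A @ Fp @ B @ Hp @ C                     # du/dz = A F' B H' C

eps = 1e-6
J = np.column_stack([(u(z + eps * e) - u(z - eps * e)) / (2 * eps)
                     for e in np.eye(p)])

print(np.max(np.abs(G - J)))  # close to zero
```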