derivative of matrix with respect to vector


I need to calculate the derivative of a matrix with respect to a vector.

< Given Equation >
1) $\mathbb Y = \mathbb A \mathbb X$, where
$\mathbb A$: an (n$\times$n) matrix,
$\mathbb X$: an (n$\times$1) vector.


2) All elements of $\mathbb A$ and $\mathbb X$ are functions of the $z_i$, where
$\mathbb Z = [z_1\ z_2\ \cdots\ z_m]^\top$ is an (m$\times$1) vector.
In other words,
$\mathbb Y(z)=\mathbb A(z)\, \mathbb X(z)$

< Problem definition >
I want to calculate the following partial derivative: $\frac{\partial \mathbb Y}{\partial \mathbb Z}$, which yields an (n$\times$m) matrix.
From the ordinary product rule for differentiation, it looks as if the rule could be extended (with some modifications) to the matrix/vector case:

$\frac{\partial \mathbb Y}{\partial \mathbb Z} = \frac{\partial (\mathbb A \mathbb X)}{\partial \mathbb Z} = \frac{\partial \mathbb A}{\partial \mathbb Z}\mathbb X + \mathbb A \frac{\partial \mathbb X}{\partial \mathbb Z}$

However, the above rule cannot be right as written: the first term's dimensions do not match the expected (n$\times$m).

I want to calculate the derivative without explicitly differentiating every element of the output $\mathbb Y$. How can I solve this problem?
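
For context, here is a quick NumPy shape check (with arbitrary sizes $n=3$, $m=2$, chosen only for illustration) that makes the mismatch explicit: $\frac{\partial \mathbb A}{\partial \mathbb Z}$ naturally carries three indices, so the first term is not an ordinary matrix product, while the second term already has the right (n$\times$m) shape.

```python
import numpy as np

# Arbitrary sizes for illustration only.
n, m = 3, 2

A     = np.zeros((n, n))        # A(z), an (n x n) matrix
X     = np.zeros(n)             # X(z), an (n x 1) vector
dX_dZ = np.zeros((n, m))        # ordinary (n x m) Jacobian of X w.r.t. Z
dA_dZ = np.zeros((n, n, m))     # dA/dZ needs three indices: (i, j, k) for dA_ij/dz_k

print((A @ dX_dZ).shape)        # (3, 2): the second term already has shape (n x m)
# The first term is not a plain matrix product; it needs a contraction over the
# extra index, e.g. np.einsum('ijk,j->ik', dA_dZ, X), which is again (n x m).
print(np.einsum('ijk,j->ik', dA_dZ, X).shape)   # (3, 2)
```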


There are 3 answers below.

BEST ANSWER

Your formula is essentially correct, once it is interpreted properly.

Let's first investigate $\frac{\partial\mathbb{A}}{\partial\mathbb{Z}}$. $\mathbb{A}$ is an $n\times n$ matrix and $\mathbb{Z}$ is a vector with $m$ entries. This means that to specify a derivative you need three coordinates: the pair $(i,j)$ for the entry of $\mathbb{A}$ and $k$ for the choice of variable for the derivative. Therefore, $\frac{\partial\mathbb{A}}{\partial\mathbb{Z}}$ is really a $3$-tensor, and a $3$-tensor times a vector is a matrix.

Similarly, $\frac{\partial\mathbb{X}}{\partial\mathbb{Z}}$ is a matrix because there are two coordinates, $i$ for the entry of $\mathbb{X}$ and $j$ for the choice of derivative. Hence, $\mathbb{A}\frac{\partial\mathbb{X}}{\partial\mathbb{Z}}$ is a product of matrices, and is itself a matrix.

If you want to figure out the formula a little more explicitly, write $\mathbb{A}=(a_{ij}(z))$ and $\mathbb{X}=(x_j(z))$, so that $$ (\mathbb{Y})_i=(\mathbb{A}\mathbb{X})_i=\sum_j a_{ij}(z)x_j(z). $$ The partial derivative of this with respect to $z_k$ is $$ \frac{\partial}{\partial z_k}(\mathbb{Y})_i=\frac{\partial}{\partial z_k}\sum_j a_{ij}(z)x_j(z)=\sum_j\left(\frac{\partial}{\partial z_k}a_{ij}(z)\right)x_j(z)+\sum_ja_{ij}(z)\frac{\partial}{\partial z_k}x_j(z). $$ We can then combine all of these into a vector by dropping the $i$ to get $$ \frac{\partial\mathbb{Y}}{\partial z_k}=\frac{\partial\mathbb{A}}{\partial z_k}\mathbb{X}+\mathbb{A}\frac{\partial\mathbb{X}}{\partial z_k}. $$ This gives you the columns of the Jacobian, which can then be assembled into the full $(n\times m)$ matrix.
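
If it helps, here is a minimal NumPy sketch of this column-by-column recipe (the particular $\mathbb A(z)$ and $\mathbb X(z)$ below are made-up toy functions, not from the question). Each assembled column $\frac{\partial\mathbb{A}}{\partial z_k}\mathbb{X}+\mathbb{A}\frac{\partial\mathbb{X}}{\partial z_k}$ is compared against a central finite difference of $\mathbb Y$ itself:

```python
import numpy as np

n, m = 3, 2

# Toy functions of z (illustrative choices only).
def A(z):
    z0, z1 = z
    return np.array([[z0,         z1,     1.0],
                     [np.sin(z0), 2.0,    z1**2],
                     [1.0,        z0*z1,  0.5]])

def x(z):
    z0, z1 = z
    return np.array([np.cos(z1), z0**2, z0 + z1])

def y(z):
    return A(z) @ x(z)

# Hand-computed partial derivatives of the toy A and x with respect to z_k.
def dA_dzk(z, k):
    z0, z1 = z
    if k == 0:
        return np.array([[1.0,        0.0, 0.0],
                         [np.cos(z0), 0.0, 0.0],
                         [0.0,        z1,  0.0]])
    return np.array([[0.0, 1.0, 0.0],
                     [0.0, 0.0, 2.0*z1],
                     [0.0, z0,  0.0]])

def dx_dzk(z, k):
    z0, z1 = z
    if k == 0:
        return np.array([0.0, 2.0*z0, 1.0])
    return np.array([-np.sin(z1), 0.0, 1.0])

z = np.array([0.3, -1.2])

# Column k of the Jacobian:  dY/dz_k = (dA/dz_k) X + A (dX/dz_k)
J = np.column_stack([dA_dzk(z, k) @ x(z) + A(z) @ dx_dzk(z, k) for k in range(m)])

# Compare with a central finite difference of y itself.
h = 1e-6
J_fd = np.column_stack([(y(z + h*np.eye(m)[k]) - y(z - h*np.eye(m)[k])) / (2.0*h)
                        for k in range(m)])

print(J.shape)                          # (3, 2), i.e. (n, m)
print(np.allclose(J, J_fd, atol=1e-6))  # True
```

Stacking the $m$ columns reproduces the full $(n\times m)$ Jacobian $\frac{\partial\mathbb Y}{\partial\mathbb Z}$.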

ANSWER

For typing convenience, use the convention wherein lower/upper case letters denote vectors/matrices, and rename the problem's variables
$$\{\mathbb A,\mathbb X,\mathbb Y,\mathbb Z\} \to \{A,x,y,z\}.$$
The gradient with respect to a scalar component of $z$ obeys the usual product rule:
$$y = Ax \quad\implies\quad \frac{\partial y}{\partial z_k} = A\left(\frac{\partial x}{\partial z_k}\right) + \left(\frac{\partial A}{\partial z_k}\right)x.$$
To convert these component gradients into the desired matrix-valued gradient, multiply by the corresponding vector $e_k$ from the standard basis for $\mathbb R^{m}$ and sum:
$$\frac{\partial y}{\partial z} = \sum_k \left(\frac{\partial y}{\partial z_k}\right)e_k^T = A\left(\frac{\partial x}{\partial z}\right) + x^T\!\left(\frac{\partial A^T}{\partial z}\right).$$
The final term is ugly because the quantity in parentheses is a third-order tensor, which is awkward to render in matrix notation but trivial to write in index notation:
$$\frac{\partial y_i}{\partial z_j} = \sum_k \left[A_{ik}\left(\frac{\partial x_k}{\partial z_j}\right) + x_k\left(\frac{\partial A_{ik}}{\partial z_j}\right)\right].$$
A common technique to side-step the tensor issue is to vectorize the $A$ matrix:
$$a = \operatorname{vec}(A),\qquad y = Ax$$
$$dy = A\,dx + dA\,x = A\,dx + \left(x^T\otimes I\right)da$$
$$\frac{\partial y}{\partial z} = A\left(\frac{\partial x}{\partial z}\right) + \left(x^T\otimes I\right)\left(\frac{\partial a}{\partial z}\right)$$
where $(\otimes)$ denotes the Kronecker product and $I\in\mathbb R^{n\times n}$ is the identity matrix.
This puts everything in terms of familiar vector-by-vector gradients.
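
As a sanity check of the vectorized identity $dy = A\,dx + (x^T\otimes I)\,da$, here is a small NumPy sketch (the functions $A(z)$ and $x(z)$ are arbitrary choices, not part of the original problem) that builds the Jacobian from $\frac{\partial x}{\partial z}$ and $\frac{\partial\operatorname{vec}(A)}{\partial z}$ and compares it with a direct finite difference of $y$:

```python
import numpy as np

n, m = 3, 2
rng = np.random.default_rng(0)

# Illustrative smooth functions of z (made up for this sketch).
B0, B1, B2 = rng.standard_normal((3, n, n))
c0, c1, c2 = rng.standard_normal((3, n))

def A(z):                        # A(z) = B0 + z_0*B1 + sin(z_1)*B2
    return B0 + z[0]*B1 + np.sin(z[1])*B2

def x(z):                        # x(z) = c0 + z_1*c1 + z_0^2*c2
    return c0 + z[1]*c1 + z[0]**2 * c2

def vec(M):                      # column-stacking vectorization vec(A)
    return M.flatten(order='F')

def y(z):
    return A(z) @ x(z)

z = np.array([0.4, -0.7])
h = 1e-6

# Jacobians dx/dz (n x m) and d vec(A)/dz (n^2 x m), here via central differences.
Jx = np.column_stack([(x(z + h*np.eye(m)[k]) - x(z - h*np.eye(m)[k])) / (2*h)
                      for k in range(m)])
Ja = np.column_stack([(vec(A(z + h*np.eye(m)[k])) - vec(A(z - h*np.eye(m)[k]))) / (2*h)
                      for k in range(m)])

# dy/dz = A (dx/dz) + (x^T kron I) (d vec(A)/dz)
J = A(z) @ Jx + np.kron(x(z)[None, :], np.eye(n)) @ Ja

# Direct finite difference of y for comparison.
Jy = np.column_stack([(y(z + h*np.eye(m)[k]) - y(z - h*np.eye(m)[k])) / (2*h)
                      for k in range(m)])
print(np.allclose(J, Jy, atol=1e-5))   # True
```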

ANSWER

Write $y(z) = A(z)x(z)$. We can approach the computation of the linear map $Dy(z) : \mathbb{R}^m \to \mathbb{R}^n$ using the Fréchet derivative. For $v \in \mathbb{R}^m$ small, we have
$$A(z + v) = A(z) + DA(z)v + o(v),$$
$$x(z + v) = x(z) + Dx(z)v + o(v).$$
Therefore
\begin{align} y(z + v) &= (A(z) + DA(z)v)(x(z) + Dx(z)v) + o(v) \\ &= A(z)x(z) + A(z)Dx(z)v + (DA(z)v)x(z) + o(v). \end{align}
Thus
$$Dy(z)v = A(z)Dx(z)v + (DA(z)v)x(z).$$
The $j$th column of $Dy(z)$ is therefore
\begin{align} \frac{\partial y}{\partial z_j}(z) &= Dy(z)e_j \\ &= A(z)Dx(z)e_j + (DA(z)e_j)x(z) \\ &= A(z)\frac{\partial x}{\partial z_j}(z) + \frac{\partial A}{\partial z_j}(z)x(z). \end{align}
It actually does look identical to the product rule! This is no coincidence, since the argument above is exactly the proof of the scalar product rule carried out with matrices.
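
Here is a brief NumPy sketch (with made-up $A(z)$ and $x(z)$, purely for illustration) that checks the identity $Dy(z)v = A(z)Dx(z)v + (DA(z)v)\,x(z)$ against a central-difference directional derivative:

```python
import numpy as np

n, m = 3, 2
rng = np.random.default_rng(1)

# Illustrative smooth A(z) and x(z) (arbitrary choices, not from the answer).
B0, B1, B2 = rng.standard_normal((3, n, n))
c = rng.standard_normal(n)

def A(z):
    return B0 + np.exp(z[0])*B1 + z[1]*B2

def x(z):
    return c * np.cos(z[0]) + np.array([z[1], z[0]*z[1], 1.0])

def y(z):
    return A(z) @ x(z)

def directional(f, z, v, h=1e-6):
    """Central-difference approximation of the directional derivative Df(z)v."""
    return (f(z + h*v) - f(z - h*v)) / (2*h)

z = np.array([0.2, 1.5])
v = rng.standard_normal(m)       # an arbitrary direction in R^m

lhs = directional(y, z, v)                                        # Dy(z) v
rhs = A(z) @ directional(x, z, v) + directional(A, z, v) @ x(z)   # A Dx v + (DA v) x
print(np.allclose(lhs, rhs, atol=1e-6))   # True
```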