Row-wise derivative of a vector with respect to a matrix


Consider an $n\times 1$ column vector $\overline {z}$ and an $n\times m$ matrix $W$. What would one call and denote the $n\times m$ matrix of derivatives defined by $M_{ij}=\frac{\partial z_i}{\partial W_{ij}}$?

To provide some context, this matrix appears when deriving the matrix form of the backpropagation updates for a neural network: $\overline {z}$ is the vector of inputs to a particular layer for a particular training instance (each input is a linear combination of the activations received from the previous layer), and $W$ is the matrix of weights at that layer.
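To make the object in question concrete, here is a minimal numerical sketch (the names `W`, `a`, `b` and the sizes are assumptions for illustration): it builds the matrix $M_{ij}=\partial z_i/\partial W_{ij}$ by finite differences for a layer $z = Wa + b$, and checks that every row of $M$ equals the incoming activation vector $a$.

```python
import numpy as np

# Hypothetical layer z = W @ a + b with n = 3 outputs, m = 4 inputs,
# for a single training instance a (all names/sizes are assumptions).
rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
a = rng.standard_normal(m)
b = rng.standard_normal(n)

def z(W):
    return W @ a + b

# M[i, j] = dz_i / dW_ij, estimated by central finite differences.
eps = 1e-6
M = np.empty((n, m))
for i in range(n):
    for j in range(m):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        M[i, j] = (z(Wp)[i] - z(Wm)[i]) / (2 * eps)

# Since z_i = sum_j W_ij a_j + b_i, every row of M equals a.
assert np.allclose(M, np.tile(a, (n, 1)), atol=1e-6)
```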


2 Answers

Answer 1

$\def\p#1#2{\frac{\partial #1}{\partial #2}}$The network layer equation in matrix and index notation is $$\eqalign{ z &= Wa + b \\ z_i &= W_{i\ell}a_\ell +b_i \\ }$$ When taking derivatives in index notation, take care not to repeat an index unintentionally, because a repeated index implies summation over it (like the $\ell$ index in the equation above).

With that in mind, the derivative of $z$ with respect to $W$ should be calculated as $$\eqalign{ \p{z_i}{W_{kj}} &= \left(\p{W_{i\ell}}{W_{kj}}\right)a_\ell \\ &= \left(\delta_{ik}\delta_{j\ell}\right)a_\ell \\ &= \delta_{ik}\,a_j \\ }$$ This third-order tensor is the correct derivative to use for backpropagation problems.
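The closed form $\p{z_i}{W_{kj}}=\delta_{ik}\,a_j$ can be checked numerically. A hedged sketch (variable names and sizes are assumptions): build the $n\times n\times m$ tensor from the formula, then compare it against a finite-difference estimate of the derivative of $z=Wa+b$.

```python
import numpy as np

# Hypothetical sizes n = 3, m = 4 (assumptions for illustration).
rng = np.random.default_rng(1)
n, m = 3, 4
W = rng.standard_normal((n, m))
a = rng.standard_normal(m)
b = rng.standard_normal(n)

# Closed form: F[i, k, j] = delta_ik * a_j  (an n x n x m tensor).
F = np.einsum('ik,j->ikj', np.eye(n), a)

# Numerical check via central finite differences on z = W a + b.
eps = 1e-6
F_num = np.empty((n, n, m))
for k in range(n):
    for j in range(m):
        Wp, Wm = W.copy(), W.copy()
        Wp[k, j] += eps
        Wm[k, j] -= eps
        F_num[:, k, j] = ((Wp @ a + b) - (Wm @ a + b)) / (2 * eps)

assert np.allclose(F, F_num, atol=1e-6)
```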

If you set the index $k=i$, that usually implies summation over $i$ (the Einstein convention) $$\eqalign{ \p{z_i}{W_{ij}} &= \delta_{ii}\,a_j \\ &= \sum_{i=1}^n \delta_{ii}\,a_j \\ &= na_j \\ }$$ If instead you want to equate the indices without summing, then you obtain $$\eqalign{ M_{ij} = \delta_{ii}\,a_j = {\tt1}_i\,a_j \\ }$$ but this matrix is not the right quantity to use for backprop.
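The distinction between the two ways of setting $k=i$ can be illustrated directly with `np.einsum`, which sums a repeated index dropped from the output but merely takes a diagonal slice when the index is kept (sizes below are assumptions for illustration):

```python
import numpy as np

# F[i, k, j] = delta_ik * a_j, as in the derivation above.
rng = np.random.default_rng(2)
n, m = 3, 4
a = rng.standard_normal(m)
F = np.einsum('ik,j->ikj', np.eye(n), a)

# Einstein convention: the repeated i is summed, giving n * a_j.
summed = np.einsum('iij->j', F)
assert np.allclose(summed, n * a)

# Equating indices WITHOUT summing: M_ij = F[i, i, j],
# a diagonal slice in which every row equals a.
M = np.einsum('iij->ij', F)
assert np.allclose(M, np.tile(a, (n, 1)))
```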

Answer 2

Suppose $z$ is $n\times 1$ and $W$ is $n\times m$. The derivative of $z$ with respect to $W$ is given by a three-dimensional array with $$F_{ijk}=\frac{\partial z_i}{\partial W_{jk}}$$

For example, the derivative of $z_1$ with respect to $W$ is given by the matrix

$$F_{1jk}=\begin{pmatrix}\big[\frac{\partial z_1}{\partial W_{11}}&\cdots&\frac{\partial z_1}{\partial W_{1m}}\big]\\ \vdots&&\vdots\\ \frac{\partial z_1}{\partial W_{n1}}&\cdots&\frac{\partial z_1}{\partial W_{nm}}\end{pmatrix}$$

and you are only taking the first (bracketed) row. There are $n$ of these slices $F_{ijk}$, one for each $i$ from $1$ to $n$, and from each you take the $i$th row; stacking these rows gives your matrix, the "derivative of $z$ with respect to the corresponding row of $W$." If $F_{ijk}$ is the derivative of $z$ with respect to $W$, you can therefore represent your matrix by

$$G_{ik}=F_{iik}$$

There may be reasons why you want to ignore the derivatives of $z_i$ with respect to rows of $W$ other than the $i$th, e.g. they could all be zero. Here is an example: suppose that $z=Wx$ with $x$ an $m\times 1$ vector:

$$\underbrace{z}_{\begin{pmatrix}z_1\\\vdots\\z_n\end{pmatrix}}=\underbrace{W}_{\begin{pmatrix}W_{11}&\cdots&W_{1m}\\\vdots&&\vdots\\W_{n1}&\cdots&W_{nm}\end{pmatrix}}\underbrace{x}_{\begin{pmatrix}x_1\\\vdots\\x_m\end{pmatrix}}$$ Notice that $z_i=\sum_{j=1}^m W_{ij}x_j$ for $i=1,\dots,n$. Therefore the derivative of $z_i$ with respect to $W_{jk}$ is $0$ for $j\ne i$. Take an example: $z_3=\sum_{j=1}^m W_{3j}x_j$, so $\frac{\partial z_3}{\partial W_{41}}=0$. In fact the derivative of $z_3$ with respect to $W$ would look like $$\begin{pmatrix}0&0&\cdots&0\\0&0&\cdots&0\\ \big[x_1&x_2&\cdots&x_m\big]\\0&0&\cdots&0\\\vdots&&\vdots\\0&0&\cdots&0\end{pmatrix}$$

that is, $F_{33k}=x_k$ but $F_{3jk}=0$ for $j\ne 3$. In this case, again, you can represent the otherwise 3-D array as a 2-D matrix without loss of information with

$$G_{ik}=F_{iik}$$

and in this particular case

$$G_{ik}=x_{k}$$

that is,

$$G=\begin{pmatrix}x_1&x_2&\cdots&x_m\\x_1&x_2&\cdots&x_m\\\vdots&&\vdots\\x_1&x_2&\cdots&x_m\end{pmatrix}$$

capturing all the information that the derivative with respect to $W$ would have had. The fact that you mentioned "linear combinations" suggests that something like this, though perhaps not as simple as $z=Wx$, is happening with your matrix derivatives.
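The $z=Wx$ example above can be sketched numerically. In this hedged sketch (sizes $n=4$, $m=3$ are assumptions), the slice $F_{3jk}$ is zero except for its third row, and the compressed matrix $G_{ik}=F_{iik}$ has every row equal to $x$:

```python
import numpy as np

# Example sizes (assumptions for illustration): n = 4, m = 3.
rng = np.random.default_rng(3)
n, m = 4, 3
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)

# Full derivative of z = W x: F[i, j, k] = dz_i/dW_jk = delta_ij * x_k.
F = np.einsum('ij,k->ijk', np.eye(n), x)

# The slice F[2] (derivative of z_3, 0-indexed) is zero everywhere
# except its third row, which equals x.
assert np.allclose(F[2, 2], x)
assert np.allclose(np.delete(F[2], 2, axis=0), 0.0)

# G_ik = F_iik: each row of G is x, so nothing is lost by compressing.
G = np.einsum('iik->ik', F)
assert np.allclose(G, np.tile(x, (n, 1)))
```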