Partial derivative of a matrix w.r.t. another matrix


I'm currently studying neural networks and how backpropagation works, and I have a question regarding the chain rule for partial derivatives. The particular exercise problem I'm solving is as follows:

We have an input matrix $X \in \Bbb{R}^{N \times D}$ with $N$ samples, each of dimension $D$. We have two weight matrices and two corresponding bias vectors, as follows:

$$ \begin{align} W_1 & \in \Bbb{R}^{D \times H} \\ b_1 & \in \Bbb{R}^H \\ W_2 & \in \Bbb{R}^{H \times C} \\ b_2 & \in \Bbb{R}^C \end{align} $$

The neural network presented in the problem can be depicted as follows:

$$ \begin{align} H_0 & = XW_1 \in \Bbb{R}^{N \times H} \\ H_1 & = H_0 + b_1 \in \Bbb{R}^{N \times H} \\ H_2 & = \max\{0,\ H_1\} \in \Bbb{R}^{N \times H} \\ H_3 & = H_2 W_2 \in \Bbb{R}^{N \times C} \\ Z & = H_3 + b_2 \in \Bbb{R}^{N \times C} \end{align} $$
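The forward pass above can be sketched directly in NumPy. The dimensions below are hypothetical placeholders chosen for illustration; note that adding the bias vectors relies on NumPy broadcasting a row vector across all $N$ rows.

```python
import numpy as np

# Hypothetical dimensions, for illustration only.
N, D, H, C = 4, 5, 3, 2
rng = np.random.default_rng(0)

X  = rng.standard_normal((N, D))
W1 = rng.standard_normal((D, H))
b1 = rng.standard_normal(H)
W2 = rng.standard_normal((H, C))
b2 = rng.standard_normal(C)

H0 = X @ W1             # (N, H)
H1 = H0 + b1            # b1 is broadcast across the N rows
H2 = np.maximum(0, H1)  # elementwise ReLU
H3 = H2 @ W2            # (N, C)
Z  = H3 + b2            # (N, C), final output
```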

Each $H_i$ represents a node in the computational graph, with $Z$ being the final output.

I've managed to calculate the gradients of each node as follows:

$$ \begin{align} \frac{\partial L}{\partial b_2} & = \frac{\partial L}{\partial Z} \times \frac{\partial Z}{\partial b_2} = \frac{\partial L}{\partial Z} \cdot \vec{\mathbf{1}} \\ \frac{\partial L}{\partial H_3} & = \frac{\partial L}{\partial Z} \times \frac{\partial Z}{\partial H_3} = \frac{\partial L}{\partial Z} \times \ ? \\ \frac{\partial L}{\partial W_2} & = \frac{\partial L}{\partial H_3} \times \frac{\partial H_3}{\partial W_2} = \frac{\partial L}{\partial H_3} \times H_2 \\ \frac{\partial L}{\partial H_2} & = \frac{\partial L}{\partial H_3} \times \frac{\partial H_3}{\partial H_2} = \frac{\partial L}{\partial H_3} \times W_2 \\ \frac{\partial L}{\partial H_1} & = \frac{\partial L}{\partial H_2} \times \frac{\partial H_2}{\partial H_1} = \frac{\partial L}{\partial H_2} \times \mathbf{1}_{(H_1 \gt 0)} \\ \frac{\partial L}{\partial H_0} & = \frac{\partial L}{\partial H_1} \times \frac{\partial H_1}{\partial H_0} = \frac{\partial L}{\partial H_1} \times \ ? \\ \frac{\partial L}{\partial W_1} & = \frac{\partial L}{\partial H_0} \times \frac{\partial H_0}{\partial W_1} = \frac{\partial L}{\partial H_0} \times X \\ \frac{\partial L}{\partial X} & = \frac{\partial L}{\partial H_0} \times \frac{\partial H_0}{\partial X} = \frac{\partial L}{\partial H_0} \times W_1 \end{align} $$
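For reference, here is how these chain-rule steps are typically realized in code, where the schematic $\times$ becomes a matrix product with transposes chosen so that each gradient has the same shape as the variable it corresponds to. The upstream gradient `dZ` is a random stand-in for $\partial L / \partial Z$, and all shapes are hypothetical, for illustration:

```python
import numpy as np

# Hypothetical shapes, for illustration only.
N, D, H, C = 4, 5, 3, 2
rng = np.random.default_rng(0)
X  = rng.standard_normal((N, D))
W1 = rng.standard_normal((D, H))
b1 = rng.standard_normal(H)
W2 = rng.standard_normal((H, C))
b2 = rng.standard_normal(C)

# Forward pass
H0 = X @ W1
H1 = H0 + b1
H2 = np.maximum(0, H1)
H3 = H2 @ W2
Z  = H3 + b2
dZ = rng.standard_normal((N, C))  # stand-in for dL/dZ

# Backward pass
db2 = dZ.sum(axis=0)     # (C,)  sum over the N samples
dH3 = dZ                 # addition passes the gradient through unchanged
dW2 = H2.T @ dH3         # (H, C)
dH2 = dH3 @ W2.T         # (N, H)
dH1 = dH2 * (H1 > 0)     # ReLU mask zeroes gradients where H1 <= 0
dH0 = dH1                # addition again passes the gradient through
db1 = dH1.sum(axis=0)    # (H,)
dW1 = X.T @ dH0          # (D, H)
dX  = dH0 @ W1.T         # (N, D)
```

Checking that each gradient's shape matches its variable (`dW1.shape == W1.shape`, etc.) is a quick way to catch a missing transpose.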

The two question marks I added stand for the partial derivatives $\partial Z / \partial H_3$ and $\partial H_1 / \partial H_0$, which I don't know how to compute.

When deriving $\partial Z / \partial b_2$, $b_2$ is a one-dimensional vector, so the derivative with respect to it is a vector of $1$'s. Does the same idea carry over to matrices, and if so, how would we compute these partial derivatives? One thought I had is that in the matrix case we would get the identity matrix, but I'm not sure whether that is actually true.
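One way to sanity-check this intuition numerically is a finite-difference test. Since $Z = H_3 + b_2$ is elementwise addition, perturbing one entry of $H_3$ by $\varepsilon$ changes the corresponding entry of $Z$ by $\varepsilon$, so $\partial L / \partial H_3$ should equal $\partial L / \partial Z$ entry for entry. The toy loss and shapes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, C = 3, 2
H3 = rng.standard_normal((N, C))
b2 = rng.standard_normal(C)

def loss(H3):
    Z = H3 + b2
    return 0.5 * np.sum(Z ** 2)  # toy loss with dL/dZ = Z

# Analytic claim: dL/dH3 equals dL/dZ elementwise.
dL_dZ = H3 + b2

# Central finite differences w.r.t. each entry of H3
eps = 1e-6
num = np.zeros_like(H3)
for i in range(N):
    for j in range(C):
        Hp = H3.copy(); Hp[i, j] += eps
        Hm = H3.copy(); Hm[i, j] -= eps
        num[i, j] = (loss(Hp) - loss(Hm)) / (2 * eps)

assert np.allclose(num, dL_dZ, atol=1e-5)
```

The assertion passing is consistent with the addition node simply copying the upstream gradient through, rather than requiring an explicit Jacobian matrix.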