How to derive matrices during backpropagation in neural network


I'm trying to understand the basic structure of feedforward neural networks by implementing a simple network with one hidden layer (3 input nodes, 4 hidden nodes, 1 output node). I've calculated the gradient of the cost function (squared error) w.r.t. the weights between the hidden layer and the output layer ($\mathbf{w}^{(2)}$) using the chain rule, but when I try to calculate the gradient w.r.t. the weights between the input layer and the hidden layer ($\mathbf{W}^{(1)}$) I run into dimensional problems. Because $\mathbf{W}^{(1)}$ is a matrix, I end up needing a vector-by-matrix derivative. $$ \frac{\partial C}{\partial \mathbf{W}^{(1)}} = \frac{\partial C}{\partial \hat y}\cdot{\frac{\partial \hat y}{\partial z^{(2)}}} \cdot{\frac{\partial z^{(2)}}{\partial \mathbf{a}} \cdot{\frac{\partial \mathbf{a}}{\partial \mathbf{z}^{(1)}}} \cdot{\frac{\partial \mathbf{z}^{(1)}}{\partial \mathbf{W}^{(1)}}}}$$

Here $\hat y$ is the output for some input $\begin{bmatrix} x_1 \\ x_2 \\ x_3\end{bmatrix}$, $\mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4\end{bmatrix}$ are the activations of the hidden layer, $z^{(2)} = \sum_{i=1}^4w_i^{(2)}a_i$ and $\mathbf{z}^{(1)} = \begin{bmatrix} \sum_{j=1}^3w_{1j}^{(1)}x_j \\ \sum_{j=1}^3w_{2j}^{(1)}x_j \\ \sum_{j=1}^3w_{3j}^{(1)}x_j \\ \sum_{j=1}^3w_{4j}^{(1)}x_j\end{bmatrix}$. All activation functions are sigmoid.
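To make the setup concrete, here is the forward pass I'm describing as a small NumPy sketch (the weight values and input are just placeholders I made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Shapes as described: 3 input nodes, 4 hidden nodes, 1 output node.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # W^(1): weights input -> hidden
w2 = rng.standard_normal(4)        # w^(2): weights hidden -> output

x = np.array([0.1, 0.5, -0.2])     # example input [x1, x2, x3]

z1 = W1 @ x          # z^(1), shape (4,): each entry is sum_j W1[i, j] * x[j]
a = sigmoid(z1)      # hidden activations a, shape (4,)
z2 = w2 @ a          # z^(2), a scalar: sum_i w2[i] * a[i]
y_hat = sigmoid(z2)  # network output
```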

I do not know how to calculate a vector-by-matrix derivative and therefore don't know how to calculate $\frac{\partial \mathbf{z}^{(1)}}{\partial \mathbf{W}^{(1)}}$. I don't even know if I am doing this correctly, since the articles I've read don't go into how to calculate the gradient beyond the first layer of weights, and I have no previous experience with neural networks. I understand how to differentiate w.r.t. the weights individually, but I don't know how to translate this into matrix form other than just multiplying the matrices as Hadamard products. What is the convention when calculating such gradients for backpropagation? I've read this slide from some course on neural networks where they mention vector-by-matrix derivatives, but since I don't know the context of the symbols I'm unable to figure it out.
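Since I can compute the per-weight derivatives by hand, I at least checked them numerically. Below is my attempt in NumPy: I stack the element-wise derivatives as an outer product $\boldsymbol{\delta}^{(1)}\mathbf{x}^\top$ (my guess at the convention I'm asking about) and compare one entry against a finite difference. The target value $y$ is made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(W1, w2, x):
    a = sigmoid(W1 @ x)                 # hidden activations
    return sigmoid(w2 @ a), a           # output y_hat and a

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 3))        # W^(1)
w2 = rng.standard_normal(4)             # w^(2)
x = np.array([0.1, 0.5, -0.2])
y = 0.7                                 # hypothetical target

def cost(W1):
    y_hat, _ = forward(W1, w2, x)
    return 0.5 * (y_hat - y) ** 2       # squared error

# Per-weight derivatives via the chain rule, stacked as an outer
# product delta1 @ x^T (my guess at the matrix-form convention):
y_hat, a = forward(W1, w2, x)
delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # dC/dz^(2), scalar
delta1 = delta2 * w2 * a * (1 - a)           # dC/dz^(1), shape (4,)
grad_W1 = np.outer(delta1, x)                # dC/dW^(1), shape (4, 3)

# Central-difference check of a single entry W1[0, 0]:
eps = 1e-6
E = np.zeros_like(W1)
E[0, 0] = eps
numeric = (cost(W1 + E) - cost(W1 - E)) / (2 * eps)
```

The numeric value agrees with `grad_W1[0, 0]` to several decimal places, so the element-wise derivatives seem right; my question is about how to express the whole thing in matrix form.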

I would really appreciate some help, since this is a school project and the derivation can't be omitted.