I am learning about backpropagation in neural networks and do not understand how the matrix multiplication works in that context. I set up this example network:
I now want to get the derivative of the cost for a single training example with respect to the weights of the first layer. The formula is, with vectors denoted in bold:
$\frac{\partial C_k}{\partial \textbf W^{(1)}} = \frac{\partial C_k}{\partial \textbf a^{(2)}}.\frac{\partial \textbf a^{(2)}}{\partial \textbf a^{(1)}}.\frac{\partial \textbf a^{(1)}}{\partial \textbf z^{(1)}}.\frac{\partial \textbf z^{(1)}}{\partial \textbf W^{(1)}}$
If we expand $\frac{\partial \textbf a^{(2)}}{\partial \textbf a^{(1)}}$ into $\frac{\partial \textbf a^{(2)}}{\partial \textbf z^{(2)}}.\frac{\partial \textbf z^{(2)}}{\partial \textbf a^{(1)}}$, we get
$\frac{\partial C_k}{\partial \textbf W^{(1)}} = \frac{\partial C_k}{\partial \textbf a^{(2)}}.\frac{\partial \textbf a^{(2)}}{\partial \textbf z^{(2)}}.\frac{\partial \textbf z^{(2)}}{\partial \textbf a^{(1)}}.\frac{\partial \textbf a^{(1)}}{\partial \textbf z^{(1)}}.\frac{\partial \textbf z^{(1)}}{\partial \textbf W^{(1)}}$
$\frac{\partial C_k}{\partial \textbf W^{(1)}}$ should have the same dimensions as the weight matrix of the same layer, which is $\textbf W^{(1)} = \begin{bmatrix}w_{00}^{(1)} & w_{01}^{(1)} & w_{02}^{(1)} & w_{03}^{(1)} \\w_{10}^{(1)} & w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)} \\w_{20}^{(1)} & w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)} \\\end{bmatrix}$ where each row holds the incoming weights of one neuron of layer 1 (one component of $\textbf a^{(1)}$).
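For concreteness, here is a small NumPy sketch of the forward pass with the shapes used above (4 inputs, 3 neurons in layer 1, 2 in layer 2); the sigmoid activation and the random values are just my assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x  = rng.standard_normal(4)        # input vector, shape (4,)
W1 = rng.standard_normal((3, 4))   # first-layer weights, one row per neuron
W2 = rng.standard_normal((2, 3))   # second-layer weights

z1 = W1 @ x          # pre-activations of layer 1, shape (3,)
a1 = sigmoid(z1)     # activations of layer 1, shape (3,)
z2 = W2 @ a1         # pre-activations of layer 2, shape (2,)
a2 = sigmoid(z2)     # output activations, shape (2,)
```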
I do not understand how to multiply or dot the matrices and vectors of the formula above (written out below) with each other so as to obtain the $\frac{\partial C_k}{\partial \textbf W^{(1)}}$ matrix:
$\frac{\partial C_k}{\partial \textbf W^{(1)}}= \begin{bmatrix}\frac{\partial C_k}{\partial a_0^{(2)}}\\ \frac{\partial C_k}{\partial a_1^{(2)}}\\\end{bmatrix}.\begin{bmatrix}\frac{\partial a_0^{(2)}}{\partial z_0^{(2)}}\\ \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}}\\\end{bmatrix}.\begin{bmatrix}\frac{\partial z_0^{(2)}}{\partial a_0^{(1)}} & \frac{\partial z_0^{(2)}}{\partial a_1^{(1)}} & \frac{\partial z_0^{(2)}}{\partial a_2^{(1)}}\\ \frac{\partial z_1^{(2)}}{\partial a_0^{(1)}} & \frac{\partial z_1^{(2)}}{\partial a_1^{(1)}} & \frac{\partial z_1^{(2)}}{\partial a_2^{(1)}} \\\end{bmatrix}.\begin{bmatrix}\frac{\partial a_0^{(1)}}{\partial z_0^{(1)}}\\ \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}}\\\frac{\partial a_2^{(1)}}{\partial z_2^{(1)}}\\\end{bmatrix}.\begin{bmatrix}\frac{\partial z_0^{(1)}}{\partial w_{00}^{(1)}} & \frac{\partial z_0^{(1)}}{\partial w_{01}^{(1)}} & \frac{\partial z_0^{(1)}}{\partial w_{02}^{(1)}} & \frac{\partial z_0^{(1)}}{\partial w_{03}^{(1)}}\\ \frac{\partial z_1^{(1)}}{\partial w_{10}^{(1)}} & \frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}} & \frac{\partial z_1^{(1)}}{\partial w_{12}^{(1)}} & \frac{\partial z_1^{(1)}}{\partial w_{13}^{(1)}}\\\frac{\partial z_2^{(1)}}{\partial w_{20}^{(1)}} & \frac{\partial z_2^{(1)}}{\partial w_{21}^{(1)}} & \frac{\partial z_2^{(1)}}{\partial w_{22}^{(1)}} & \frac{\partial z_2^{(1)}}{\partial w_{23}^{(1)}}\\\end{bmatrix}$
I figured it would be element-wise multiplication for the first three matrices and then dotting them with the fourth, but that would give me a 2-by-1 matrix that is not dottable with the last one. I would appreciate any advice on where my thinking goes wrong in this particular example. I have watched a dozen videos and read many web pages, but since each of them uses its own variables and notation (summation notation, different symbols, etc.), they have not really helped me so far and have only added to my confusion.
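One way I found to make all the shapes compose (which may be what I am missing) is to treat each $\frac{\partial \textbf a}{\partial \textbf z}$ factor as a diagonal Jacobian matrix rather than a column vector, and to note that the last factor collapses to an outer product with the input. A NumPy sketch under those assumptions, additionally assuming a sigmoid activation and the squared-error cost $C_k = \frac12\lVert\textbf a^{(2)} - \textbf y\rVert^2$ (the question does not fix either):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)   # input
y = rng.standard_normal(2)   # target
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((2, 3))

# forward pass
z1 = W1 @ x; a1 = sigmoid(z1)
z2 = W2 @ a1; a2 = sigmoid(z2)

# the factors of the chain, with da/dz as diagonal Jacobians
dC_da2  = a2 - y                   # row of partials, shape (2,)  [squared-error cost]
da2_dz2 = np.diag(a2 * (1 - a2))   # diagonal Jacobian, shape (2, 2)  [sigmoid']
dz2_da1 = W2                       # shape (2, 3)
da1_dz1 = np.diag(a1 * (1 - a1))   # diagonal Jacobian, shape (3, 3)

# (1x2)·(2x2)·(2x3)·(3x3) -> row vector of shape (3,)
delta1 = dC_da2 @ da2_dz2 @ dz2_da1 @ da1_dz1

# last factor: dz1_i/dw_ij = x_j, so the chain ends in an outer product
dC_dW1 = np.outer(delta1, x)       # shape (3, 4), same as W1
```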