I recently came across this page.
I am trying to derive the equation the author labels BP4.
In particular, I want to obtain BP4 for a feedforward neural network consisting of three layers (an input layer, a single hidden layer, and an output layer) with 2 neurons, 3 neurons, and 1 neuron respectively.
The weight matrix from the input layer to the hidden layer is $$w^{L-1} = \begin{bmatrix}w^{L-1}_{11}&w^{L-1}_{12} \\ w^{L-1}_{21}&w^{L-1}_{22} \\ w^{L-1}_{31}&w^{L-1}_{32} \end{bmatrix}$$
On the other hand, $\hspace{2mm}a^{L-2} = \begin{bmatrix}x_1&x_2 \end{bmatrix}^T$, that is, the input vector.
Let us denote by the symbol $\times$ the matrix multiplication.
Well then, given this setup, the following equation holds: $$\frac{\partial C}{\partial {w^{L-1}}} = \delta^{L-1} \times \frac{\partial {(w^{L-1} \times a^{L-2} + b^{L-1} )}}{\partial(w^{L-1})}$$
The term $b^{L-1}$ is the bias of the hidden layer. Frankly, it doesn't matter because $\hspace{2mm}\frac{\partial {(w^{L-1} \times a^{L-2} + b^{L-1} )}}{\partial(w^{L-1})} = \frac{\partial {(w^{L-1} \times a^{L-2})}}{\partial(w^{L-1})}$
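To make the setup concrete, the shapes so far can be checked numerically. This is just a sketch with arbitrary placeholder values for the weights, inputs, and biases (they are symbolic in the question), using NumPy:

```python
import numpy as np

# Placeholder numeric values, only to verify shapes.
w = np.arange(6.0).reshape(3, 2)   # w^{L-1}: weights, shape (3, 2)
a = np.array([[1.0], [2.0]])       # a^{L-2} = [x1, x2]^T, shape (2, 1)
b = np.zeros((3, 1))               # b^{L-1}: hidden-layer bias, shape (3, 1)

z = w @ a + b                      # weighted input to the hidden layer
print(z.shape)                     # (3, 1): one entry per hidden neuron
```

So $w^{L-1} \times a^{L-2} + b^{L-1}$ is a $3\times 1$ vector, one component per hidden neuron.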
Now, since $w^{L-1}$ is a $3\times 2$ matrix and $a^{L-2}$ is a $2\times 1$ vector, we are dealing with the derivative of a matrix-vector product with respect to a matrix. I don't know much about matrix calculus but, if I'm not mistaken, $$\frac{\partial {(w^{L-1} \times a^{L-2} )}}{\partial(w^{L-1})} = (a^{L-2})^T \otimes I_{2\times 2} \hspace{7mm} \text{where } \otimes \text{ denotes the Kronecker product.}$$
This derivative yields a $2\times 4$ matrix.
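That shape can be confirmed numerically (again with placeholder values for the input, just to check dimensions):

```python
import numpy as np

a = np.array([[1.0], [2.0]])       # a^{L-2}, shape (2, 1)
d = np.kron(a.T, np.eye(2))        # (a^{L-2})^T Kronecker I_{2x2}
print(d.shape)                     # (2, 4): (1*2) rows, (2*2) columns
```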
The symbol $\delta^{L-1}$ refers to the error committed by the neurons of the hidden layer. Since there are 3 neurons in that layer, $\delta^{L-1}$ is a $1\times 3$ vector.
I must have made some mistake up to this point, because $\delta^{L-1}$ and $(a^{L-2})^T \otimes I_{2\times 2}$ are non-conformable for multiplication. Yet I fail to see what is amiss.
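The mismatch shows up immediately if one tries the product numerically (placeholder values for $\delta^{L-1}$ and $a^{L-2}$, only the shapes matter):

```python
import numpy as np

delta = np.ones((1, 3))                          # delta^{L-1}, shape (1, 3)
d = np.kron(np.array([[1.0, 2.0]]), np.eye(2))   # shape (2, 4)

try:
    delta @ d                                    # (1, 3) x (2, 4): inner dims 3 != 2
except ValueError as e:
    print("non-conformable:", e)
```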