I recently came across this page.
I am trying to derive the equation the author labels BP4.
In particular, I want to obtain BP4 for a feedforward neural network consisting of three layers (an input layer, a single hidden layer, and an output layer) with 2 neurons, 3 neurons, and 1 neuron respectively.
The weight matrix from the input layer to the hidden layer is $$w^{L-1} = \begin{bmatrix}w^{L-1}_{11}&w^{L-1}_{12} \\ w^{L-1}_{21}&w^{L-1}_{22} \\ w^{L-1}_{31}&w^{L-1}_{32} \end{bmatrix}$$
On the other hand, $\hspace{2mm}a^{L-2} = \begin{bmatrix}x_1&x_2 \end{bmatrix}^T$, that is, the input vector.
Let us denote by the symbol $\times$ the matrix multiplication.
Well then, given this setup, the following equation holds: $$\frac{\partial C}{\partial {w^{L-1}}} = \delta^{L-1} \times \frac{\partial {(w^{L-1} \times a^{L-2} + b^{L-1} )}}{\partial(w^{L-1})}$$
The term $b^{L-1}$ is the bias of the hidden layer. Frankly, it doesn't matter because $\hspace{2mm}\frac{\partial {(w^{L-1} \times a^{L-2} + b^{L-1} )}}{\partial(w^{L-1})} = \frac{\partial {(w^{L-1} \times a^{L-2})}}{\partial(w^{L-1})}$
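To make the setup concrete, the shapes so far can be checked numerically. This is just a sketch with arbitrary placeholder values for the weights, inputs, and biases (they are symbolic in the question), using NumPy:

```python
import numpy as np

# Placeholder numeric values, only to verify shapes.
w = np.arange(6.0).reshape(3, 2)   # w^{L-1}: weights, shape (3, 2)
a = np.array([[1.0], [2.0]])       # a^{L-2} = [x1, x2]^T, shape (2, 1)
b = np.zeros((3, 1))               # b^{L-1}: hidden-layer bias, shape (3, 1)

z = w @ a + b                      # weighted input to the hidden layer
print(z.shape)                     # (3, 1): one entry per hidden neuron
```

So $w^{L-1} \times a^{L-2} + b^{L-1}$ is a $3\times 1$ vector, one component per hidden neuron.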
Now, since $w^{L-1}$ is a $3\times 2$ matrix and $a^{L-2}$ is a $2\times 1$ vector, we are dealing with the derivative of a matrix-vector product with respect to a matrix. I don't know much about matrix calculus but, if I'm not mistaken, $$\frac{\partial {(w^{L-1} \times a^{L-2} )}}{\partial(w^{L-1})} = (a^{L-2})^T \otimes I_{2\times 2} \hspace{7mm} \text{where } \otimes \text{ denotes the Kronecker product.}$$
This derivative yields a $2\times 4$ matrix.
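That shape can be confirmed numerically (again with placeholder values for the input, just to check dimensions):

```python
import numpy as np

a = np.array([[1.0], [2.0]])       # a^{L-2}, shape (2, 1)
d = np.kron(a.T, np.eye(2))        # (a^{L-2})^T Kronecker I_{2x2}
print(d.shape)                     # (2, 4): (1*2) rows, (2*2) columns
```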
The symbol $\delta^{L-1}$ refers to the error committed by the neurons of the hidden layer. Since there are 3 neurons in that layer, $\delta^{L-1}$ is a $1\times 3$ vector.
I must have made some mistake up to this point, because $\delta^{L-1}$ and $(a^{L-2})^T \otimes I_{2\times 2}$ are non-conformable for multiplication. Yet I fail to see what is amiss.
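The mismatch shows up immediately if one tries the product numerically (placeholder values for $\delta^{L-1}$ and $a^{L-2}$, only the shapes matter):

```python
import numpy as np

delta = np.ones((1, 3))                          # delta^{L-1}, shape (1, 3)
d = np.kron(np.array([[1.0, 2.0]]), np.eye(2))   # shape (2, 4)

try:
    delta @ d                                    # (1, 3) x (2, 4): inner dims 3 != 2
except ValueError as e:
    print("non-conformable:", e)
```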