I'm currently trying to create a neural network with 2 hidden layers from scratch.
- The input layer has 784 dimensions (MNIST dataset)
- The first hidden layer has 100 neurons with a sigmoid activation
- The second hidden layer has 10 neurons with a sigmoid activation
- The output layer has 10 possible outcomes (digits 0-9) with a softmax activation
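Concretely, the forward pass for this architecture can be sketched in NumPy as follows (a sketch with my own variable names, and with each bias folded into the weight matrix as a leading row, matching the $[1\ H]$ notation used below):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 784))           # one flattened MNIST image
W1 = 0.1 * rng.standard_normal((785, 100))  # +1 row in each matrix for the bias
W2 = 0.1 * rng.standard_normal((101, 10))
W3 = 0.1 * rng.standard_normal((11, 10))

aug = lambda a: np.hstack([np.ones((a.shape[0], 1)), a])  # a -> [1 a]

H1 = sigmoid(aug(x) @ W1)      # (1, 100)
H2 = sigmoid(aug(H1) @ W2)     # (1, 10)
y_hat = softmax(aug(H2) @ W3)  # (1, 10), rows sum to 1
```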
I easily computed the derivative of the loss with respect to the final (third) weight matrix: $$\frac{\partial L}{\partial W_3} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_3} \frac{\partial z_3}{\partial W_3} = [1\ H_2]^T(\hat{y} - y)$$
$\frac{\partial L}{\partial z_3} = (\hat{y} - y)$ is the partial derivative of the loss function with respect to the softmax input.
$\frac{\partial z3}{\partial z_3} = \frac{\partial}{\partial z_3} [1\ H_2]W_3 = [1\ H_2]^T$ is the partial derivative of the softmax input with respect to the weight matrix.
I checked the dimension, and since $[1\ H_2]^T \in \mathbb{R}^{11\times1}$ and $(\hat{y} - y) \in \mathbb{R}^{1\times10}$, $\frac{\partial L}{\partial W_3}\in \mathbb{R}^{11\times10}$, which has the same dimension as $W_3$. Thus, it would be possible to perform gradient descent.
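A quick NumPy sanity check of that shape (random placeholder values, one-hot label assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
H2 = rng.random((1, 10))               # second hidden layer output
y_hat = rng.random((1, 10))
y_hat /= y_hat.sum()                   # stand-in for a softmax output
y = np.zeros((1, 10)); y[0, 3] = 1.0   # one-hot label

H2_aug = np.hstack([np.ones((1, 1)), H2])  # [1 H2], shape (1, 11)
dL_dW3 = H2_aug.T @ (y_hat - y)            # (11, 1) @ (1, 10) -> (11, 10)
print(dL_dW3.shape)                        # matches W3's shape
```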
Next up, I want to compute the derivative of the loss function with respect to the second layer's weight matrix:
$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_3} \frac{\partial z_3}{\partial H_2} \frac{\partial H_2}{\partial z_2} \frac{\partial z_2}{\partial W_2}$$
- $\frac{\partial L}{\partial z_3} = (\hat{y} - y) \in \mathbb{R}^{1\times10}$
- $\frac{\partial z_3}{\partial H_2} = \frac{\partial}{\partial H_2} [1\ H_2]W_3=W_3^T \in \mathbb{R}^{10\times10}$
- $\frac{\partial H_2}{\partial z_2} = \frac{\partial}{\partial z_2} \sigma(z_2)=\sigma(z_2)(1-\sigma(z_2))=H_2(1-H_2) \in \mathbb{R}^{10\times1}$
- $\frac{\partial z_2}{\partial W_2} = \frac{\partial}{\partial W_2} [1\ H_1]W_2=[1\ H_1] \in \mathbb{R}^{1\times101}$
No matter what I try, I can't combine these derivatives via the chain rule into something of dimension $\mathbb{R}^{101\times10}$, matching $W_2$, which is needed for gradient descent. I feel like I'm missing something in my $\frac{\partial L}{\partial W_2}$. Could someone please give me some insights?
I found the answer; the derivative works out as shown below:
$$\frac{\partial L}{\partial W_2} = [1\ H_1]^T\left[\left((\hat{y}-y)W_3^T\right)\odot H_2(1-H_2)\right] \in \mathbb{R}^{101\times10}$$
where $\odot$ is the elementwise (Hadamard) product, $H_2(1-H_2)$ is treated as a $1\times10$ row vector, and $W_3^T$ excludes the bias row of $W_3$ (so it is $10\times10$, as above). This has the same dimensions as $W_2$, so it is possible to perform gradient descent.
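To double-check both the shape and the value, here is a NumPy sketch (my own variable names) that computes this gradient and compares one entry against a central finite difference, assuming a softmax + cross-entropy loss so that $\partial L/\partial z_3 = \hat{y}-y$; `W3[1:]` drops the bias row so the backpropagated term has shape $10\times10$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def aug(a):
    """Prepend the bias column of ones: a -> [1 a]."""
    return np.hstack([np.ones((a.shape[0], 1)), a])

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 784))
y = np.zeros((1, 10)); y[0, 3] = 1.0        # one-hot label
W1 = 0.1 * rng.standard_normal((785, 100))
W2 = 0.1 * rng.standard_normal((101, 10))
W3 = 0.1 * rng.standard_normal((11, 10))

def forward(W2):
    H1 = sigmoid(aug(x) @ W1)
    H2 = sigmoid(aug(H1) @ W2)
    y_hat = softmax(aug(H2) @ W3)
    return H1, H2, y_hat

H1, H2, y_hat = forward(W2)

delta3 = y_hat - y                            # dL/dz3, shape (1, 10)
delta2 = (delta3 @ W3[1:].T) * H2 * (1 - H2)  # (1, 10), Hadamard product
dL_dW2 = aug(H1).T @ delta2                   # (101, 10), matches W2

# Central finite-difference check of one gradient entry
loss = lambda yh: -np.sum(y * np.log(yh))     # cross-entropy loss
eps = 1e-6
W2p = W2.copy(); W2p[5, 7] += eps
W2m = W2.copy(); W2m[5, 7] -= eps
num = (loss(forward(W2p)[2]) - loss(forward(W2m)[2])) / (2 * eps)
print(np.isclose(num, dL_dW2[5, 7], atol=1e-6))
```

The elementwise product is the step the pure matrix chain rule obscures: the sigmoid acts coordinate-wise, so its Jacobian is the diagonal matrix $\mathrm{diag}(H_2(1-H_2))$, and multiplying by a diagonal matrix is the same as a Hadamard product with its diagonal.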