I'm currently trying to create a neural network with 2 hidden layers from scratch.
- The input layer has 784 dimensions (MNIST dataset)
- The first hidden layer has 100 neurons with a sigmoid activation
- The second hidden layer has 10 neurons with a sigmoid activation
- The output layer has 10 possible outcomes (digits 0-9) with a softmax activation
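Concretely, the forward pass for this architecture can be sketched in NumPy as follows (a sketch with my own variable names, and with each bias folded into the weight matrix as a leading row, matching the $[1\ H]$ notation used below):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 784))           # one flattened MNIST image
W1 = 0.1 * rng.standard_normal((785, 100))  # +1 row in each matrix for the bias
W2 = 0.1 * rng.standard_normal((101, 10))
W3 = 0.1 * rng.standard_normal((11, 10))

aug = lambda a: np.hstack([np.ones((a.shape[0], 1)), a])  # a -> [1 a]

H1 = sigmoid(aug(x) @ W1)      # (1, 100)
H2 = sigmoid(aug(H1) @ W2)     # (1, 10)
y_hat = softmax(aug(H2) @ W3)  # (1, 10), rows sum to 1
```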
I easily computed the derivative of the loss with respect to the final (third) weight matrix: $$\frac{\partial L}{\partial W_3} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_3} \frac{\partial z_3}{\partial W_3} = [1\ H_2]^T(\hat{y} - y)$$
$\frac{\partial L}{\partial z_3} = (\hat{y} - y)$ is the partial derivative of the loss function with respect to the softmax input.
$\frac{\partial z3}{\partial z_3} = \frac{\partial}{\partial z_3} [1\ H_2]W_3 = [1\ H_2]^T$ is the partial derivative of the softmax input with respect to the weight matrix.
I checked the dimension, and since $[1\ H_2]^T \in \mathbb{R}^{11\times1}$ and $(\hat{y} - y) \in \mathbb{R}^{1\times10}$, $\frac{\partial L}{\partial W_3}\in \mathbb{R}^{11\times10}$, which has the same dimension as $W_3$. Thus, it would be possible to perform gradient descent.
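A quick NumPy sanity check of that shape (random placeholder values, one-hot label assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
H2 = rng.random((1, 10))               # second hidden layer output
y_hat = rng.random((1, 10))
y_hat /= y_hat.sum()                   # stand-in for a softmax output
y = np.zeros((1, 10)); y[0, 3] = 1.0   # one-hot label

H2_aug = np.hstack([np.ones((1, 1)), H2])  # [1 H2], shape (1, 11)
dL_dW3 = H2_aug.T @ (y_hat - y)            # (11, 1) @ (1, 10) -> (11, 10)
print(dL_dW3.shape)                        # matches W3's shape
```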
Next up, I want to compute the derivative of the loss function with respect to the second layer's weight matrix:
$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_3} \frac{\partial z_3}{\partial H_2} \frac{\partial H_2}{\partial z_2} \frac{\partial z_2}{\partial W_2}$$
- $\frac{\partial L}{\partial z_3} = (\hat{y} - y) \in \mathbb{R}^{1\times10}$
- $\frac{\partial z_3}{\partial H_2} = \frac{\partial}{\partial H_2} [1\ H_2]W_3=W_3^T \in \mathbb{R}^{10\times10}$
- $\frac{\partial H_2}{\partial z_2} = \frac{\partial}{\partial z_2} \sigma(z_2)=\sigma(z_2)(1-\sigma(z_2))=H_2(1-H_2) \in \mathbb{R}^{10\times1}$
- $\frac{\partial z_2}{\partial W_2} = \frac{\partial}{\partial W_2} [1\ H_1]W_2=[1\ H_1] \in \mathbb{R}^{1\times101}$
No matter what I try, I can't combine these derivatives via the chain rule into something of dimension $\mathbb{R}^{101\times10}$, matching $W_2$, which is needed for gradient descent. I feel like I'm missing something in my $\frac{\partial L}{\partial W_2}$. Could someone please give me some insights?
I found the answer; the derivative works out as shown below:
$$\frac{\partial L}{\partial W_2} = [1\ H_1]^T\left[\left((\hat{y}-y)W_3^T\right)\odot H_2(1-H_2)\right] \in \mathbb{R}^{101\times10}$$
where $\odot$ is the elementwise (Hadamard) product, $H_2(1-H_2)$ is treated as a $1\times10$ row vector, and $W_3^T$ excludes the bias row of $W_3$ (so it is $10\times10$, as above). This has the same dimensions as $W_2$, so it is possible to perform gradient descent.
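To double-check both the shape and the value, here is a NumPy sketch (my own variable names) that computes this gradient and compares one entry against a central finite difference, assuming a softmax + cross-entropy loss so that $\partial L/\partial z_3 = \hat{y}-y$; `W3[1:]` drops the bias row so the backpropagated term has shape $10\times10$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def aug(a):
    """Prepend the bias column of ones: a -> [1 a]."""
    return np.hstack([np.ones((a.shape[0], 1)), a])

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 784))
y = np.zeros((1, 10)); y[0, 3] = 1.0        # one-hot label
W1 = 0.1 * rng.standard_normal((785, 100))
W2 = 0.1 * rng.standard_normal((101, 10))
W3 = 0.1 * rng.standard_normal((11, 10))

def forward(W2):
    H1 = sigmoid(aug(x) @ W1)
    H2 = sigmoid(aug(H1) @ W2)
    y_hat = softmax(aug(H2) @ W3)
    return H1, H2, y_hat

H1, H2, y_hat = forward(W2)

delta3 = y_hat - y                            # dL/dz3, shape (1, 10)
delta2 = (delta3 @ W3[1:].T) * H2 * (1 - H2)  # (1, 10), Hadamard product
dL_dW2 = aug(H1).T @ delta2                   # (101, 10), matches W2

# Central finite-difference check of one gradient entry
loss = lambda yh: -np.sum(y * np.log(yh))     # cross-entropy loss
eps = 1e-6
W2p = W2.copy(); W2p[5, 7] += eps
W2m = W2.copy(); W2m[5, 7] -= eps
num = (loss(forward(W2p)[2]) - loss(forward(W2m)[2])) / (2 * eps)
print(np.isclose(num, dL_dW2[5, 7], atol=1e-6))
```

The elementwise product is the step the pure matrix chain rule obscures: the sigmoid acts coordinate-wise, so its Jacobian is the diagonal matrix $\mathrm{diag}(H_2(1-H_2))$, and multiplying by a diagonal matrix is the same as a Hadamard product with its diagonal.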