Trouble with taking the derivative for neural network


I tried taking the derivative of a neural network's sigmoid output below, but I am getting a slightly different answer and I'm not sure why. I am trying to follow this blog's derivation: https://selbydavid.com/2018/01/09/neural-network/

I would like to take the derivative of the following with respect to $W_{out}$

$\hat y = \sigma(HW_{out}) $

where $\sigma$ is the sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$.
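Both expressions below rely on the standard identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$; a quick numerical sanity check (a sketch, with the test points and step size chosen here purely for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
eps = 1e-6
# Centered finite difference approximates sigma'(x).
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
# The claimed closed form sigma(x) * (1 - sigma(x)).
analytic = sigmoid(x) * (1 - sigmoid(x))
print(np.max(np.abs(numeric - analytic)))  # close to zero
```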

Note: $H$ is an $n \times 6$ matrix and $W_{out}$ is a $6 \times 1$ vector, so $\hat y$ is an $n \times 1$ vector. This made me think the derivative w.r.t. $W_{out}$ should also be an $n \times 1$ vector.
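The shapes can be checked directly in NumPy; here $n = 4$ and the random values are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 4                                # arbitrary batch size for illustration
H = rng.standard_normal((n, 6))      # n x 6 hidden activations
W_out = rng.standard_normal((6, 1))  # 6 x 1 output weights
y_hat = sigmoid(H @ W_out)           # n x 1 predictions
print(y_hat.shape)  # (4, 1)
```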

After trying to calculate the derivative $\frac{\partial}{\partial W_{out}} \sigma(HW_{out})$, I ended up with:

$\frac{\partial}{\partial W_{out}} \sigma(HW_{out}) = \sigma(HW_{out})(1-\sigma(HW_{out}))H$

However, the correct answer should've been:

$\frac{\partial}{\partial W_{out}} \sigma(HW_{out}) = H^T\sigma(HW_{out})(1-\sigma(HW_{out}))$

I don't really understand where the $H^T$ comes from. I would greatly appreciate it if someone could walk me through this step by step. If it helps, I can post my hand-written derivation.

1 Answer
Let $h(W)=HW$. Since $h$ is linear, its Jacobian is simply $Dh = H$. The chain rule for gradients says that the Jacobian of the inner map enters transposed: $$ \nabla(\sigma\circ h) = (Dh)^T\, (\nabla\sigma\circ h) = H^T\, (\nabla\sigma\circ h). $$ With $\nabla\sigma$ evaluated elementwise as $\sigma(HW)(1-\sigma(HW))$, this is exactly the expression in your blog, and the $H^T$ is nothing more than the transposed Jacobian of the inner linear map.
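To make the $H^T$ concrete, consider the scalar function $f(W_{out}) = \sum_i \sigma(HW_{out})_i$; its gradient is $H^T\,\sigma(HW_{out})(1-\sigma(HW_{out}))$, a $6 \times 1$ vector, which can be confirmed with finite differences. A sketch (the shapes, seed, and step size are arbitrary choices for this check):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n = 5
H = rng.standard_normal((n, 6))   # n x 6
W = rng.standard_normal((6, 1))   # 6 x 1

# Scalar function: sum of the n sigmoid outputs.
def f(W):
    return sigmoid(H @ W).sum()

# Analytic gradient from the chain rule: H^T sigma'(HW), a 6 x 1 vector.
s = sigmoid(H @ W)
grad = H.T @ (s * (1 - s))

# Finite-difference check of each of the 6 components.
eps = 1e-6
num = np.zeros_like(W)
for j in range(6):
    Wp, Wm = W.copy(), W.copy()
    Wp[j, 0] += eps
    Wm[j, 0] -= eps
    num[j, 0] = (f(Wp) - f(Wm)) / (2 * eps)

print(np.max(np.abs(grad - num)))  # small
```

Note that without the transpose, $\sigma'(HW)H$ would be an $n \times 6$ object (the full Jacobian of $\hat y$), which cannot be the gradient of a scalar with respect to the $6 \times 1$ vector $W_{out}$.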