I tried taking the derivative of a neural network's sigmoid activation below, but I am getting a slightly different answer and I'm not sure why. I am trying to follow this blog's derivation: https://selbydavid.com/2018/01/09/neural-network/
I would like to take the derivative of the following with respect to $W_{out}$
$\hat y = \sigma(HW_{out}) $
where $\sigma$ is the sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$, applied element-wise.
Note: $H$ is an $n \times 6$ matrix and $W_{out}$ is a $6 \times 1$ vector, so $\hat y$ is an $n \times 1$ vector. This led me to expect the derivative w.r.t. $W_{out}$ to also be $n \times 1$.
After trying to calculate the derivative $\frac{\partial}{\partial W_{out}} \sigma(HW_{out})$, I ended up with:
$\frac{\partial}{\partial W_{out}} \sigma(HW_{out}) = \sigma(HW_{out})(1-\sigma(HW_{out}))H$
However, the correct answer should've been:
$\frac{\partial}{\partial W_{out}} \sigma(HW_{out}) = H^T\sigma(HW_{out})(1-\sigma(HW_{out}))$
I don't really understand where the $H^T$ came from. I would greatly appreciate it if someone could walk me through this step by step. If it helps, I can post my hand-written derivation.
Let $h(W)=HW$, so $Dh = H$. The chain rule for gradients says $\nabla(f\circ h)(W) = (Dh)^T\,\nabla f(h(W))$, hence $$ \nabla(\sigma\circ h) = (Dh)^T \nabla\sigma = H^T \nabla\sigma, $$ where $\nabla\sigma$ stands for the element-wise derivative $\sigma(HW_{out})(1-\sigma(HW_{out}))$. The transpose appears because the gradient is the transpose of the Jacobian: the Jacobian of the composition is (row-vector convention) $\nabla\sigma^T H$, and transposing it reverses the order of the factors.
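One way to see this concretely: take the scalar function $f(W)=\sum_i \sigma(HW)_i$, whose gradient is exactly $H^T\,[\sigma(HW)\odot(1-\sigma(HW))]$ (a $6 \times 1$ vector, not $n \times 1$), and compare it against a finite-difference approximation. A minimal NumPy sketch (the shapes follow the question; the random data and variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
H = rng.normal(size=(n, 6))   # n x 6, as in the question
W = rng.normal(size=(6, 1))   # 6 x 1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Scalar function f(W) = sum of sigmoid(HW); its gradient is 6 x 1.
def f(W):
    return sigmoid(H @ W).sum()

# Analytic gradient from the H^T form in the answer above.
s = sigmoid(H @ W)
analytic = H.T @ (s * (1 - s))          # shape (6, 1)

# Central finite differences for comparison.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    Wp = W.copy(); Wp[i] += eps
    Wm = W.copy(); Wm[i] -= eps
    numeric[i] = (f(Wp) - f(Wm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

Note that $\sigma(HW)(1-\sigma(HW))$ is an element-wise product of vectors here, not a matrix product, which is why `s * (1 - s)` rather than `@` appears in the code.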