I am considering a simple $2$-layer network with $m$ training pairs $(x^i, y^i)$ whose cost function is
$$\ell(w,\alpha, \beta) := \sum_{i=1}^{m} \left( y^i - \sigma(w^Tz^i) \right)^2$$
where
$$\sigma(x) := \frac{1}{1+e^{-x}}$$
is the sigmoid function, and $z^i := (z_1^i, z_2^i)$ with $z_1^i = \sigma(\alpha^T x^i)$ and $z_2^i = \sigma(\beta^T x^i)$. The gradient is
$$\nabla_w \ell(w, \alpha, \beta) = - \sum_{i=1}^m 2(y^i - \sigma(u^i))\sigma(u^i)(1-\sigma(u^i)) z^i$$
where $u^i := w^Tz^i$.
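For concreteness, the setup above can be sketched in NumPy (the dimensions and random data below are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, m = 3, 5                      # input dimension and number of samples (arbitrary)
X = rng.standard_normal((m, d))  # row i is x^i
y = rng.standard_normal(m)       # targets y^i
alpha = rng.standard_normal(d)
beta = rng.standard_normal(d)
w = rng.standard_normal(2)

# hidden layer: z^i = (sigma(alpha^T x^i), sigma(beta^T x^i)), stacked as rows
Z = np.column_stack([sigmoid(X @ alpha), sigmoid(X @ beta)])  # shape (m, 2)

# cost: ell(w, alpha, beta) = sum_i (y^i - sigma(w^T z^i))^2
u = Z @ w
cost = np.sum((y - sigmoid(u)) ** 2)
print(Z.shape, cost)
```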
I am trying to derive this gradient by differentiating the cost function, but after applying the chain rule I get stuck at the term $\sigma(w^Tz^i)$. Could anyone step me through it, please?
The cost function was
$$\ell(w,\alpha, \beta) := \sum_{i=1}^{m} \left( y^i - \sigma(w^Tz^i) \right)^2$$
Now take the gradient with respect to $w$; the outer square contributes a factor $2(y^i - \sigma(w^Tz^i))$, and the inner term $-\sigma(w^Tz^i)$ contributes the minus sign:
$$\nabla_w \ell = -2 \sum_{i=1}^{m} \left( y^i - \sigma(w^Tz^i) \right)D(\sigma(w^Tz^i))$$
We note that $$D(\sigma(w^Tz^i))=\sigma(w^Tz^i)(1-\sigma(w^Tz^i))D(w^Tz^i)$$ $$=\sigma(w^Tz^i)(1-\sigma(w^Tz^i))z^i$$
And done...
(Note that we used the usual identity for the sigmoid: its derivative is $\sigma(x)(1-\sigma(x))$.)
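As a sanity check, the derived formula can be compared against central finite differences; here is a minimal NumPy sketch (the data below are random placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(w, Z, y):
    # ell(w) = sum_i (y^i - sigma(w^T z^i))^2, with z^i as rows of Z
    return np.sum((y - sigmoid(Z @ w)) ** 2)

def grad_w(w, Z, y):
    # analytic gradient: -sum_i 2 (y^i - sigma(u^i)) sigma(u^i)(1 - sigma(u^i)) z^i
    s = sigmoid(Z @ w)
    return -2.0 * Z.T @ ((y - s) * s * (1.0 - s))

rng = np.random.default_rng(1)
m = 4
Z = rng.standard_normal((m, 2))
y = rng.standard_normal(m)
w = rng.standard_normal(2)

# central finite differences along each coordinate of w
eps = 1e-6
num = np.array([
    (loss(w + eps * e, Z, y) - loss(w - eps * e, Z, y)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(num, grad_w(w, Z, y), atol=1e-6))
```

If the sign in the chain-rule step were dropped, this check would fail, which makes it a quick way to catch exactly the kind of error asked about.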