I'm trying to write out the calculations for backpropagation, but I'm having trouble getting the final answer: I believe I should be getting something similar to $-(y - \sigma(w \cdot x + b))\sigma'(w \cdot x + b)$. I have checked the backpropagation questions on this site, but my question deals specifically with expanding the terms via the chain rule.
I have some cost function $C$, for which I'm using the mean squared error, therefore: $C = \frac{1}{2n}\sum_{i=1}^{n}(y_i-\hat{y_i})^2$
I'm using sigmoid activation for my neurons, so the activation of a layer is given by $a = \sigma(w \cdot x + b)$, where $w$ is the weight matrix and $x$ the input, with shapes compatible for multiplication.
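As a concrete reference for the setup above, here is a minimal sketch of the forward pass in NumPy (the shapes and values are my own illustrative assumptions, not from the question); note that the sigmoid's derivative can be written in terms of the sigmoid itself, $\sigma'(z) = \sigma(z)(1-\sigma(z))$:

```python
import numpy as np

def sigmoid(z):
    # logistic function, squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# hypothetical single-layer forward pass: a = sigma(w . x + b)
w = np.array([[0.5, -0.3]])    # 1x2 weight matrix (illustrative values)
x = np.array([[1.0], [2.0]])   # 2x1 input column vector
b = 0.1
net = w @ x + b                # weighted input, "net" in the chain rule below
a = sigmoid(net)               # layer activation
```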
To make the network "learn", the total cost/error of the network needs to be minimized, so the weights $w$ and biases $b$ of each neuron in the network are updated using gradient descent. Gradient descent requires the gradient of the cost function, which means the cost has to be differentiated with respect to the weights and biases. I've done the following steps so far for the weights:
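The update rule described above can be sketched in a few lines; the learning rate and the placeholder gradient values here are my own assumptions for illustration:

```python
import numpy as np

def gradient_step(w, b, dC_dw, dC_db, lr=0.1):
    # move each parameter opposite its gradient to reduce the cost
    return w - lr * dC_dw, b - lr * dC_db

w, b = np.array([0.5, -0.3]), 0.1
dC_dw, dC_db = np.array([0.2, -0.1]), 0.05  # placeholder gradients
w, b = gradient_step(w, b, dC_dw, dC_db)
```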
$C = \frac{1}{2n}\sum_{i=1}^{n}(y_i-\hat{y_i})^2$
$\frac{\partial C}{\partial w} = \frac{1}{2n}\sum_{i=1}^{n}\frac{\partial}{\partial w}(y_i-\hat{y_i})^2$
$= \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y_i})\frac{\partial}{\partial w}(y_i-\hat{y_i})$
$= \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y_i})\left(\frac{\partial}{\partial w}y_i-\frac{\partial}{\partial w}\hat{y_i}\right)$
$= \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y_i})\left(-\frac{\partial}{\partial w}\hat{y_i}\right)$ (since the target $y_i$ does not depend on $w$)
$= -\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y_i})\frac{\partial}{\partial w}\hat{y_i}$
This is where I get stuck. I understand that the chain rule lets me write $\frac{\partial C}{\partial w} = \frac{\partial C}{\partial a}\frac{\partial a}{\partial net}\frac{\partial net}{\partial w}$, where $net = w \cdot x + b$ and $a = \hat{y_i}$, and that the middle factor is $\frac{\partial a}{\partial net} = \sigma'(net) = \sigma(w \cdot x + b)(1 - \sigma(w \cdot x + b))$, but how do I complete the final derivations/substitutions and obtain $-(y - \sigma(w \cdot x + b))\sigma'(w \cdot x + b)$? Would this answer also apply to the derivative with respect to the bias?
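One way to sanity-check the target expression (and the bias case) without finishing the algebra is a finite-difference gradient check. The sketch below assumes a single scalar sample ($n = 1$, so $C = \frac{1}{2}(y - \sigma(wx+b))^2$) and compares the expression $-(y - \sigma(net))\sigma'(net)$, times $\frac{\partial net}{\partial w} = x$ for the weight and $\frac{\partial net}{\partial b} = 1$ for the bias, against numerical derivatives of the cost; the specific values of $w$, $b$, $x$, $y$ are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, x, y):
    # single-sample squared error: C = 1/2 * (y - sigma(w*x + b))^2
    return 0.5 * (y - sigmoid(w * x + b)) ** 2

def grads(w, b, x, y):
    # analytic gradients from the candidate expression:
    # delta = -(y - sigma(net)) * sigma'(net), with sigma'(net) = s*(1-s)
    s = sigmoid(w * x + b)
    delta = -(y - s) * s * (1.0 - s)
    return delta * x, delta          # dC/dw (extra factor x), dC/db

w, b, x, y, eps = 0.4, -0.2, 1.5, 1.0, 1e-6
# central finite differences of the cost w.r.t. w and b
dw_num = (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)
db_num = (cost(w, b + eps, x, y) - cost(w, b - eps, x, y)) / (2 * eps)
dw, db = grads(w, b, x, y)
```

If the analytic and numerical values agree, the candidate expression is correct, including the bias case (which simply drops the factor of $x$).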