I am learning about back-prop right now and in the course I am taking, the iterative back-propagation algorithm is given by:
$$\delta^{(L)}=\hat{y}^{(i)}-y^{(i)}$$ $$\delta^{(l)}=(\Theta^{(l)})^T\delta^{(l+1)}\circ \operatorname{sigmoid}'(z^{(l)})$$ $$\Delta^{(l)}=\sum^{m}_{i=1}\delta^{(l+1),(i)}\,(a^{(l),(i)})^T$$ (assuming vanilla gradient descent)
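To make my understanding concrete, here is a minimal NumPy sketch of those formulas as I read them (the layer sizes and data are made up for illustration; biases are deliberately omitted, which is exactly the gap I am asking about):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Made-up shapes: 3 inputs, 4 hidden units, 2 outputs, m = 5 examples.
rng = np.random.default_rng(0)
m = 5
Theta1 = rng.standard_normal((4, 3))  # layer-1 weights (no bias column here)
Theta2 = rng.standard_normal((2, 4))  # layer-2 weights
X = rng.standard_normal((3, m))       # inputs, one column per example
Y = rng.random((2, m))                # targets

# Forward pass
z2 = Theta1 @ X
a2 = sigmoid(z2)
z3 = Theta2 @ a2
y_hat = sigmoid(z3)

# Backward pass, following the formulas above
delta3 = y_hat - Y                                # delta^(L) = y_hat - y
delta2 = (Theta2.T @ delta3) * sigmoid_prime(z2)  # delta^(l)

# Accumulated gradients, summed over the m examples (columns)
Delta2 = delta3 @ a2.T  # gradient w.r.t. Theta2
Delta1 = delta2 @ X.T   # gradient w.r.t. Theta1
```

With the output deltas defined as $\hat{y}-y$ (i.e. a cross-entropy loss on sigmoid outputs), `Delta1` and `Delta2` match a finite-difference gradient check, but nothing here produces a gradient for the biases.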
However, I was never clear on how back-propagation handles the bias nodes. I have generally treated (perhaps wrongly) each bias unit as an extra node with no incoming connections: its "activation" is fixed at 1, and the outgoing weights control its contribution, which is how I assumed the network learns the bias. But when working out $\delta$ in such a network, which is not fully interconnected, there is a dimension mismatch between $\delta^{(l+1)}$ (of length $s_{l+1}$) and $(\Theta^{(l)})^T$.
I am assuming it is better to simply add the layer's bias vector to the activations, but I was unable to find formulas for computing the gradient with respect to this additional term (since, under this interpretation, it does not depend on the weights).
If someone could outline how to update the biases within the framework described above, that would be great. Thanks.