Question: Why is the squared error most often used for training neural networks?
Context:
- Neural networks are trained by adjusting the link weights.
- The key factor that informs these adjustments is how "wrong" the untrained network is.
- The actual output and the desired output (training examples) will be different in an untrained network.
- This difference (target - actual) is the error. There is an error for each output layer node.
Learning:
- The link weights are adjusted so that the overall error of the network is minimised for each training example.
Most guides use a cost function $Cost=\sum_{output\ nodes}{(target-actual)^2}$ and minimise it. They often don't give a reason for this choice, but when they do, they say:
- the $(target-actual)^2$ is always positive, so errors of different signs don't cancel out and misrepresent how "wrong" the network is,
- the cost function is differentiable, so we can work out the sensitivity of the error to individual weights, the partial differential $\frac{\delta{Cost}}{\delta{w}}$ (a worked form of this derivative is shown just after this list),
- it parallels linear regression, where errors are assumed to be Gaussian, that is, normally distributed around the true value.
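For reference, here is what that partial differential looks like for the squared-error cost, applying the chain rule to a single output node $j$ (a standard result, written with the same notation used in the derivation further down, where $o_j = sigmoid(\sum_i{w_i \cdot x_i})$ and $w_k$ is a weight on a link into node $j$, so the terms for the other output nodes don't depend on it):

$$\frac{\delta{Cost}}{\delta{w_k}} = \frac{\delta{}}{\delta{w_k}}(t_j - o_j)^2 = -2\,(t_j - o_j)\,o_j(1 - o_j)\,x_k$$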
Challenge:
- We don't need the cost function to always be positive, because we don't actually sum over the errors from each node. Instead we consider each output node in turn when we propagate the errors back. The errors at the output nodes are independent of each other; a small sketch after this list illustrates the point.
- You can find $\frac{\delta{Error}}{\delta{w}}$ from the simpler $Error=(target-actual)$. There is no problem deriving the partial differentials from this $Error$ expression. Instead of trying to minimise it, we move it towards zero from either direction.
- There is no reason to believe the errors in the output of a neural network are distributed normally.
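To make the first point concrete, here is a minimal numpy sketch (the layer sizes, weights, and targets are made-up illustrative values, not anything from the question): the output-node errors live in separate elements of a vector and are never collapsed into one number, and each node's error and gradient row are handled independently when the errors are propagated back, so errors of different signs cannot cancel.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative toy layer: 3 preceding nodes feeding 2 output nodes.
x = np.array([0.4, -0.1, 0.7])        # outputs of the preceding layer
W = np.array([[0.2, -0.3, 0.5],       # W[j, i] = weight of the ith link into output node j
              [0.1,  0.4, -0.2]])
t = np.array([0.9, 0.2])              # target outputs

o = sigmoid(W @ x)                    # actual outputs, one per output node
e = t - o                             # one error per output node; never summed

# d e_j / d w_{j,i} for every link: adjusting row j of W only ever
# involves e[j]; the other nodes' errors never enter.
grad = -(o * (1.0 - o))[:, None] * x[None, :]
print(e)
print(grad)
```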
Derivation of Simpler Weight Learning:
- The error at the $j$th output node is (target - actual), or $e_j = (t_j - o_j)$.
- The actual output is $o_j = sigmoid(\sum_i{w_i \cdot x_i})$, where $w_i$ is the weight of the $i$th link into that $j$th node, and $x_i$ is the output from the preceding $i$th node. Sigmoid is the popular squashing function $\frac{1}{1+e^{-x}}$.
- So $\frac{\delta{e_j}}{\delta{w_k}} = \frac{\delta{}}{\delta{w_k}}(t_j - o_j) = -\frac{\delta{}}{\delta{w_k}}\,sigmoid(\sum_i{w_i \cdot x_i})$ because we can ignore the constant $t_j$. Only the $i=k$ term of the sum depends on $w_k$, so differentiating the sum with respect to $w_k$ leaves just $x_k$.
- Applying the chain rule, and because we know how to differentiate the sigmoid (its derivative is stated just below), that leaves us with $\frac{\delta{e_j}}{\delta{w_k}} = -\,sigmoid(\sum_i{w_i \cdot x_i})\,(1 - sigmoid(\sum_i{w_i \cdot x_i})) \cdot x_k = -\,o_j(1 - o_j) \cdot x_k$.
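For reference, the sigmoid derivative used in that last step:

$$\frac{d}{dx}\,sigmoid(x) = sigmoid(x)\,\big(1 - sigmoid(x)\big)$$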
So we have an expression for $\frac{\delta{e_j}}{\delta{w_k}}$ which can be used to guide the iterative refinement of the weight $w_k$ so that $e_j$ moves towards zero.
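Here is a quick numerical sanity check of that expression, as a sketch with made-up numbers (the values and names are illustrative only): it compares the analytic $\frac{\delta{e_j}}{\delta{w_k}} = -\,o_j(1-o_j)\,x_k$ against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error_j(w, x, t):
    """e_j = t_j - o_j for a single output node."""
    return t - sigmoid(np.dot(w, x))

# Illustrative values, not from the question.
x = np.array([0.5, -0.2, 0.9])   # outputs of the preceding nodes
w = np.array([0.1, 0.3, -0.4])   # link weights into node j
t = 0.8                          # target output t_j
k = 1                            # which weight to check
h = 1e-6                         # finite-difference step

o = sigmoid(np.dot(w, x))
analytic = -o * (1.0 - o) * x[k]          # de_j/dw_k = -o_j(1 - o_j) x_k

w_plus = w.copy()
w_plus[k] += h
numeric = (error_j(w_plus, x, t) - error_j(w, x, t)) / h

print(analytic, numeric)   # the two estimates should agree closely
```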
Let's unfold that a bit more. If the error at a node $e_j = (t_j - o_j)$ is positive, and the gradient with respect to weight $w_k$ is positive, we reduce the weight. If the gradient is negative, we increase the weight. In this way we inch towards zero error.
The opposite applies if the error is negative. If the gradient is positive we increase the weight a bit, and if it is negative, we decrease the weight a bit.
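A toy sketch of that inching rule (the values, step size, and stopping tolerance are assumptions for illustration; this is not the blog post's code): a single output node whose weights are nudged, one small step at a time, in the direction given by the signs of $e_j$ and $\frac{\delta{e_j}}{\delta{w_k}}$, until the error is close enough to zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative toy setup: one output node fed by two links.
x = np.array([0.5, -0.2])   # outputs x_i of the preceding nodes
w = np.array([0.1, 0.3])    # link weights w_i into node j
t = 0.8                     # target output t_j
step = 0.01                 # size of each "inch"

for i in range(1000):
    o = sigmoid(np.dot(w, x))        # actual output o_j
    e = t - o                        # error e_j
    if abs(e) < 0.01:                # close enough to zero error
        break
    grad = -o * (1.0 - o) * x        # de_j/dw_k for each weight k
    # Sign rule from the paragraphs above: if e_j and the gradient share a
    # sign, reduce the weight; if they differ, increase it.
    w -= step * np.sign(e) * np.sign(grad)

print(i, w, e)
```

With a small enough step this inches $e_j$ towards zero from whichever side it starts on.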
So what's wrong with this analysis? And why do so many textbooks, papers and guides not explain their choice of the squared error as a cost function?
I've looked through dozens of papers, dozens of website guides, and about 10 proper textbooks.
I derived weight update expressions using the naive error cost function.
The key here is that we're not minimising it, we're trying to get it to zero - from both directions.
Seems to work well! See blog post, with example code and results.
http://makeyourownneuralnetwork.blogspot.co.uk/2016/01/why-squared-error-cost-function.html