Neural network cost function - why squared error?


Question: Why is the squared error most often used for training neural networks?

Context:

  • Neural networks are trained by adjusting the link weights.
  • The key factor that informs these adjustments is how "wrong" the untrained network is.
  • The actual output and the desired output (training examples) will be different in an untrained network.
  • This difference (target - actual) is the error. There is an error for each output layer node.

Learning:

  • The link weights are adjusted so that the overall error of the network is minimised for each training example.
  • Most guides use a cost function $Cost=\sum_{output\ nodes}{(target-actual)^2}$ to minimise. They often don't give a reason for this choice, but when they do, they say:

    1. $(target-actual)^2$ is always positive, so errors of different signs don't cancel out and misrepresent how "wrong" the network is,
    2. the cost function is differentiable, so we can work out the sensitivity of the error to individual weights, the partial differential $\frac{\partial Cost}{\partial w}$,
    3. parallels to linear regression, where errors are assumed to be Gaussian, that is, distributed normally around the true value.
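For concreteness, point 1 can be sketched in a few lines of Python (a minimal illustration, not taken from any particular guide):

```python
# Squared-error cost summed over the output layer: opposite-sign errors
# both contribute positively instead of cancelling.
def squared_error_cost(targets, actuals):
    """Sum of (target - actual)^2 over the output nodes."""
    return sum((t - a) ** 2 for t, a in zip(targets, actuals))

# Errors here are +0.5 and -0.5; the raw sum would be 0, but the
# squared cost is 0.25 + 0.25 = 0.5.
cost = squared_error_cost([1.0, 0.0], [0.5, 0.5])
```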

Challenge:

  • We don't need the cost function to always be positive because we don't actually sum over the errors from each node. Instead we consider each output node in turn when we propagate the errors back. The errors at the output nodes are independent of each other.
  • You can find $\frac{\partial Error}{\partial w}$ from the simpler $Error=(target-actual)$. There is no problem deriving the partial differentials from this $Error$ expression. Instead of trying to minimise it, we instead move towards zero from either direction.
  • There is no reason to believe the errors in the output of a neural network are distributed normally.

Derivation of Simpler Weight Learning

  • The error at the $j$th output node is (target - actual), or $e_j = (t_j - o_j)$.
  • The actual output is $o_j = sigmoid(\sum_i{w_i x_i})$, where $w_i$ is the weight of the $i$th link into that $j$th node and $x_i$ is the output from the preceding $i$th node. Sigmoid is the popular squashing function $\frac{1}{1+e^{-x}}$.
  • So $\frac{\partial e_j}{\partial w_k} = \frac{\partial}{\partial w_k}(t_j - o_j) = -\frac{\partial}{\partial w_k}\,sigmoid(\sum_i{w_i x_i})$ because we can ignore the constant $t_j$. Differentiating the sum inside the sigmoid, only the $i=k$ term contributes a factor; the other $w_{i \neq k}$ still sit inside the sigmoid's argument but contribute nothing to the derivative with respect to $w_k$.
  • Using the known derivative of the sigmoid, $sigmoid'(z) = sigmoid(z)(1-sigmoid(z))$, that leaves us with $\frac{\partial e_j}{\partial w_k} = -\,sigmoid(\sum_i{w_i x_i})\,(1 - sigmoid(\sum_i{w_i x_i}))\,x_k = -\,o_j(1-o_j)\,x_k$.

So we have an expression for $\frac{\partial e_j}{\partial w_k}$ which can be used to guide the iterative refinement of the weight $w_k$ so that $e_j$ moves towards zero.
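As a sanity check, the standard chain-rule result $-o_j(1-o_j)\,x_k$ can be compared against a finite-difference estimate. A minimal Python sketch, with illustrative weight and input values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_e_wrt_wk(w, x, k):
    """Chain rule: d e_j / d w_k = -o_j * (1 - o_j) * x_k,
    where o_j = sigmoid(sum_i w_i * x_i)."""
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -o * (1.0 - o) * x[k]

# Cross-check against a finite-difference estimate; the constant t_j drops out.
w, x, t, k, h = [0.3, -0.2, 0.5], [1.0, 0.4, -0.6], 0.8, 1, 1e-6

def e_j(weights):
    return t - sigmoid(sum(wi * xi for wi, xi in zip(weights, x)))

w_plus = list(w)
w_plus[k] += h
numeric = (e_j(w_plus) - e_j(w)) / h
analytic = grad_e_wrt_wk(w, x, k)
```

The two values should agree to several decimal places, confirming the analytic derivative.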

Let's unfold that a bit more. If the error at a node $e_j = (t_j - o_j)$ is positive, and the gradient with respect to weight $w_k$ is positive, we reduce the weight. If the gradient is negative, we increase the weight. In this way we inch towards zero error.

The opposite applies if the error is negative. If the gradient is positive we increase the weight a bit, and if it is negative, we decrease the weight a bit.
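The sign-based update rule in the two paragraphs above can be sketched for a single weight (the input, target, and learning rate are toy values chosen for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy setup: one input, one weight, target output 0.8 for the node.
x, t = 1.0, 0.8
w, lr = 0.0, 0.05

for _ in range(200):
    o = sigmoid(w * x)
    e = t - o                     # signed error (target - actual)
    grad = -o * (1.0 - o) * x     # d e / d w from the chain rule
    if e == 0:
        break
    # Error and gradient with the same sign -> decrease the weight;
    # opposite signs -> increase it. Either way e inches towards zero.
    if (e > 0) == (grad > 0):
        w -= lr
    else:
        w += lr
```

With a fixed step size the weight eventually oscillates around the zero-error point, so the residual error ends up bounded by the step size rather than vanishing exactly.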

So what's wrong with this analysis? And why do so many textbooks, papers and guides not explain their choice of the squared error as a cost function?

I've looked through dozens of papers, dozens of website guides, and about 10 proper textbooks.

There are 2 best solutions below

BEST ANSWER

I derived weight update expressions using the naive error cost function.

The key here is that we're not minimising it; we're trying to get it to zero, from both directions.

It seems to work well! See the blog post, with example code and results.

http://makeyourownneuralnetwork.blogspot.co.uk/2016/01/why-squared-error-cost-function.html

SECOND ANSWER

The sum of squares is a usual way to turn a multi-objective optimisation (we want all the nodes to have low error) into a single-objective one (we optimise over a single number), which is much more convenient for standard optimisation methods like gradient descent.

Since the error is not expected to be zero everywhere, squaring expresses that we would like the errors to have similar values (rather than having some nodes with very small error and others with very large error). Scale is a concern in this reasoning, but since the output node values are bounded by sigmoidal functions, it is not an issue here.
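A small numeric illustration of this point (values chosen for illustration): two error vectors with the same total absolute error get different sums of squares, and the uneven one is penalised more.

```python
# Both error vectors have total absolute error 0.2, but squaring
# penalises the uneven spread twice as heavily.
balanced = [0.1, 0.1]
uneven = [0.0, 0.2]

sum_sq_balanced = sum(e ** 2 for e in balanced)   # 0.02
sum_sq_uneven = sum(e ** 2 for e in uneven)       # 0.04
sum_abs_balanced = sum(abs(e) for e in balanced)  # 0.2
sum_abs_uneven = sum(abs(e) for e in uneven)      # 0.2
```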

I think your reasoning ultimately amounts to the sum of the absolute values of the errors. Indeed, you can see that optimising a sum of absolute values is quite similar to optimising each term separately.

Sorry if I am being simplistic; I wrote this on my mobile, just to provide some intuition rather than a complete answer.