Neural network cost function - why squared error?


Question: Why is the squared error most often used for training neural networks?

Context:

  • Neural networks are trained by adjusting the link weights.
  • The key factor that informs these adjustments is how "wrong" the untrained network is.
  • The actual output and the desired output (training examples) will be different in an untrained network.
  • This difference (target - actual) is the error. There is an error for each output layer node.

Learning:

  • The link weights are adjusted so that the overall error of the network is minimised for each training example.
  • Most guides use a cost function $Cost=\sum_{output\ nodes}{(target-actual)^2}$ to minimise (a small numeric example of this cost follows the list below). They often don't give a reason for this choice, but when they do they say:

    1. $(target-actual)^2$ is always positive, so errors of different signs don't cancel out and misrepresent how "wrong" the network is,
    2. the cost function is differentiable, so we can work out the sensitivity of the cost to individual weights, the partial derivative $\frac{\partial{Cost}}{\partial{w}}$,
    3. parallels to linear regression, where errors are assumed to be Gaussian, that is, distributed normally around the true value
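For concreteness, here is a minimal sketch of what that cost computes, with invented example values:

```python
import numpy as np

# Invented example: desired and actual outputs for three output nodes.
target = np.array([0.9, 0.1, 0.5])
actual = np.array([0.6, 0.3, 0.5])

# Squared-error cost summed over the output nodes.
cost = np.sum((target - actual) ** 2)
print(cost)  # ~0.13: every node contributes a non-negative term
```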

Challenge:

  • We don't need the cost function to always be positive because we don't actually sum over the errors from each node. Instead we consider each output node in turn when we propagate the errors back. The errors at the output nodes are independent of each other.
  • You can find $\frac{\partial{Error}}{\partial{w}}$ from the simpler $Error=(target-actual)$. There is no problem deriving the partial derivatives from this $Error$ expression. Instead of trying to minimise it, we move it towards zero from either direction.
  • There is no reason to believe the errors in the output of a neural network are distributed normally.

Derivation of Simpler Weight Learning

  • The error at the $j$th output node is (target - actual), or $e_j = (t_j - o_j)$.
  • The actual output is $o_j = \sigma(\sum_i{w_i x_i})$, where $w_i$ is the weight of the $i$th link into that $j$th node and $x_i$ is the output from the preceding $i$th node. Here $\sigma$ is the popular sigmoid squashing function $\frac{1}{1+e^{-x}}$.
  • So $\frac{\partial{e_j}}{\partial{w_k}} = \frac{\partial}{\partial{w_k}}(t_j - o_j) = -\frac{\partial}{\partial{w_k}}\sigma(\sum_i{w_i x_i})$ because we can ignore the constant $t_j$. Writing $z = \sum_i{w_i x_i}$, only the $i=k$ term depends on $w_k$, so $\frac{\partial{z}}{\partial{w_k}} = x_k$.
  • Since we know how to differentiate the sigmoid, $\sigma'(z) = \sigma(z)(1-\sigma(z))$, the chain rule leaves us with $\frac{\partial{e_j}}{\partial{w_k}} = -\sigma(z)(1-\sigma(z))\,x_k$.

So we have an expression for $\frac{\partial{e_j}}{\partial{w_k}}$ which can be used to guide the iterative refinement of the weight $w_k$ so that $e_j$ moves towards zero.

Let's unfold that a bit more. If the error at a node $e_j = (t_j - o_j)$ is positive, and the gradient with respect to weight $w_k$ is positive, we reduce the weight. If the gradient is negative, we increase the weight. In this way we inch towards zero error.

The opposite applies if the error is negative. If the gradient is positive we increase the weight a bit, and if it is negative, we decrease the weight a bit.
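As a sanity check, here is a minimal sketch of that rule for a single sigmoid output node; the input values, starting weights, and learning rate are all invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -0.2])   # outputs of the preceding nodes (invented)
w = np.array([0.1, 0.8])    # link weights to be learned
t = 0.9                     # target output
lr = 0.5                    # learning rate

for _ in range(300):
    o = sigmoid(np.dot(w, x))   # actual output
    e = t - o                   # plain (unsquared) error
    # de/dw_k = -sigmoid'(z) * x_k, with sigmoid'(z) = o * (1 - o)
    grad = -o * (1.0 - o) * x
    # If the error and the gradient share a sign, reduce the weight;
    # if they differ, increase it: the error inches towards zero.
    w -= lr * np.sign(e) * grad

print(t - sigmoid(np.dot(w, x)))  # close to zero after training
```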

So what's wrong with this analysis? And why do so many textbooks, papers and guides not explain their choice of the squared error as a cost function?

I've looked through dozens of papers, dozens of website guides, and about 10 proper textbooks.

There are 4 answers below.

BEST ANSWER

I derived weight update expressions using the naive error cost function.

The key here is that we're not minimising it, we're trying to get it to zero - from both directions.

It seems to work well! See the blog post, with example code and results.

http://makeyourownneuralnetwork.blogspot.co.uk/2016/01/why-squared-error-cost-function.html

ANSWER

Sum of squares is a usual way to turn multiobjective optimisation (we want all the nodes to have low error) into a single objective (we optimise over a single number), which is much more convenient for standard optimisation methods like gradient descent.

Since the error is not expected to be zero everywhere, squaring expresses that we would like the errors to have similar values (rather than having some nodes with very small error and others with very large). Scale could be a concern in this reasoning, but since the output node values are bounded by sigmoidal functions, it is not.

I think your reasoning ultimately expresses the sum of absolute values of the errors. Indeed, you can see that optimising a sum of absolute values is quite similar to optimising each one separately.
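To make that connection concrete (my own one-line derivation, not part of the original answer): for $o \neq t$, $\frac{\partial{|t - o|}}{\partial{w_k}} = \operatorname{sign}(t - o)\cdot\frac{\partial{(t - o)}}{\partial{w_k}}$, so plain gradient descent on the absolute error multiplies the question's $\frac{\partial{Error}}{\partial{w}}$ by the sign of the error, which is exactly the "move towards zero from either direction" rule described there.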

Sorry if I am being simplistic; I wrote this on my mobile, just to provide some intuition, not a complete answer.

ANSWER

Understanding this answer demands some basic familiarity with probability and Gaussian distributions. If you are comfortable with these topics, then continue; otherwise, look them up first and then come back to this answer.

The mean square error is a consequence of performing maximum log-likelihood estimation over the conditional probability distribution of the output. Maximizing the log-likelihood is the same as minimizing the cross entropy between the true data-generating distribution and the distribution learned by the model. Since this question is not about explaining cross entropy, I'll give a reference and skip that: https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/

What is maximum likelihood estimation?

Maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model, given observations: for i.i.d. observations $x^{(1)}, \ldots, x^{(m)}$, $\theta_{ML} = \arg\max_\theta \prod_{i=1}^{m} p_{model}(x^{(i)}; \theta)$, which is usually maximized in its log form, $\sum_{i=1}^{m} \log p_{model}(x^{(i)}; \theta)$.

Linear regression as maximum likelihood

The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability $P(y \mid x; \theta)$ in order to predict $y$ given $x$. This is actually the most common situation because it forms the basis for most supervised learning. If $X$ represents all our inputs and $Y$ all our observed targets, then the conditional maximum likelihood estimator is $\theta_{ML} = \arg\max_\theta P(Y \mid X; \theta)$, which, assuming the examples are i.i.d., decomposes into $\theta_{ML} = \arg\max_\theta \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}; \theta)$. These definitions are from the Deep Learning book by Ian Goodfellow et al. To read more about estimators, read this chapter: http://www.deeplearningbook.org/contents/ml.html

Now we have the base to consider maximum likelihood estimation in the context of neural networks.

Mean square error in neural networks

[Screenshots from Pattern Recognition and Machine Learning by Bishop; the derivation they showed is sketched below.]
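In outline (my paraphrase of the standard argument): assume the target $t$ is the network output plus Gaussian noise, $p(t \mid x, \mathbf{w}) = \mathcal{N}(t \mid y(x, \mathbf{w}), \beta^{-1})$. For $N$ i.i.d. training pairs, the negative log-likelihood is

$-\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}) = \frac{\beta}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2 - \frac{N}{2}\ln\beta + \frac{N}{2}\ln(2\pi)$

Only the first term depends on $\mathbf{w}$, so maximizing the likelihood is equivalent to minimizing the sum-of-squares error. The Gaussian noise assumption is what singles out the squared error.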

Hope this helps in understanding the use of mean squared error in machine learning. If you have any queries or issues regarding this answer, please write them in the comments.

ANSWER

Sum of squares is quite a standard cost function in linear regression modelling. For multiclass classification, cross entropy is more commonly used, together with the softmax applied to the multivariate output.

When it comes to your question, I will base my answer on the assumption that what you are proposing here is the usage of the new (simpler?) error function: $Cost = (target - output)$ (written in vector form). (Sorry, I do not have enough reputation to ask you for clarification in a comment.)

First, let us recall that the vanilla gradient update per weight in a neural network is as follows: $w_i = w_i - \eta \cdot \frac{\partial{Cost}}{\partial{w_i}} \tag{1}$ where $\eta$ is the learning rate.

Now let's look at the example of a logistic neuron having two inputs $x_1$ and $x_2$, two weights $w_1$ and $w_2$ (associated with each input) and one bias $b$. The logistic neuron applies the sigmoid/logistic function $\sigma(z)$ to the value $z = x_1 w_1 + x_2 w_2 + b$, so the output of the neuron is as follows:

$output = \sigma(z) \equiv \frac{1}{1+e^{-z}}\tag{2} $

Suppose we are using the sum of squares cost function, with one training sample in online learning:

$Cost=\frac{1}{2}(target-output)^2\tag{3}$

We would have the following derivative for weight $w_1$ (based on (3) and (2)):

$\frac{\partial{Cost}}{\partial{w_1}} = \frac{\partial{Cost}}{\partial{\sigma(z)}} \cdot \frac{\partial{\sigma(z)}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{w_1}} = -(target - output) \cdot \sigma(z)(1-\sigma(z)) \cdot x_1 \tag{4}$

So when we put (4) into (1), we would have the following weight update:

$w_1 = w_1 + \eta \cdot (target - output) \cdot \sigma(z)(1-\sigma(z)) \cdot x_1 \tag{5}$

So each weight is updated proportionally to the difference between the predicted (output) and true (target) values, and proportionally to the value of the input.
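Here is a minimal sketch (invented values, forward finite difference) that checks derivative (4) numerically and applies update (5) once:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, x, b, target):
    # Equation (3): one-sample sum-of-squares cost.
    return 0.5 * (target - sigmoid(np.dot(w, x) + b)) ** 2

x = np.array([0.4, -0.7])   # invented inputs
w = np.array([0.3, 0.9])    # invented weights
b, target, eta = 0.1, 0.8, 0.1

# Analytic derivative (4) for w_1:
out = sigmoid(np.dot(w, x) + b)
grad_w1 = -(target - out) * out * (1.0 - out) * x[0]

# Finite-difference estimate of the same derivative.
eps = 1e-6
w_eps = w.copy()
w_eps[0] += eps
print(grad_w1, (cost(w_eps, x, b, target) - cost(w, x, b, target)) / eps)

# Weight update (5): move w_1 against the gradient of the cost.
w[0] -= eta * grad_w1
```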

Instead, if we used the error function $Cost = (target - output)$ and modified formula (1) as follows:

$w_i = w_i \;(-/+)\; \eta \cdot \frac{\partial{Cost}}{\partial{w_i}} \tag{6}$ so that it uses $+$ if $(target - output)$ is negative and $-$ otherwise, then formula (5) would look as follows:

$w_1 = w_1 \;(-/+)\; \eta \cdot \sigma(z)(1-\sigma(z)) \cdot x_1 \tag{7}$

As you can notice, the weight updates would no longer be proportional to the difference between the predicted (output) and true (target) values. In my opinion there is nothing wrong with that, but the first approach has two advantages: formula (1) is consistent and works with many cost functions, and, more importantly, learning is faster because the updated weights do not oscillate around the optimal solution; as the learning converges, the derivatives get smaller and smaller. (I did not try to simulate this, but see the sketch below.)
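For anyone who wants to run that simulation, here is a minimal sketch (all values invented) that trains the same two-input logistic neuron with update (5) and with the sign-based update (7):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, b, target, eta = np.array([0.4, -0.7]), 0.1, 0.8, 2.0

def final_error(signed):
    w = np.array([0.3, 0.9])
    for _ in range(100):
        out = sigmoid(np.dot(w, x) + b)
        err = target - out
        grad = out * (1.0 - out) * x        # shared factor of (5) and (7)
        if signed:
            w += eta * np.sign(err) * grad  # rule (7): fixed-size steps
        else:
            w += eta * err * grad           # rule (5): steps shrink with err
    return target - sigmoid(np.dot(w, x) + b)

# Update (5) settles smoothly; the sign-based update (7) keeps taking
# full-size steps near the solution, so its error hops around zero.
print(final_error(signed=False), final_error(signed=True))
```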