This question might be a bit too much about machine learning but I think it's an appropriate question for this community.
Say I have a set $\mathcal X$ that contains instances: feature vectors of length $n$. Let's say I have a set $\mathcal Y$ which represents target variables (for brevity, assume binary classification with $\mathcal Y = \{0,1\}$). I seek to find the correct mapping $f$ such that $f(x_i) = y_i$ for all $x_i \in \mathcal X$.
Suppose I have a hypothesis mapping $h : \mathcal X \to \mathcal Y$. In terms of finding $f$ in a practical manner, I would have to:
1. Define an error function and minimize it with respect to the parameters of $h$.
2. Fit the elements of $\mathcal X$ with the parameters that minimize the error function to find the exact parameters. (The minimization, if computed through numerical means such as gradient descent, will not be perfect, but it will be close enough to approximate using a curve-fitting routine such as scipy's `curve_fit` function.)
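The fitting step above can be sketched as follows; the hypothesis `h`, the synthetic data, and the parameter values are all made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical hypothesis h with two free parameters a, b (illustrative choice).
def h(x, a, b):
    return a * x + b

# Synthetic training data: targets generated by a known mapping plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + 0.05 * rng.normal(size=x.size)

# curve_fit minimizes the sum of squared residuals over (a, b).
(a_hat, b_hat), _ = curve_fit(h, x, y)
```

The recovered `a_hat` and `b_hat` should land close to the generating values (2 and 1) because the noise is small.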
I imagine it is sensible to relate $f$ to $h$ by $f(x) = h(x) + \epsilon$, where $\epsilon$ is the error of $h$. Why, then, would I not be able to define my error function as follows, with training values $f(x_i) = y_i$...
$$\text{error function} = y_i - h(x_i)$$
This seems reasonable enough. Instead, however, we seem to use things like the mean squared error. Why is this necessary?
Other questions:
Also, can I write $h$ as $h(x)$, or must I write it as $h(x_1, x_2, \ldots, x_n)$ given that my vectors have length $n$?
Is the mean squared error summed over the length of the training set or the length of the full data set?
What if the error function has no minimum? What if it has more than one?
Let me answer your primary question. You want to minimize the error, so you want the error to be nonnegative, and zero only when you have an exact match. Your proposal of $\text{error} = y_i - h(x_i)$ does not meet this criterion: the error will be negative whenever $h(x_i)$ over-predicts $y_i$, that is, whenever $h(x_i) > y_i$. Since you will be summing over all your data, you don't want "negative error" to cancel out "positive error".
An obvious fix to your proposal is to define $\text{error} = |y_i - h(x_i)|$, the so-called absolute error. So why use the mean-square error rather than the absolute error?
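To see the cancellation concretely, here is a minimal sketch (the targets and predictions are made up) comparing the raw residual sum against the absolute and squared errors:

```python
import numpy as np

# Made-up targets and hypothesis outputs for illustration.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.5, 0.5, 0.5, 0.5])

residuals = y_true - y_pred

raw_sum = residuals.sum()        # positive and negative errors cancel to 0
mae = np.abs(residuals).mean()   # absolute error: no cancellation
mse = (residuals ** 2).mean()    # squared error: no cancellation
```

Every prediction is off by 0.5, yet the raw sum reports zero error; the absolute and squared versions both correctly report a nonzero error.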
First of all, sometimes you really do want to use the absolute error. For certain problems in machine learning and other fields, you're trying to reconstruct a signal which is "sparse" (think of an imaging problem where most of the domain you're imaging is empty space). There, the absolute error totally wins out over the mean-square error (see this, where the absolute error is referred to as L1 and the mean-square error as L2, for some sense of why). This is the basis of an entire field called compressive sensing.
But in many contexts, the mean-square error is totally appropriate and preferred over the absolute error. So why? Let me give a couple of answers:

1. The squared error is differentiable everywhere, while the absolute error is not differentiable at zero. That makes the mean-square error much friendlier to gradient-based optimization, and for linear models it even admits a closed-form minimizer (the normal equations).
2. Minimizing the mean-square error corresponds to maximum-likelihood estimation when the noise $\epsilon$ is assumed to be Gaussian, which is a natural modeling assumption in many settings.

In my experience thus far, these are the main reasons, though there are certainly others.
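One practical payoff of the squared error is that, for linear models, it can be minimized in closed form via the normal equations. A minimal sketch with synthetic data (all names and values here are illustrative):

```python
import numpy as np

# For a linear hypothesis h(x) = X @ w, the mean-square error has a
# closed-form minimizer: the normal equations w = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(1)
X = np.column_stack([np.linspace(0.0, 1.0, 30), np.ones(30)])
w_true = np.array([2.0, 1.0])
y = X @ w_true + 0.01 * rng.normal(size=30)

# Solve X^T X w = X^T y instead of forming an explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

No iterative search is needed; the absolute error has no analogous closed-form solution in general.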