square loss function in classification


I know the square loss function in the regression context as follows:

$(y-f(x))^2$

for $y$ the real and $f(x)$ the predicted value. This formulation is quite easy to understand: we have a convex loss function where the loss is based on the difference between real and predicted values, and outliers are penalized more heavily because this difference is squared.
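To see this penalty behaviour numerically, here is a minimal sketch (NumPy, with made-up values; the last point is deliberately an outlier):

```python
import numpy as np

# True and predicted values; the last point is an outlier.
y = np.array([1.0, 2.0, 3.0, 10.0])
f_x = np.array([1.1, 2.1, 2.9, 4.0])

squared_loss = (y - f_x) ** 2
print(squared_loss)        # small residuals contribute little...
print(squared_loss.sum())  # ...while the outlier, (10 - 4)^2 = 36, dominates
```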

What I don't understand however is the following formulation as often found in a classification context:

$L(f(x),y) = (1 - y*f(x))^2$

First of all, we do not calculate the difference between the real and predicted values, but multiply them. Why? Given, e.g., that both values were large, would this amount to a large loss? Secondly, for a binary classification problem with a misclassification where $y = -1$ and $f(x) = 1$, it seems that the loss would equate to 4? Lastly, why do we have to subtract the product $y*f(x)$ from 1?

BEST ANSWER

There are a few subtleties to the notation / standardization that are being missed.

Firstly, "1" in your expression for $L$ does not mean the scalar value 1: it means the all 1's vector. Similarly, $f(x)$ means the vector of $f$'s values at each data point, and $y$ is the corresponding vector of true (desired) values at each data point. The squaring is really 'sum of squares' -- or you can think of the squaring as dotting the vector with itself.

When (at one data point) $y = -1$ and $f = 1$, then $(1 - y * f)^2 = (1 + 1)^2 = 4$, a penalty. When $y = 1$ and $f = 1$, then $(1 - y*f)^2 = (1 - 1)^2 = 0$, indicating a perfect match and thus no penalty.
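These two cases can be checked directly; a minimal sketch in plain Python (the function name is my own):

```python
def squared_margin_loss(y, f):
    """Pointwise (1 - y*f)^2 loss for labels y in {-1, +1}."""
    return (1 - y * f) ** 2

# Misclassification with full confidence: maximal penalty.
print(squared_margin_loss(-1, 1))  # 4
# Correct, confident classification: no penalty.
print(squared_margin_loss(1, 1))   # 0
```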

You can indeed show that this formulation is equivalent to the standard (in other areas) formulation of $(y-f(x))^2$:

$$({\bf 1} - y * f)^2 = {\bf 1}\cdot{\bf 1} - 2{\bf 1}\cdot(y * f) + (y*f)^2$$

The ${\bf 1}\cdot{\bf 1}$ is a constant and so irrelevant to the fit. ${\bf 1}\cdot(y*f) = y\cdot f$, as you can easily verify by writing out the sum. And when $y$ consists entirely of $\pm 1$ entries, each $y_i^2 = 1$, so the last term is just $f^2$ -- resulting in $-2 y\cdot f + f^2$. Compare with,

$$(y-f)^2 = y^2 - 2y\cdot f + f^2$$

Now again the leading $y^2$ is a constant that can be dropped, and the rest of the terms match up.
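In fact, once $y \in \{-1,+1\}$, the two losses agree pointwise, since both expand to $1 - 2yf + f^2$. A quick numerical check of this (my own sketch):

```python
import random

random.seed(0)
for _ in range(1000):
    y = random.choice([-1, 1])
    f = random.uniform(-3.0, 3.0)
    # With y = +/-1, (1 - y*f)^2 and (y - f)^2 expand to the same polynomial.
    assert abs((1 - y * f) ** 2 - (y - f) ** 2) < 1e-12
print("identical for y in {-1, +1}")
```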

But why, in the field of machine learning, is the $(1-yf)^2$ format preferred if it's equivalent? In curve fitting, a sum-of-squares penalty makes a lot of sense: add up the error at each point. The errors are real-valued and scaled appropriately. Machine learning is much more focused on classification, which has a strict "yes"/"no" interpretation. The vector $y*f$ then indicates whether each prediction was correct, and is manipulated in different ways. $({\bf 1}-y*f)^2$ is one loss function; other popular ones are the logistic loss $\sum\log(1+e^{-y*f})$, which is closely related to perplexity; or $\sum\textrm{sign}(y*f)$, which is essentially accuracy.
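All of these losses depend on the data only through the margin $m = y*f$, so they can be compared as functions of a single variable. A sketch (function names are my own; I take the logistic loss in its usual form $\log(1+e^{-m})$):

```python
import math

def squared_margin(m):
    """(1 - y*f)^2 written as a function of the margin m = y*f."""
    return (1 - m) ** 2

def logistic(m):
    """log(1 + e^{-m}), the usual logistic loss."""
    return math.log1p(math.exp(-m))

def zero_one(m):
    """1 for a misclassification (m <= 0), 0 otherwise."""
    return 1.0 if m <= 0 else 0.0

# All three penalize negative margins; they differ in how they
# treat confident correct predictions (large positive m).
for m in [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]:
    print(f"m={m:+.1f}  squared={squared_margin(m):6.2f}  "
          f"logistic={logistic(m):5.3f}  zero-one={zero_one(m):.0f}")
```

Note that the squared-margin loss is the only one of the three that also penalizes *over*-confident correct predictions ($m > 1$), which is one reason it is less common than the logistic loss in practice.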