Intuitively, why does squaring a loss function change optimal values?


In many optimization problems, applying a non-linear transformation to the objective can change which values are optimal.

For example, in machine learning: the sum of absolute errors (Manhattan distance) vs. the sum of squared errors (squared Euclidean distance):

$$ \sum_i |x_i - y_i| $$ $$ \sum_i (x_i - y_i)^2 $$

However, intuitively I am having trouble answering why this is the case. A voice in the back of my head keeps saying

bigger numbers will get bigger and smaller numbers will get smaller, thus, the biggest number will still be the biggest and the smallest number will still be the smallest.

I know this is not the case, but can someone offer some insight and examples as to why?



**Accepted answer**

When you square before summing, the big errors carry more weight and pull the fit harder in their direction. As a specific example, suppose we have the data $\{0,1,2,3,10\}$ and want the single value that best fits it. If we minimize the sum of the absolute errors we get the median, $2$. If we minimize the sum of the squared errors we get the mean, $\frac{16}5$. Minimizing the squared errors also corresponds (when the errors are normally distributed) to maximizing the likelihood of the observed data.
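A minimal numeric sketch of this example (assuming Python with NumPy): scanning a grid of candidate fits, the absolute-error loss is minimized at the median while the squared-error loss is minimized at the mean.

```python
import numpy as np

data = np.array([0, 1, 2, 3, 10])
candidates = np.linspace(0, 10, 10001)  # candidate constant fits c

# Loss of each candidate c over the whole data set.
abs_loss = np.abs(data[None, :] - candidates[:, None]).sum(axis=1)
sq_loss = ((data[None, :] - candidates[:, None]) ** 2).sum(axis=1)

best_abs = candidates[abs_loss.argmin()]  # the median, 2
best_sq = candidates[sq_loss.argmin()]    # the mean, 16/5 = 3.2
print(best_abs, best_sq)
```

Note how the outlier $10$ leaves the absolute-error minimizer at $2$ but drags the squared-error minimizer up to $3.2$.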

**Answer**

Think about finding a point on the real line that is as close as possible to the points $0$ and $1$. If you use a sum of Manhattan distances, you get $$ D(x) = |x-0| + | 1 - x |. $$

For any $x$ between $0$ and $1$, the value of $D(x)$ is exactly $1$. Outside that interval, $D$ is strictly greater than $1$. So the Manhattan distance gives you infinitely many optima: every point of $[0,1]$.
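A quick check in plain Python (the sample points are chosen to be exact in binary floating point): $D$ ties at $1$ everywhere on $[0,1]$ and exceeds $1$ outside.

```python
def D(x):
    # Sum of Manhattan distances to the anchor points 0 and 1.
    return abs(x - 0) + abs(1 - x)

# Every point inside [0, 1] achieves the same minimal value, 1.
for x in [0.0, 0.25, 0.5, 0.75, 1.0]:
    assert D(x) == 1.0

# Points outside the interval do strictly worse.
assert D(-0.5) > 1 and D(1.5) > 1
print("all points in [0, 1] tie at D = 1")
```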

Now look at $$ E(x) = (x-0)^2 + (1 - x)^2. $$

Expanding gives $E(x) = 2x^2 - 2x + 1$, so $E'(x) = 4x - 2$ vanishes only at $x = \frac12$. There is a unique optimum, and it is nicely "balanced" between the two anchor points.
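The same check for the squared loss, in plain Python: $x = 0.5$ beats every other candidate, so the optimum is unique.

```python
def E(x):
    # Sum of squared distances to the anchor points 0 and 1.
    return (x - 0) ** 2 + (1 - x) ** 2

# The minimum value is E(0.5) = 0.25 + 0.25 = 0.5.
assert E(0.5) == 0.5

# Every other candidate does strictly worse -- no ties this time.
assert all(E(x) > E(0.5) for x in [0.0, 0.25, 0.75, 1.0, 2.0])
print("unique minimum at x = 0.5")
```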

Everything you need to know is really contained in this one example; it's worth a bit of study.