In many optimization problems, applying a non-linear operation to the objective can change which values are optimal.
For example, in machine learning: summing absolute errors (Manhattan distance) vs. summing squared errors (squared Euclidean distance):
$$ \sum_i |x_i-y_i| $$ $$ \sum_i (x_i-y_i)^2 $$
However, I am having trouble explaining intuitively why this is the case. A voice in the back of my head keeps saying:
bigger numbers get bigger and smaller numbers get smaller, so the biggest number will still be the biggest and the smallest number will still be the smallest.
I know this is not the case, but can someone offer some insight and examples as to why?
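For concreteness, a tiny counterexample (with made-up per-point errors) shows where the "biggest stays biggest" intuition breaks down: squaring preserves the order of *individual* errors, but it does not preserve the order of their *sums*, so two candidate fits can swap ranks.

```python
# Two hypothetical candidate fits and their per-point errors
errs_a = [2, 2]   # candidate A: two moderate errors
errs_b = [0, 3]   # candidate B: one zero error, one large error

# L1 (sum of absolute errors) and L2 (sum of squared errors) for each
l1_a, l1_b = sum(abs(e) for e in errs_a), sum(abs(e) for e in errs_b)
l2_a, l2_b = sum(e**2 for e in errs_a), sum(e**2 for e in errs_b)

print(l1_a, l1_b)  # 4 3  -> B is better under absolute error
print(l2_a, l2_b)  # 8 9  -> A is better under squared error
```

Squaring inflates B's single large error (3 → 9) more than it inflates A's two moderate errors (2 → 4 each), so the winner flips.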
When you square before summing, the big errors get more weight and pull the fit harder in their direction. As a specific example, suppose we have the data $\{0,1,2,3,10\}$ and want to find the best single-value fit. If we minimize the sum of the absolute errors we get the median, $2$. If we minimize the sum of the squared errors we get the mean, $\frac{16}5$. Minimizing the squared errors corresponds nicely (in the rather unlikely case that the errors are normally distributed) to maximizing the likelihood of the observed data.
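The median-vs-mean claim above can be checked numerically with a brute-force search over candidate fit values (the grid of candidates here is my own choice for illustration):

```python
data = [0, 1, 2, 3, 10]

def sum_abs(c):
    """L1 loss: sum of absolute errors for fit value c."""
    return sum(abs(x - c) for x in data)

def sum_sq(c):
    """L2 loss: sum of squared errors for fit value c."""
    return sum((x - c) ** 2 for x in data)

# Grid search over [0, 10] in steps of 0.001
candidates = [i / 1000 for i in range(10001)]
best_l1 = min(candidates, key=sum_abs)
best_l2 = min(candidates, key=sum_sq)

print(best_l1)  # 2.0  -> the median of the data
print(best_l2)  # 3.2  -> the mean of the data (16/5)
```

The single outlier $10$ drags the squared-error minimizer from $2$ up to $3.2$, while the absolute-error minimizer stays at the median.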