In my Machine Learning course I was told that some desirable properties for cost functions are:
When the target y is real-valued, it is desirable that the cost is symmetric around 0, since both positive and negative errors should be penalized equally.
Also, our cost function should penalize “large” mistakes and “very large” mistakes similarly.
Then we define mean squared error as:
$MSE(w) = \frac{1}{N} \sum_{n=1}^{N}[y_n - f(x_n)]^2$
My question is how would I verify the above properties mathematically? Are these properties verified no matter what the model $f$ is taken to be (in class we just saw linear regression)?
The error at a given point is, by definition, the value $y_n - f(x_n)$.
Let $e_n = y_n - f(x_n)$ be the error for point $n$. If you want negative errors to be penalized as much as positive ones, this means that replacing $e_n$ by $-e_n$ should not change the value of the global cost. You can verify that this is the case here: $(-e_n)^2 = e_n^2$, so every term in the sum, and hence the MSE itself, is unchanged.
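You can also check this symmetry numerically. A minimal sketch (the data and predictions below are made up for illustration; `f` itself never appears, only its outputs):

```python
import numpy as np

# Hypothetical targets y_n and model outputs f(x_n)
y = np.array([3.0, -1.5, 2.0, 0.5])
f_x = np.array([2.5, -1.0, 3.0, 0.0])

errors = y - f_x                       # e_n = y_n - f(x_n)
mse = np.mean(errors ** 2)             # original cost
mse_flipped = np.mean((-errors) ** 2)  # replace each e_n by -e_n

print(mse == mse_flipped)  # True: the cost is symmetric in the errors
```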
Now, you can also remark that because of the square, "large" $e_n$ contribute more to your cost function than "small" $e_n$; moreover, "very large" errors are penalized even more heavily than "large" ones, not similarly -- I suspect that is what your teacher actually meant.
In these two paragraphs, I made no assumptions about $f$, so the conclusions are independent of $f$: they hold for linear regression and for any other model.