This question might be a bit too much about machine learning but I think it's an appropriate question for this community.
Say I have a set $\mathcal X$ that contains instances: feature vectors of length $n$. Let's say I have a set $\mathcal Y$ which represents target variables (for brevity, assume binary classification with $\mathcal Y = \{0,1\}$). I seek to find the correct mapping $f$ such that $f(x_i) = y_i$ for all $x_i \in \mathcal X$.
Suppose I have a hypothesis mapping $h : \mathcal X \to \mathcal Y$. In terms of finding $f$ in a practical manner, I would have to:
1. Define an error function and minimize it with respect to the parameters of $h$.
2. Fit the elements of $\mathcal X$ with the parameters that minimize the error function to find the exact parameters. (The minimization, if computed through numerical means such as gradient descent, will not be perfect, but it will be close enough to approximate using a curve-fitting routine such as scipy's `curve_fit` function.)
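The fitting step above can be sketched as follows; the hypothesis `h`, the synthetic data, and the parameter values are all made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical hypothesis h with two free parameters a, b (illustrative choice).
def h(x, a, b):
    return a * x + b

# Synthetic training data: targets generated by a known mapping plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + 0.05 * rng.normal(size=x.size)

# curve_fit minimizes the sum of squared residuals over (a, b).
(a_hat, b_hat), _ = curve_fit(h, x, y)
```

The recovered `a_hat` and `b_hat` should land close to the generating values (2 and 1) because the noise is small.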
I imagine it is sensible to relate $f$ to $h$ by $f(x) = h(x) + \epsilon$, where $\epsilon$ is the error of $h$. Why, then, would I not be able to define my error function as follows, with training values $f(x_i) = y_i$...
$$\text{error function} = y_i - h(x_i)$$
This seems reasonable enough. Instead, however, we seem to use things like the mean squared error. Why is this necessary?
Other questions:
Also, can I write $h$ as $h(x)$, or must I write it as $h(x_1, x_2, \ldots, x_n)$ given that my vectors have length $n$?
Is the mean squared error summed over the length of the training set or the length of the full data set?
What if the error function has no minimum? What if it has more than one?
Let me answer your primary question. You want to minimize the error, so you want the error to be nonnegative, and zero only when you have an exact match. Your proposal of $\text{error} = y_i - h(x_i)$ does not meet this criterion: the error will be negative whenever $h(x_i)$ over-predicts $y_i$, that is, whenever $h(x_i) > y_i$. Since you will be summing over all your data, you don't want "negative error" to cancel out "positive error".
An obvious fix to your proposal is to define $\text{error} = |y_i - h(x_i)|$, the so-called absolute error. So why use the mean-square error rather than the absolute error?
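To see the cancellation concretely, here is a minimal sketch (the targets and predictions are made up) comparing the raw residual sum against the absolute and squared errors:

```python
import numpy as np

# Made-up targets and hypothesis outputs for illustration.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.5, 0.5, 0.5, 0.5])

residuals = y_true - y_pred

raw_sum = residuals.sum()        # positive and negative errors cancel to 0
mae = np.abs(residuals).mean()   # absolute error: no cancellation
mse = (residuals ** 2).mean()    # squared error: no cancellation
```

Every prediction is off by 0.5, yet the raw sum reports zero error; the absolute and squared versions both correctly report a nonzero error.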
First of all, sometimes you really do want to use the absolute error. For certain problems in machine learning and other fields, you're trying to reconstruct a signal which is "sparse" (think of an imaging problem where most of the domain you're imaging is empty space). There, the absolute error totally wins out over the mean-square error (see this, where the absolute error is referred to as L1 and the mean-square error as L2, for some sense of why). This is the basis of an entire field called compressive sensing.
But in many contexts, the mean-square error is totally appropriate and preferred over the absolute error. So why? Let me give a couple of answers:

1. The squared error is differentiable everywhere, while the absolute error is not differentiable at zero. That makes the mean-square error much friendlier to gradient-based optimization, and for linear models it even admits a closed-form minimizer (the normal equations).
2. Minimizing the mean-square error corresponds to maximum-likelihood estimation when the noise $\epsilon$ is assumed to be Gaussian, which is a natural modeling assumption in many settings.

In my experience thus far, these are the main reasons, though there are certainly others.
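One practical payoff of the squared error is that, for linear models, it can be minimized in closed form via the normal equations. A minimal sketch with synthetic data (all names and values here are illustrative):

```python
import numpy as np

# For a linear hypothesis h(x) = X @ w, the mean-square error has a
# closed-form minimizer: the normal equations w = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(1)
X = np.column_stack([np.linspace(0.0, 1.0, 30), np.ones(30)])
w_true = np.array([2.0, 1.0])
y = X @ w_true + 0.01 * rng.normal(size=30)

# Solve X^T X w = X^T y instead of forming an explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

No iterative search is needed; the absolute error has no analogous closed-form solution in general.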