My question is about the linear fit and the least squares method.
Why do we choose to minimize the quantity
$$ S = \sum_{i=1}^n r_i^2 $$
rather than the sum of the residuals themselves,
$$ \sum_{i=1}^n r_i, \qquad \text{where } r_i = y_i - f(x_i, \beta)? $$
Let's think about the least squares fit intuitively: I want to find a straight line that approximates the distribution of some points in the $xy$ plane. The first thing that comes to mind is to find a line whose distance $r_i$ from each point is as small as possible, not necessarily the square of that distance. Why this choice?
Using the sum of the signed deviations doesn't really make sense, because then you can make the intercept $b$ of the line $y = ax + b$ arbitrarily large and positive: the line floats above all the points, every signed deviation becomes very negative, and the "objective" decreases without bound, so no minimizer exists.
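To see the degeneracy concretely, here is a minimal sketch (the data points and the line $y = ax + b$ are made up for illustration): raising the intercept drives the sum of signed residuals toward $-\infty$.

```python
import numpy as np

# Illustrative data points; the model is a line y = a*x + b.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.1, 1.9, 3.2])

def signed_sum(a, b):
    """Sum of signed residuals r_i = y_i - (a*x_i + b)."""
    return np.sum(y - (a * x + b))

# Raising the intercept b makes the "objective" arbitrarily negative,
# so minimizing it has no solution: the line just floats upward forever.
for b in (0.0, 10.0, 1000.0):
    print(b, signed_sum(1.0, b))
```

Each unit increase in $b$ lowers the sum by exactly $n$, so there is no minimum to find.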
Minimizing the absolute deviations, i.e. minimizing $\sum_{i=1}^n |y_i-f(x_i,\beta)|$, would make a certain degree of sense.
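As a sketch of what an absolute-deviations fit looks like in practice (the data, including the outlier, are made up for illustration), the following uses the known property that a least-absolute-deviations line can always be chosen to pass through two of the data points, so a brute-force search over pairs finds an optimum:

```python
import numpy as np
from itertools import combinations

# Illustrative data: four collinear points plus one outlier.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

def l1_loss(a, b):
    """Sum of absolute deviations for the line y = a*x + b."""
    return np.sum(np.abs(y - (a * x + b)))

# An L1-optimal line can always be chosen to pass through two data
# points, so brute-forcing over all pairs finds an optimum.
candidates = []
for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
    if xi == xj:
        continue
    a = (yj - yi) / (xj - xi)
    candidates.append((a, yi - a * xi))
a_l1, b_l1 = min(candidates, key=lambda ab: l1_loss(*ab))

# Ordinary least squares for comparison (degree-1 polynomial fit).
a_ls, b_ls = np.polyfit(x, y, 1)

print("L1 fit:", a_l1, b_l1)   # follows the four collinear points
print("LS fit:", a_ls, b_ls)   # pulled toward the outlier
```

The contrast also shows a real difference between the two objectives: the L1 fit ignores the outlier, while the least squares fit is pulled toward it.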
The most straightforward reason to favor least squares instead is the Gauss-Markov theorem, which says (in the simplest case) that if the $y_i$ are random additive perturbations of $f(x_i,\beta)$ where the perturbations have mean zero, are uncorrelated, and have the same finite variance, then least squares provides the best linear unbiased estimator for the parameters. Some of these assumptions can also be relaxed by considering weighted versions of least squares.
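A small simulation can illustrate the Gauss-Markov guarantee (the true line, the noise level, and the competing "endpoint" estimator below are all chosen for illustration): both the OLS slope and the slope through the first and last points are linear unbiased estimators under the stated noise model, but the OLS slope has the smaller variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative true line y = a*x + b and design points.
a_true, b_true = 2.0, 1.0
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([x, np.ones_like(x)])

ols_slopes, endpoint_slopes = [], []
for _ in range(5000):
    # Mean-zero, uncorrelated, equal-variance noise (Gauss-Markov setting).
    y = a_true * x + b_true + rng.normal(0.0, 0.5, size=x.size)
    # Least squares estimate of (a, b).
    a_ols, _ = np.linalg.lstsq(X, y, rcond=None)[0]
    # A competing *linear unbiased* estimator: slope through the endpoints.
    a_end = (y[-1] - y[0]) / (x[-1] - x[0])
    ols_slopes.append(a_ols)
    endpoint_slopes.append(a_end)

# Both estimators average to the true slope 2.0, but Gauss-Markov
# guarantees the OLS slope has the smaller variance.
print(np.mean(ols_slopes), np.var(ols_slopes))
print(np.mean(endpoint_slopes), np.var(endpoint_slopes))
```

Any other linear unbiased estimator you substitute for the endpoint slope will show the same pattern: unbiased, but at least as variable as OLS.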
But if you just have data that you want to fit (to interpolate, perhaps) and aren't thinking in terms of a noise model, then a different metric might work out. Some advantages of least squares from this point of view include: