When fitting a curve in $\mathbb R^2$ to data points in $\mathbb R^2$ (example), why is each point's vertical distance from the curve squared instead of its shortest (possibly diagonal) distance from the curve?
Ignoring my poorly drawn curves, it seems obvious that the first fit is worse than the second, even though the red line is shorter in the first image: you can draw the much shorter blue line instead (which I labeled $b$ in the second image). Minimizing $b^2$ seems much more important than minimizing $a^2$.
In fact, diagonal distance is used in some cases.
From a practical point of view, the standard "vertical" distance is used when it is assumed that in the 2D data available, the $x$ values have been measured (almost) exactly, while the $y$ values are subject to error.
When both $x$ and $y$ are subject to error, then under the usual assumptions of independence, near-Gaussian distributions, etc., the distance should be taken along a line with slope $\sigma_y / \sigma_x$.
That's called total least squares (when the two error variances differ, the variance-weighted version is known as Deming regression).
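To make the contrast concrete, here is a minimal sketch comparing the two fits on synthetic data (the data, noise levels, and variable names are my own, chosen for illustration). Ordinary least squares minimizes squared vertical distances; total least squares with equal error variances in $x$ and $y$ minimizes squared perpendicular distances, and the best-fit direction is the first principal component of the centered data, obtainable via an SVD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: true line y = 2x + 1, with equal-variance
# Gaussian noise added to BOTH coordinates.
n = 200
x_true = np.linspace(0.0, 10.0, n)
x = x_true + rng.normal(0.0, 0.5, n)
y = 2.0 * x_true + 1.0 + rng.normal(0.0, 0.5, n)

# Ordinary least squares: minimizes squared *vertical* distances.
b_ols, a_ols = np.polyfit(x, y, 1)

# Total least squares (orthogonal regression): minimizes squared
# *perpendicular* distances. The fitted line passes through the
# centroid along the direction of maximal variance, i.e. the first
# right singular vector of the centered data matrix.
X = np.column_stack([x - x.mean(), y - y.mean()])
_, _, vt = np.linalg.svd(X, full_matrices=False)
dx, dy = vt[0]
b_tls = dy / dx
a_tls = y.mean() - b_tls * x.mean()

print(f"OLS:  y = {b_ols:.3f} x + {a_ols:.3f}")
print(f"TLS:  y = {b_tls:.3f} x + {a_tls:.3f}")
```

Because the $x$ values are noisy here, the OLS slope is biased slightly toward zero (attenuation), while the TLS slope stays close to the true value of 2. When the error variances are unequal, one would first rescale the coordinates by $\sigma_x$ and $\sigma_y$, which is exactly the slope-$\sigma_y/\sigma_x$ direction described above.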