Given a data set $\{(x_i, y_i) \mid i = 1, \cdots, n\} \subset \mathbb R^2$, I want to minimize the cost function
$J(a,b) = \sum_{i=1}^n (h(x_i)-y_i)^2$, where $h(x) = ax + b$.
Here, each term $|h(x_i) - y_i|$ measures the distance between $h$ and $(x_i, y_i)$ along the $y$-axis (intuitively, this corresponds to the vertical segment from $(x_i, y_i)$ to the graph of $h$). Finding the minimizing $h$ is not a big deal: set the partial derivatives $\partial J / \partial a$ and $\partial J / \partial b$ to zero and solve. Call the resulting function $H_1$.
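For reference, the standard closed-form solution of those two equations, writing $\bar x = \frac1n \sum_i x_i$ and $\bar y = \frac1n \sum_i y_i$ for the sample means, is

$$a = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad b = \bar y - a \bar x,$$

so $H_1(x) = ax + b$ with these coefficients.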
However, I have a feeling that I should instead fit the regression line that minimizes the sum of squared *orthogonal* distances from the data points to the line (I believe this is called total least squares, or orthogonal regression). One could imagine obtaining such a line as the limit of a sequence of approximating functions $h_1, h_2, \cdots$ converging pointwise to some linear function $H_2$.
My questions are:
- Is $H_1 = H_2$?
- If they are different, what is the philosophical reason that we take $H_1$ instead of $H_2$? Is it simply that $H_1$ is easier to compute?
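As a quick numerical sanity check of the first question, here is a sketch (assuming numpy, with a made-up five-point data set) comparing $H_1$, fitted by ordinary least squares, against the orthogonal fit computed via the leading principal component of the centered data, which is one standard way to get the total-least-squares line:

```python
import numpy as np

# Small made-up data set (hypothetical example points).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.9, 2.2, 4.5, 3.8])

# H1: ordinary least squares, minimizing vertical distances.
a1, b1 = np.polyfit(x, y, 1)

# H2: orthogonal (total least squares) fit. The best-fit line passes
# through the centroid, with direction given by the leading right
# singular vector of the centered data matrix.
X = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(X, full_matrices=False)
vx, vy = Vt[0]
a2 = vy / vx                 # slope of the orthogonal-fit line
b2 = y.mean() - a2 * x.mean()

print(a1, b1)  # vertical-distance slope and intercept
print(a2, b2)  # orthogonal-distance slope and intercept
```

On this data the two slopes already differ (roughly $1.00$ versus $1.10$), so in general $H_1 \neq H_2$; the orthogonal slope is steeper, consistent with the fact that vertical residuals are never shorter than orthogonal ones.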