Minimize the sum of distances to a line


Given $n$ points in $\mathbb{R}^2$ (I suspect nothing essential changes in $\mathbb{R}^n$), find the parameters $A$ and $v$ such that the sum of the distances from the points to the line $A+tv$ is minimized.

I know the classic variant of this problem, which asks for the line minimizing the sum of the squares of the distances, but for the sum of the distances themselves I couldn't find a solution.


Two answers are given below.


Given the points $\{(t_i,y_i)\}$, your cost function (the sum of squared perpendicular distances) is $$C=\sum_{i=1}^n\frac{(y_i-vt_i-A)^2}{1+v^2}=\frac{1}{1+v^2}\sum_{i=1}^n(y_i-vt_i-A)^2,$$ which comes from the formula for the distance from a point to a line (see the Wikipedia article of that name) applied to the line $y=vt+A$, rewritten as $y-vt-A=0$. Note that $C$ is non-linear in $v$, unlike ordinary linear least-squares regression, which minimizes vertical distances. The partial derivatives are \begin{align*} \frac{\partial C}{\partial A}&=-\frac{2}{1+v^2}\sum_{i=1}^n(y_i-vt_i-A), \\ \frac{\partial C}{\partial v}&=-2\,\frac{(1+v^2)\sum_{i=1}^n t_i(y_i-vt_i-A)+v\sum_{i=1}^n(y_i-vt_i-A)^2}{(1+v^2)^2}. \end{align*} Setting both to zero and clearing the nonzero factors leaves the system \begin{align*} 0&=\sum_{i=1}^n(y_i-vt_i-A), \\ 0&=(1+v^2)\sum_{i=1}^n t_i(y_i-vt_i-A)+v\sum_{i=1}^n(y_i-vt_i-A)^2. \end{align*} You can then solve this system numerically, e.g. with gradient descent or stochastic gradient descent.
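As a sketch of the numerical approach just mentioned (a hypothetical implementation, not part of the answer; the function name and step sizes are my own choices), plain gradient descent on $C$ using the two partial derivatives above might look like:

```python
import numpy as np

def fit_tls_line(t, y, lr=0.005, steps=50_000):
    """Minimize C(A, v) = sum((y - v*t - A)^2) / (1 + v^2) by gradient descent."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    A, v = y.mean(), 0.0                       # crude initial guess
    for _ in range(steps):
        r = y - v * t - A                      # residuals y_i - v t_i - A
        dA = -2.0 * r.sum() / (1 + v**2)
        dv = (-2.0 * ((1 + v**2) * (t * r).sum() + v * (r**2).sum())
              / (1 + v**2)**2)
        A -= lr * dA
        v -= lr * dv
    return A, v

# Points exactly on y = 2t + 1, so the optimum should be A = 1, v = 2.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * t + 1
A, v = fit_tls_line(t, y)
```

The cost is non-convex in $v$, so in general the result can depend on the starting point; for well-behaved data a simple initialization like the one above tends to suffice.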

It's highly doubtful you could get an analytic solution similar to the normal equations of linear regression. However, the first equation is linear in $A$, and it's straightforward to solve: \begin{align*} 0&=\sum_{i=1}^n(y_i-vt_i-A) \\ 0&=\sum_{i=1}^n(y_i-vt_i)-nA \\ nA&=\sum_{i=1}^n(y_i-vt_i) \\ A&=\frac1n \sum_{i=1}^n(y_i-vt_i). \end{align*} With a view towards substituting into the other equation, relabel the summation index: $$A=\frac1n \sum_{j=1}^n(y_j-vt_j). $$ Substituting then yields $$0=(1+v^2)\sum_{i=1}^n\left[t_i\left(y_i-vt_i-\frac1n \sum_{j=1}^n(y_j-vt_j)\right)\right]+v\sum_{i=1}^n\left[y_i-vt_i-\frac1n \sum_{j=1}^n(y_j-vt_j)\right]^{\!2}. $$ You may or may not have gained anything by the substitution.
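One thing the substitution does buy is that the problem is now one-dimensional in $v$: since $A$ is determined by $v$, the cost can be scanned over candidate slopes. A hedged illustration (the helper below is hypothetical, not from the answer, and a grid scan is only one crude way to do the one-dimensional search):

```python
import numpy as np

def best_v(t, y, vs):
    """Scan candidate slopes; for each v, A is the closed-form A(v) above."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    costs = []
    for v in vs:
        A = (y - v * t).mean()                 # A(v) = (1/n) sum(y_j - v t_j)
        r = y - v * t - A
        costs.append((r**2).sum() / (1 + v**2))
    i = int(np.argmin(costs))
    return vs[i], (y - vs[i] * t).mean()

# Points on y = 2t + 1 again; the scan should land near v = 2, A = 1.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * t + 1
v, A = best_v(t, y, np.linspace(-5, 5, 10_001))
```

A bracketing one-dimensional minimizer (golden-section, for instance) would be a more refined alternative to the uniform grid.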


Here's an algorithm rather than a closed-form solution (compare computing the median of a set of numbers): try all lines passing through pairs of input points (in fact, it suffices to try only the halving lines).
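A minimal sketch of this algorithm, in the plain $O(n^3)$ form that tries every pair of points without the halving-line refinement (all names here are hypothetical):

```python
import numpy as np
from itertools import combinations

def best_line_through_pairs(pts):
    """Return (cost, A, v): the line A + t*v through two input points
    minimizing the sum of perpendicular distances to all points."""
    pts = np.asarray(pts, float)
    best = None
    for i, j in combinations(range(len(pts)), 2):
        p, q = pts[i], pts[j]
        d = q - p
        n = np.array([-d[1], d[0]])            # normal to the line p + t(q - p)
        n = n / np.linalg.norm(n)
        total = np.abs((pts - p) @ n).sum()    # sum of perpendicular distances
        if best is None or total < best[0]:
            best = (total, p, d)
    return best

# Four of these five points lie on y = 2x; (3, 7) is off by 1 vertically,
# i.e. at perpendicular distance 1/sqrt(5) from that line.
pts = [(0, 0), (1, 2), (2, 4), (3, 7), (4, 8)]
cost, A, v = best_line_through_pairs(pts)
```

By the argument below, restricting the search to lines through two input points loses nothing, so this brute force is exact.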

Why it works: w.l.o.g. the optimal line passes through an input point (otherwise you could translate the line parallel to itself without increasing the objective), so consider each input point in turn as an origin through which the line passes. Fix also the line's "combinatorial type", i.e., the sets $U$ and $L$ of input points lying above and below the line, respectively. That is, consider the line's rotation about the origin between hitting a point from above and hitting one from below. The optimum is attained either at an extreme rotation (i.e., when the line passes through two input points) or in the interior -- and the claim is that the latter cannot happen in any non-trivial instance (one where the input points are not all collinear). Indeed, the distance from $(x_i,y_i)$ to the line $x\cos\phi+y\sin\phi=0$ (through the origin) is $\pm( x_i\cos\phi+y_i\sin\phi)$ (the sign depending on whether the point is above or below the line), so the total distance from all the points is

$$\sum_U(x_i\cos\phi+y_i\sin\phi) - \sum_L(x_i\cos\phi+y_i\sin\phi)=\Delta x\cos\phi+\Delta y\sin\phi,$$ where $\Delta x = \sum_U x_i - \sum_L x_i$ and $\Delta y= \sum_U y_i - \sum_L y_i$. But a weighted sum of cosine and sine can be made $0$ (by choosing the direction perpendicular to the weight vector $(\Delta x,\Delta y)$); hence, since the total distance is nonnegative, if the minimum is attained at a non-extreme angle, that minimum is $0$, which forces all the points onto the line.
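The identity above can be checked numerically. This hypothetical snippet (points and angles chosen arbitrarily) fixes the combinatorial type at $\phi=0.3$, computes $\Delta x$ and $\Delta y$ from the resulting above/below split, and verifies that the total distance equals $\Delta x\cos\phi+\Delta y\sin\phi$ at nearby angles with the same split:

```python
import numpy as np

pts = np.array([(1.0, 3.0), (2.0, 1.0), (-1.0, -2.0), (3.0, -1.0)])

def total_distance(phi):
    """Sum of distances to the line x*cos(phi) + y*sin(phi) = 0 through the origin."""
    s = pts[:, 0] * np.cos(phi) + pts[:, 1] * np.sin(phi)
    return np.abs(s).sum()

# Split the points into U (above) and L (below) at phi = 0.3.
phi0 = 0.3
s = pts[:, 0] * np.cos(phi0) + pts[:, 1] * np.sin(phi0)
U, L = pts[s > 0], pts[s < 0]
dx = U[:, 0].sum() - L[:, 0].sum()   # Delta x
dy = U[:, 1].sum() - L[:, 1].sum()   # Delta y
```

As long as $\phi$ stays within the rotation range where the split $(U, L)$ is unchanged, $\Delta x\cos\phi+\Delta y\sin\phi$ reproduces the total distance exactly.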