Best fit line to a set of $(x,y)$ points based on sum of squared perpendicular distances to the line


Given a set of points $ P_i = (x_i,y_i), \ i =1,2,\dots, N $. I want to find the best fitting line to these points. The equation of the line is

$$ n^T ( r - r_0 ) = 0 $$

where $n$ is a unit vector. I want to find $r_0$ and $n$ such that the sum of squared perpendicular distances from $(x_i, y_i)$ to the line is minimized. That is, I want to minimize

$ f = \displaystyle \sum_{i=1}^N d_i^2 $

where $d_i = | n^T (P_i - r_0) | $

My Attempt: is detailed in my answer below.

Your comments, hints, and solutions are highly appreciated.

Best Answer

I hope you don't mind that I do not use vectors, but instead apply the LS method to $y=mx+b$ and minimize the sum of the squared perpendicular distances

$\displaystyle\sum_k \frac{(y_k - (mx_k +b))^2}{1+m^2}\Rightarrow \min$

That is
$\displaystyle\frac{\displaystyle m^2\left(\sum x_{k}^2\right)+2m b \left(\sum x_{k}\right)-2m\left(\sum x_{k}y_{k}\right)+b^2 n-2b\left(\sum y_{k}\right)+\sum y_{k}^2}{m^2+1}$

$\displaystyle\frac{\partial}{\partial b}=0$ results in $\displaystyle b=\frac{\displaystyle-(\sum x_k)\cdot m+\sum y_k}{n}=\bar{y}-m\bar{x}$

Substituting this $b$ back into the ansatz results in (equation $IV$)

$\displaystyle\frac{\displaystyle m^2 n\left(\sum x_k^2\right)-m^2\left(\sum x_k\right)^2-2mn\left(\sum x_ky_k\right)+2m\left(\sum x_k\right)\left(\sum y_k\right)+ n\left(\sum y_k^2\right)-\left(\sum y_k\right)^2}{n\left(m^2+1\right)}$

Fun starts here: $\displaystyle\frac{\partial}{\partial m}=0$ results in

$\displaystyle m_{1,2}=\frac{-n\sum x_k^2+n\sum y_k^2+\left(\sum x_k\right)^2-\left(\sum y_k\right)^2\pm\sqrt{D}}{2\left(n\sum x_k y_k-\sum x_k\sum y_k\right)}$

where $m_1$ takes the $+$ sign, $m_2$ takes the $-$ sign, and

$\displaystyle D=n^2\left(\sum x_k^2\right)^2-2n^2\sum x_k^2\sum y_k^2-2n\sum x_k^2\left(\sum x_k\right)^2+2n\sum x_k^2\left(\sum y_k\right)^2+n^2\left(\sum y_k^2\right)^2+2n\sum y_k^2\left(\sum x_k\right)^2-2n\sum y_k^2\left(\sum y_k\right)^2+4n^2\left(\sum x_k y_k\right)^2-8n\sum x_k y_k\sum x_k\sum y_k+\left(\sum x_k\right)^4+2\left(\sum x_k\right)^2\left(\sum y_k\right)^2+\left(\sum y_k\right)^4$

A little simplification with
$\displaystyle n\sum x_k^2-\left (\sum x_k\right )^2=u$
$\displaystyle n\sum y_k^2-\left (\sum y_k\right )^2=v$
$\displaystyle n\sum x_k y_k-\sum x_k\cdot\sum y_k=w$
$v-u=p$,   $2w=q$   results in

$m_1=\displaystyle \frac{\sqrt{p^2+q^2}+p}{q}\,\mathrm{,}\,m_2=\frac{-\sqrt{p^2+q^2}+p}{q}$   which looks much less daunting than before.
To test whether this is a minimum, I substitute $m_1$ for $m$ in the second derivative of eq. $IV$:

$\displaystyle \frac{4\left(\sqrt{p^2+q^2}+p\right)\left(p^2+q^2\right)q^4}{n\left(\left(\sqrt{p^2+q^2}+p\right)^2+q^2\right)^3}$

Since $n$ is a positive integer, only the sign of $\displaystyle\sqrt{p^2+q^2}+p$ is relevant; and since $\sqrt{p^2+q^2}\ge |p|$, this quantity is non-negative, so the second derivative is positive and $m_1$ is the sought-for result.
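The closed form above translates directly into code. Here is a minimal NumPy sketch (the function name fit_line_pq is my own); it assumes $q \neq 0$, i.e. the data has some $x$-$y$ correlation:

```python
import numpy as np

def fit_line_pq(x, y):
    """Orthogonal least-squares fit y = m*x + b via the p, q shortcut.

    u = n*sum(x^2) - (sum x)^2, v = n*sum(y^2) - (sum y)^2,
    w = n*sum(x*y) - (sum x)*(sum y), p = v - u, q = 2*w,
    then m = (p + sqrt(p^2 + q^2)) / q and b = ybar - m*xbar.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    u = n * np.sum(x**2) - np.sum(x)**2
    v = n * np.sum(y**2) - np.sum(y)**2
    w = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    p, q = v - u, 2.0 * w
    m = (p + np.hypot(p, q)) / q   # m_1, the minimizing slope
    b = y.mean() - m * x.mean()    # from the partial derivative in b
    return m, b
```

For exactly collinear input, e.g. points on $y = 2x + 1$, the sketch recovers the line to machine precision.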

In case you are interested in a few more details, and maybe some routines for pocket calculators (and "emulated" remakes thereof), you may find here a paper I wrote some time ago.

Second Answer

First, let's manipulate the equation of the best fit line as follows. The equation of the line is

$ n^T (r - r_0) = 0 \hspace{15pt} (1)$

where $r= [x, y]^T $ and $r_0 = [x_0, y_0]^T $ is a point on this line.

Without any loss of generality (only the component of $r_0$ along $n$ enters the line equation), we can take

$ r_0 = \alpha n \hspace{15pt}(2)$

where $\alpha \in \mathbb{R}$ is to be identified.

Also, if $\theta$ is the angle that the line makes with the positive $x$ axis, then we can take $n$ as the unit vector:

$n = [-\sin \theta, \cos \theta]^T \hspace{15pt}(3)$

The equation of the line is thus specified in terms of two scalars, $\theta$ and $\alpha$, and is given by

$ n^T r = \alpha \hspace{15pt}(4) $

Next, find the expression for $d_i$:

$ d_i = | n^T P_i - \alpha | \hspace{15pt}(5)$

Therefore, we want to minimize

$ f = \displaystyle \sum_{i=1}^N (n^T P_i - \alpha)^2 = \sum_{i=1}^N ( n^T P_i P_i^T n - 2 \alpha n^T P_i + \alpha^2 )\hspace{15pt}(6)$

Differentiating $f$ with respect to $\alpha$, we get

$ \dfrac{\partial f}{\partial \alpha} = \displaystyle \sum_{i=1}^N ( -2 n^T P_i + 2 \alpha ) = 0 \hspace{15pt}(7)$

Solving for $\alpha$, we get

$ \alpha = n^T \overline{P} \hspace{15pt}(8)$

where

$ \overline{P} = \dfrac{1}{N} \displaystyle \sum_{i=1}^N P_i \hspace{15pt}(9)$

is the centroid of the points. In other words, the best-fit line passes through the centroid.

Substituting $(8)$, into $(6)$,

$ f = \displaystyle \sum_{i=1}^N \left( n^T P_i P_i^T n - 2 (n^T P_i) (n^T \overline{P} ) + n^T \overline{P}\, \overline{P}^T n \right) \hspace{15pt}(10) $

Re-writing this, it becomes,

$ f = \displaystyle n^T \left( \sum_{i=1}^N \left( P_i P_i^T - 2 P_i \overline{P}^T + \overline{P}\, \overline{P}^T \right) \right) n \hspace{15pt}(11) $

And this simplifies further to

$ f =\displaystyle n^T \left( \left(\sum_{i=1}^N P_i P_i^T \right) - N \overline{P}\, \overline{P}^T \right) n \hspace{15pt}(12) $

Hence, $f$ is a quadratic form in $n$. Since $n$ is constrained to be a unit vector, this is a Rayleigh-quotient minimization: the minimum of $f$ occurs with $n$ chosen as the unit eigenvector corresponding to the minimum eigenvalue of

$ Q = \displaystyle \left(\sum_{i=1}^N P_i P_i^T \right) - N \overline{P}\, \overline{P}^T \hspace{15pt} (13) $

$Q$ is a $2 \times 2$ positive semi-definite matrix (it is singular exactly when the points are collinear) whose eigenvalues/eigenvectors are easy to compute.

This completes the specification of the best fitting line.
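The recipe above can be sketched in a few lines of NumPy (the function name fit_line_eig is my own):

```python
import numpy as np

def fit_line_eig(points):
    """Orthogonal best-fit line via the smallest eigenvector of Q.

    points: (N, 2) array of (x, y) samples.
    Returns (n, alpha) so that the line is n^T r = alpha.
    """
    P = np.asarray(points, dtype=float)
    N = len(P)
    Pbar = P.mean(axis=0)
    # Q = sum_i P_i P_i^T - N * Pbar Pbar^T, the 2x2 scatter matrix of eq. (13)
    Q = P.T @ P - N * np.outer(Pbar, Pbar)
    evals, evecs = np.linalg.eigh(Q)   # eigh returns eigenvalues in ascending order
    n = evecs[:, 0]                    # unit eigenvector of the smallest eigenvalue
    alpha = n @ Pbar                   # equation (8): the line passes through Pbar
    return n, alpha
```

When $n_2 \neq 0$ the result converts back to slope-intercept form as $y = (-n_1/n_2)\, x + \alpha/n_2$; for collinear input the smallest eigenvalue is zero and the fit is exact.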

Numerical Example:

To assess the method outlined above, I started with a line whose equation is

$ y = 0.8 x - 3 $

Then took random points $(x_i, y_i), i = 1, 2, \dots, 16$, with $x_i \in [0, 15]$, and $y_i = 0.8 x_i - 3 $. Then added omni-directional noise to these points as follows

$ (x'_i , y'_i ) = (x_i, y_i) + \rho (\cos \phi, \sin \phi) $

where $\rho$ is uniformly distributed in $[0, 0.7]$ , and $\phi$ is uniformly distributed in $[0, 2 \pi] $

I then applied the method outlined above to the data set $\{ (x'_i, y'_i) \} $.
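The whole experiment can be reproduced with a short self-contained script (the seed is arbitrary and my own; any particular run gives slightly different coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility; choice is arbitrary

# Sample the true line y = 0.8 x - 3 and add omni-directional noise
N = 16
x = rng.uniform(0.0, 15.0, N)
y = 0.8 * x - 3.0
rho = rng.uniform(0.0, 0.7, N)
phi = rng.uniform(0.0, 2.0 * np.pi, N)
P = np.column_stack([x + rho * np.cos(phi), y + rho * np.sin(phi)])

# Orthogonal fit: unit eigenvector of Q for the smallest eigenvalue
Pbar = P.mean(axis=0)
Q = P.T @ P - N * np.outer(Pbar, Pbar)
n = np.linalg.eigh(Q)[1][:, 0]
alpha = n @ Pbar
m, b = -n[0] / n[1], alpha / n[1]   # back to y = m x + b form
print(f"y = {m:.5f} x {b:+.5f}")    # close to the true y = 0.8 x - 3
```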

The line of best fit that I got with a particular run was

$ y = 0.82077 x- 3.27044 $

The input points and the best fit line (in blue) are shown below. The line in green is $y = 0.8 x - 3$.


In a second run, I took $50$ samples of the same linear function $y = 0.8 x - 3 $ over the interval $[0, 15]$, and added omni-directional noise of the form mentioned above, but with $\rho$ uniformly distributed in $[0, 1.5]$.

The line of best fit that I got was

$ y = 0.824947 x - 3.18213 $

Both the original line and the line of best fit and the samples are shown below. The line of best fit is in blue, while the original linear function is in green.
