How to pick the "best" line equation that fits data and passes through the origin?


I have some data, shown in the image. Each symbol corresponds to a different dataset. As can be seen, the data initially behave like a line, and after some point they follow a different behaviour. [Image: plot of the data, one symbol per dataset.]

The idea is to fit a line that passes through the origin and approximates the data "well enough" in the region where the data behave linearly.

What I tried is the following. I pick a minimum number of data points $m_i$ for each dataset $D_i = \{d_{ij}=(x_j,y_j)_i,\ j\in\{1,\dots,N_i\}\}$, where $N_i = |D_i|$ is the number of points in the set, $x_1=0$, and $x_p < x_q$ if $p < q$. For each $D_i$ I consider the subsets $S_{ik}=\{d_{ij}=(x_j,y_j)_i,\ j\in\{1,\dots,k\}\}$, where $k \in \{m_i,\dots,N_i\}$. Each of these subsets has an associated correlation coefficient $p_{ik}$. My idea was to pick the maximum of these coefficients and to fit a line $y=mx$ through the data in the subset $S_{ik}$ that attains $\max_k p_{ik}$.

The problem with this is that the correlation coefficient tells how the $y$ data behave with respect to the $x$ data, i.e. how well an arbitrary line would fit the data. That line intercepts the $y$ axis at a point that in general is not $0$.

My next try was to fit a line $y=mx$ to each $S_{ik}$ instead and then compute the mean squared error (MSE), picking the subset with the minimum MSE. This also fails because, of course, the minimum is attained by the subset with the fewest data points (MSE penalizes the variation very hard; I am looking for something "softer").

Is there a way of knowing where the data behave like a line passing through the origin? I was thinking of taking a small subset of data $S_{ik}$, fitting a line $y=mx$, and computing the MSE. Then I would add one data point to the subset, compute its error, compare it with the MSE, and decide whether to continue adding data or to stop.

Any help is very much appreciated.


There are 2 best solutions below

5
On

I worked this out a number of years ago.

If you go through the process of minimizing $\sum (y_i-ax_i)^2$ as a function of $a$, you get $a =\frac{\sum x_iy_i}{\sum x_i^2}$.

Here's how the math works out:

Let $D =\sum_{i=1}^n (y_i-ax_i)^2 $. Then

$$\begin{aligned} \frac{\partial D}{\partial a} &=\sum_{i=1}^n \frac{\partial (y_i-ax_i)^2}{\partial a}\\ &=\sum_{i=1}^n -2x_i(y_i-ax_i)\\ &=-2\left(\sum_{i=1}^nx_iy_i-a\sum_{i=1}^nx^2_i\right)\\ &=0 \qquad\text{when }a = \dfrac{\sum_{i=1}^nx_iy_i}{\sum_{i=1}^nx^2_i} \end{aligned}$$
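As a quick illustration of the formula for $a$, here is a minimal Python sketch. The function name `slope_through_origin` and the sample numbers are made up for the example; they are not from the question.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope a for the model y = a*x (no intercept term)."""
    sxy = sum(x * y for x, y in zip(xs, ys))  # sum of x_i * y_i
    sxx = sum(x * x for x in xs)              # sum of x_i^2
    if sxx == 0:
        raise ValueError("all x values are zero, so the slope is undefined")
    return sxy / sxx


# Made-up sample data that is roughly y = 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.1, 3.9, 6.0]
print(slope_through_origin(xs, ys))
```

Note that a point at $x=0$ contributes nothing to either sum, so it has no influence on the slope.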

For this value of $a$,

$$\begin{aligned} D &=\sum_{i=1}^n \left(y_i-ax_i\right)^2\\ &=\sum_{i=1}^n \left(y^2_i-2ax_iy_i+a^2x^2_i\right)\\ &=\sum_{i=1}^n y^2_i-2a\sum_{i=1}^nx_iy_i+a^2\sum_{i=1}^nx^2_i\\ &=\sum_{i=1}^n y^2_i-2\dfrac{\sum_{i=1}^nx_iy_i}{\sum_{i=1}^nx^2_i}\sum_{i=1}^nx_iy_i+\left(\dfrac{\sum_{i=1}^nx_iy_i}{\sum_{i=1}^nx^2_i}\right)^2\sum_{i=1}^nx^2_i\\ &=\sum_{i=1}^n y^2_i-\dfrac{\left(\sum_{i=1}^nx_iy_i\right)^2}{\sum_{i=1}^nx^2_i} \end{aligned}$$
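If you want to sanity-check the closed form for $D$, a small Python sketch like the following compares it with the directly computed sum of squared residuals. The function names and the data are hypothetical, chosen only for the check.

```python
def residual_direct(xs, ys):
    """Sum of squared residuals for the best y = a*x, computed term by term."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    a = sxy / sxx
    return sum((y - a * x) ** 2 for x, y in zip(xs, ys))


def residual_closed_form(xs, ys):
    """The same quantity from the closed form: sum(y^2) - (sum(xy))^2 / sum(x^2)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return syy - sxy * sxy / sxx


# Made-up data; the two values should agree up to rounding error.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.9, 4.2, 5.8, 8.1]
print(residual_direct(xs, ys), residual_closed_form(xs, ys))
```

In floating point the closed form subtracts two nearly equal numbers when the fit is very good, so it can lose precision or even come out slightly negative; clamping it at zero is a reasonable precaution.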

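Note that both $a$ and $D$ depend on the data only through the three sums $\sum x_iy_i$, $\sum x_i^2$ and $\sum y_i^2$, so both can be easily updated when a data point is added or removed. That way, you can examine how $D$ changes as points are added (though you probably want to look at $\frac1{n}D$, so that the value does not grow just because there are more points) and stop when it gets too large.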
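One way this updating scheme might look in code is sketched below. It is only a sketch under assumptions: the function name `linear_prefix`, the parameters `min_points` and `max_mean_sq`, and the sample data are all invented for the example, and the threshold (in units of $y^2$) has to be chosen by you for your own data.

```python
def linear_prefix(xs, ys, min_points=3, max_mean_sq=0.05):
    """Grow the subset one point at a time (points sorted by x) and return
    (k, a): the largest prefix length k whose mean squared residual D/k stays
    at or below max_mean_sq, and the slope a fitted to that prefix.
    Returns (0, None) if even the first min_points points exceed the threshold.
    """
    sxx = sxy = syy = 0.0          # running sums of x^2, x*y, y^2
    best_k, best_a = 0, None
    for k, (x, y) in enumerate(zip(xs, ys), start=1):
        # Adding a point only requires updating the three sums.
        sxx += x * x
        sxy += x * y
        syy += y * y
        if k < min_points or sxx == 0:
            continue
        # Closed form for D; clamp at 0 to guard against rounding error.
        d = max(syy - sxy * sxy / sxx, 0.0)
        if d / k > max_mean_sq:    # assumed stopping rule: D/k too large
            break
        best_k, best_a = k, sxy / sxx
    return best_k, best_a


# Made-up data: exactly y = 2x for the first five points, then it flattens.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.0, 2.0, 4.0, 6.0, 8.0, 8.5, 8.7, 8.8]
k, a = linear_prefix(xs, ys)
print(k, a)  # → 5 2.0
```

Stopping at the first prefix that exceeds the threshold is a design choice; with noisy data you might instead scan all prefixes and look for where $\frac1{n}D$ starts to rise sharply.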

2
On

Are you required to use a straight line? It looks to me like a quadratic would fit better and require only a little more trouble.