Minimizing error with several linear regressions and choosing the best one for each point

33 Views Asked by At

Given $n$ points $(x_1, y_1), (x_2, y_2), \ldots, (x_n ,y_n)$ and $m \in \mathbb N$. We aim to

  • find $m$ linear models $f_j : y = k_j x + b_j, j \in [m]$
  • to minimize $\sum_{i \in [n]} \min_{j \in [m]} err((x_i, y_i), f_j)$, where $err$ can be any kind of error function, e.g., it can be $err((x_i, y_i), f_j) = (y_i - f_j(x_i))^2$.

Intuition: Sometimes data points can come from different distributions. If we have multiple linear regressions and allow each point to choose the best linear model for it, we may have better fitting.

Question: When $m = 1$, it is just normal linear regression. Can we have a closed-form solution for general $m$?

1

There are 1 best solutions below

1
On BEST ANSWER

Your problem is NP-hard. Therefore, you shouldn't expect any general algorithm that is efficient and works in all cases. In particular, this means there is no reasonable closed-form solution (as any reasonable closed-form solution should yield a polynomial-time algorithm), assuming P!=NP.

If you need an algorithm that is as good as possible in practice, I suggest asking on Stats.SE or CS.SE, and giving some information about typical values for $n,m$ for the kinds of problems you face. It might be possible to devise an algorithm, e.g., based on the EM method, that works reasonably well in many settings.


Proof of NP-hardness: The proof is by reduction from the ABC-partition problem, a variant of the 3-partition problem.

Let $A,B,C,T$ be an instance of ABC-partition, i.e., let $T$ be a target integer and $A,B,C$ be any three sets of $m$ integers, such that the sum of numbers in each set is $mT$. Create $n=3m$ data points $(-1,a-T/3)$, $(0,T/6-b/2)$, $(1,c-T/3)$ for each $a\in A,b \in B, c\in C$. Now check whether there is a way to fit $m$ linear models to these $n$ data points, such that the total error is 0.

I claim that every solution with 0 error corresponds to a solution to ABC-partition, and vice versa. Note that there exists a straight line passing through the three data points $(-1,u)$, $(0,v)$, $(1,w)$ iff $v=(u+w)/2$, i.e., iff $u-2v+w=0$.

Suppose we have a solution to ABC-partition. Then for each triple $(a,b,c)$ in the solution, we have $a+b+c=T$, so $(a-T/3)+(b-T/3)+(c-T/3)=0$. Then setting $u=a-T/3$, $v=T/6-b/2$, $w=c-T/3$, we see that

$$u-2v+w=(a-T/3)-2(T/6-b/2)+(c-T/3)=(a-T/3)+(b-T/3)+(c-T/3)=0,$$

so there is a straight line through this triple. This gives a way to find $m$ lines, one through each triple in the solution to the ABC-partition problem, that go through all $3m$ points, with 0 error. Since the solution to ABC-partition covers every element of $A\cup B \cup C$, every data point is covered by some line that goes exactly through that line.

Conversely, suppose we have any solution to your problem with total error 0, i.e., $m$ lines such that each of the $3m$ data points is touched by one of the $m$ lines. Clearly, no line can touch 4 or more points (since $A,B,C$ are sets); any line can touch at most 3 points. Therefore, the only way to cover all $3m$ points is for each line to cover a different set of 3 out of the $3m$ data points. Consider any line. It must cover three points, call them $(-1,u)$, $(0,v)$, $(1,w)$. Since the line goes through all three points, by the above discussion, we must have $u-2v+w=0$. Now by construction, we have $a\in A, b\in B, c\in C$ such that $u=a-T/3$, $v=T/6-b/2$, $w=c-T/3$. It follows that $(a-T/3)-2(T/6-b)+(c-T/3)=0$, from which it follows that $a+b+c=T$. Therefore, we immediately obtain a solution to the ABC-partition problem: there is one triple per line, containing the values $(a,b,c)$ such that the line goes through $(-1,a-T/3)$, $(0,T/6-b/2)$, $(1,c-T/3)$.


A careful reader will note that I assumed that $A,B,C$ are sets. The standard definition of ABC-partition treats $A,B,C$ as multisets. However, ABC-partition remains NP-hard even if we restrict to problem instances where $A,B,C$ are sets (i.e., contain no duplicates). See the paper Multigraph realizations of degree sequences: Maximization is easy, minimization is hard, by Heather Hulett, Todd G. Will, Gerhard J. Woeginger, Operations Research Letters, Corollary 8. I learned of this paper from https://cstheory.stackexchange.com/q/716/5038.