Separating populations and estimating line-fit parameters

Question

108 Views Asked by Bumbble Comm At 11 May 2026 - 3:46

Given a dataset containing two populations, each of which is well described (high R$^2$) by a linear relationship between the same two variables, how does one separate the two populations (and, in the process, compute each line fit)?

This is fairly easy to do graphically - just create a scatterplot and the two lines are pretty apparent. But how does one do this algorithmically?

More generally, given a dataset containing an unknown number $n$ of populations, each of which can be fit to a line with some lower bound on R$^2$ (e.g., 0.95), how does one separate the data into the minimum number of populations satisfying the R$^2$ criterion?

Original Q&A

There are 3 best solutions below

**Bumbble Comm** · Answer 1 · 2013-12-27 05:39:44

If I understand correctly, you have two relations, Y = a1 + b1 X for one population and Y = a2 + b2 X for a second population, and you would like to merge them into a single model. If that is the case, build a model Y = a + b X + c Z in which Z is 1 for observations belonging to the first population, 2 for the second population, and so on.
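For illustration, here is a minimal sketch of fitting Y = a + b X + c Z by least squares with NumPy. It assumes the population label Z is already known for every observation (which the question does not give you), and the function name `fit_indicator_model` and the data are made up for the example. Note that a single coefficient c only shifts the intercept between populations; if the slopes differ too, an extra X·Z interaction column would be needed.

```python
import numpy as np

def fit_indicator_model(x, y, z):
    """Least-squares fit of y = a + b*x + c*z; returns (a, b, c)."""
    # Design matrix columns: constant term, x, and the population code z.
    design = np.column_stack([np.ones_like(x, dtype=float), x, z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return tuple(float(v) for v in coef)

# Made-up data: two parallel lines, with labels assumed known.
x = np.tile(np.arange(10.0), 2)          # x = 0..9 for each population
z = np.repeat([1.0, 2.0], 10)            # 1 = first population, 2 = second
y = np.where(z == 1.0, 1.0 + 2.0 * x, 4.0 + 2.0 * x)

a, b, c = fit_indicator_model(x, y, z)
print(a, b, c)
```

With this coding, the first population's intercept is a + c and the second's is a + 2c.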

**Bumbble Comm** · Answer 2 · 2013-12-28 17:16:02

See below for a better way, but how about a very naive iterative algorithm:

1. Use the dataset and estimate a line fit: $y=a+bX$.
2. Throw out the observation with the greatest distance to the estimated line, i.e., remove $\text{argmax}_i (y_i-a-bX_i)^2$ if $\max_i (y_i-a-bX_i)^2 > t \ge 0$, where $t$ is some threshold, and return to step 1; otherwise stop.

Once you stop, the data set that remains should lie close to some line; that would be population 1. Everything you threw out should be population 2. But I am pretty sure this will not always converge to a good classification. (This depends on $t$, but also on how noisy the data are.) You can check whether the classification was successful by looking at the fit for all the observations you threw out in the iterative procedure; if the line fits well, it worked.
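The procedure above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names `fit_line` and `trim_to_line`, the threshold value, and the data are all made up, and the example uses a clean majority line plus a few far-away points, which is the easy case for this approach.

```python
def fit_line(points):
    """Ordinary least-squares fit y = a + b*x; returns (a, b).
    Assumes at least two distinct x values."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def trim_to_line(points, t):
    """Refit and drop the worst point until every squared residual is <= t."""
    kept, removed = list(points), []
    while len(kept) > 2:
        a, b = fit_line(kept)                                  # step 1
        sq_res = [(y - a - b * x) ** 2 for x, y in kept]
        worst = max(range(len(kept)), key=sq_res.__getitem__)
        if sq_res[worst] <= t:                                 # nothing exceeds t
            break
        removed.append(kept.pop(worst))                        # step 2
    return kept, removed

# Made-up data: 20 points on y = 1 + 2x and 3 points on y = 80 + 4x.
pop1 = [(float(x), 1.0 + 2.0 * x) for x in range(20)]
pop2 = [(5.0, 100.0), (10.0, 120.0), (15.0, 140.0)]
kept, removed = trim_to_line(pop1 + pop2, t=1.0)

print(fit_line(kept))      # line for the points that survived
print(fit_line(removed))   # sanity check: do the discarded points fit a line?
```

The last line is the check described above: if the discarded points are themselves well fit by a line, the split probably worked.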

Otherwise, there are clustering methods that find groups that belong together in your data. In particular, you could look at k-means clustering (which is a less naive way to do the above, it seems). If you choose such a method, find one where you can use your knowledge that the relationship within clusters is linear, and that you have two clusters. It will improve classification considerably.
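One way to use the knowledge that each cluster is a line is a k-means-style loop in which the cluster "centers" are fitted lines rather than points: assign every observation to the line with the smaller squared residual, refit each line, and repeat until the labels stop changing. The sketch below is a hypothetical illustration for exactly two lines, not a library routine; the name `two_line_clustering`, the initial split (sign of the residual from one pooled fit), and the data are assumptions. Like k-means, it can get stuck in a poor solution depending on the starting split, especially when the lines cross.

```python
def fit_line(points):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def two_line_clustering(points, max_iter=50):
    """Alternate between refitting two lines and reassigning points."""
    # Assumed starting split: below (0) or above (1) a single pooled fit.
    a, b = fit_line(points)
    labels = [0 if y - a - b * x < 0 else 1 for x, y in points]
    lines = None
    for _ in range(max_iter):
        groups = [[p for p, lab in zip(points, labels) if lab == k] for k in (0, 1)]
        if any(len({x for x, _ in g}) < 2 for g in groups):
            break  # a group has too few distinct x values to define a line
        lines = [fit_line(g) for g in groups]
        # Reassign each point to the line with the smaller squared residual.
        new_labels = [
            min((0, 1), key=lambda k: (y - lines[k][0] - lines[k][1] * x) ** 2)
            for x, y in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels, lines

# Made-up data: y = 1 + 2x and y = 20 + 0.5x, each sampled at x = 0..9.
pts = [(float(x), 1.0 + 2.0 * x) for x in range(10)]
pts += [(float(x), 20.0 + 0.5 * x) for x in range(10)]
labels, lines = two_line_clustering(pts)
print(labels)
print(lines)
```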

**Bumbble Comm** · Answer 3 · 2014-01-21 19:13:14

You might consider trying DBSCAN, the density-based clustering algorithm. It's available in the Python scikit-learn library, along with regression tools for this kind of work.
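As a rough sketch of how that could look with scikit-learn, the snippet below clusters the points with `DBSCAN` and then fits a `LinearRegression` to each cluster. The helper name `cluster_and_fit`, the `eps` and `min_samples` values, and the data are assumptions chosen for this made-up example; in practice `eps` has to be tuned to the spacing of your data. Keep in mind that DBSCAN groups points by proximity, so it can only separate the populations where the lines are spatially apart; lines that cross or touch will tend to end up in one cluster.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.linear_model import LinearRegression

def cluster_and_fit(xy, eps=2.0, min_samples=3):
    """Cluster (x, y) points with DBSCAN, then fit a line to each cluster.
    Returns the label array and {label: (intercept, slope)}."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(xy)
    fits = {}
    for lab in np.unique(labels):
        if lab == -1:            # DBSCAN marks noise points with -1
            continue
        pts = xy[labels == lab]
        model = LinearRegression().fit(pts[:, [0]], pts[:, 1])
        fits[int(lab)] = (float(model.intercept_), float(model.coef_[0]))
    return labels, fits

# Made-up data: two well-separated parallel lines, y = 2x and y = 2x + 30.
x = np.arange(0.0, 10.0, 0.5)
xy = np.vstack([np.column_stack([x, 2.0 * x]),
                np.column_stack([x, 2.0 * x + 30.0])])
labels, fits = cluster_and_fit(xy)
print(fits)
```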
