Separating populations and estimating line-fit parameters

107 Views Asked by At

Given a dataset containing two populations, each of which can be described by a linear relationship between two variables in each sample with high R$^2$, how does one separate the two populations (and incidentally compute the line-fit)?

This is fairly easy to do graphically - just create a scatterplot and the two lines are pretty apparent. But how does one do this algorithmically?

More generally, given a dataset containing an unknown number n of populations, each of which can be fit to a line with some lower bound on R$^2$ (e.g., .95), how does one separate the data into the minimum number of populations satisfying the R$^2$ criterion?

3

There are 3 best solutions below

2
On

If I properly understand, you have two relations Y = a1 + b1 X for one population and Y = a2 + b2 X for a second population and you would like to merge them. If my hypothesis is correct, build a model Y = a + b X + c Z in which Z will be 1 if belonging to the first population, 2 to the second population and so on.

0
On

See below for a better way, but how about a very naive iterative algorithm:

  1. Use the dataset and estimate a line fit: $y=a+bX$.

  2. Throw out the observation with the greatest distance to the estimated line, i.e., remove $\text{argmax}_i (y_i-a-bX_i)^2 $ if $\max_i (y_i-a-bX_i)^2> t\ge0 $, where $t$ is some threshold, and continue at (1.); otherwise stop.

Once you stop, the data set that remains should be close around some line; that would be population 1. Everything you threw out should be population 2. But I am pretty sure this may not always converge to a good classification. (This depends on $t$, but also how noisy the data is.) You can check if the classification was successful by looking at the fit for all the observations you threw out in the iterative prodedure; if the line fits well, it worked.

Otherwise there are some cluster methods which find groups that belong together in your data. In particular, you could look at k means clustering (which is a less naive way to do the above, it seems). If you chose such a method, find one where you can use your knowledge that the relationship within clusters is linear, and that you have two clusters. It will improve classiciation considerably.

1
On

You might consider trying DBSCAN the density clustering algorithm. It's available in the Python scikit library along with regression software for this kind of work.