Combine linear models of different sets of data.


I'm working with a large data set D that can be partitioned into disjoint subsets D1, D2, ..., Dn. For each subset Di, I have a linear model Mi that minimizes the residual error for the data in Di.

  1. Is there any way to create a combined model M from the parameters of the Mi's such that M minimizes the residual error for the data in D = D1 + D2 + ... + Dn? I don't want to recompute the model parameters from scratch because it is time-consuming.

  2. Assume I have the linear model M for some data set D, and another model M' for a subset D' of D. Is there any way to create a reduced model M'' from the parameters of M and M' such that M'' minimizes the residual error for the data in D'' = D - D'?

Any help would be appreciated!! Thanks!


Almost the same question was asked yesterday, so I shall basically repeat myself; this is a suggestion based on similar previous experience.

Say that for three independent sets A, B and C, you fitted linear regressions $Y=a_1 + b_1 X$, $Y=a_2 + b_2 X$ and $Y=a_3 + b_3 X$; now you want to consider the data coming from the union of A, B and C.

As a first step, I would perform a multilinear regression using the model
$$Y = a + b Z + c X + d X Z$$ introducing a variable Z equal to $1$ if the data point belongs to set A, $2$ if it belongs to set B, or $3$ if it belongs to set C. The standard analysis then applies to check whether parameters $b$ and $d$ are statistically significant. If they are, the three sets behave differently, and pooling them into a single regression would not be justified.
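This pooled regression with the indicator Z can be sketched as follows. The data here are synthetic and the coefficient values are illustrative; Z is used as a numeric label (1, 2, 3) exactly as described above.

```python
import numpy as np

# Hypothetical data: X values and set labels for points from A, B and C.
rng = np.random.default_rng(0)
n = 30
X = rng.uniform(0, 10, size=3 * n)
Z = np.repeat([1.0, 2.0, 3.0], n)  # 1 for set A, 2 for set B, 3 for set C

# Simulate Y from the pooled model Y = a + b*Z + c*X + d*X*Z + noise
# (illustrative coefficient values).
a, b, c, d = 2.0, 0.5, 1.5, -0.3
Y = a + b * Z + c * X + d * X * Z + rng.normal(0, 0.1, size=3 * n)

# Design matrix for the multilinear regression with the indicator Z.
A = np.column_stack([np.ones_like(X), Z, X, X * Z])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(coef)  # estimates of (a, b, c, d)
```

To decide whether $b$ and $d$ are statistically significant you would additionally need their standard errors (e.g. from `statsmodels.OLS`, which reports them directly); `lstsq` alone only gives the point estimates.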

I am afraid you cannot avoid reworking the matrices if you need anything more than the coefficients $a$ and $b$ of the regression (for instance, their standard errors).

If you just need the coefficients $a$ and $b$ of the overall regression, the normal equations allow you to reuse the sums already computed for each per-subset regression, so the combined coefficients come at almost no extra cost.
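A minimal sketch of that idea, assuming each subset's regression kept its normal-equation sums $A_i^T A_i$ and $A_i^T y_i$ (the subset names and helper function below are illustrative). Because these sums are additive over disjoint subsets, adding them answers question 1 and subtracting them answers question 2:

```python
import numpy as np

# For each subset we keep only XtX = A_i^T A_i and Xty = A_i^T y_i, where
# A_i is the design matrix [1, x] for subset i. These are exactly the sums
# already formed when solving each subset's normal equations.
rng = np.random.default_rng(1)

def suff_stats(x, y):
    A = np.column_stack([np.ones_like(x), x])
    return A.T @ A, A.T @ y

# Synthetic disjoint subsets sharing the line Y = 1 + 2 X (illustrative).
subsets = [rng.uniform(0, 10, size=20) for _ in range(3)]
ys = [1.0 + 2.0 * x + rng.normal(0, 0.1, size=x.size) for x in subsets]
stats = [suff_stats(x, y) for x, y in zip(subsets, ys)]

# Question 1 -- pooled fit: add the per-subset sums, then solve once.
XtX = sum(s[0] for s in stats)
Xty = sum(s[1] for s in stats)
a_b = np.linalg.solve(XtX, Xty)

# Question 2 -- reduced fit: subtract the sums of the removed subset.
XtX_red = XtX - stats[0][0]
Xty_red = Xty - stats[0][1]
a_b_red = np.linalg.solve(XtX_red, Xty_red)
```

Both solves cost only a small fixed-size linear system, regardless of how large the subsets are; no pass over the raw data is repeated.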