Let's say that we have three data sets that share some of the same independent variables:
| $Y_1$ | $X_1$ | $X_2$ | $X_3$ |
|---|---|---|---|
| 0 | 1 | -1 | 7 |
| 3 | 4 | -5 | 2 |
| 2 | 2 | -3 | 4 |

| $Y_2$ | $X_2$ | $X_3$ |
|---|---|---|
| 5 | -10 | 1 |
| 1 | -2 | 6 |
| 2 | -4 | 4 |

| $Y_3$ | $X_1$ | $X_3$ |
|---|---|---|
| 8 | 11 | 0 |
| 7 | 9 | 0 |
| 6 | 7 | 1 |
| 5 | 5 | 1 |
In the second data set, for example, the independent variable $X_1$ is absent, since it is expected to have little to no impact on the dependent variable $Y_2$.
Question: How do we estimate a coefficient for each predictor ($X_1$, $X_2$, and $X_3$) using one or more multiple linear regression models, taking all (or some) of the data sets into account, possibly to varying degrees?
I see two options:

1. Concatenate the data sets, impute the missing values (with zeros, for example), and perform one global multiple linear regression; or
2. Perform a multiple linear regression on each data set separately and somehow weigh the estimated coefficients, e.g. by goodness of fit, by the confidence in each coefficient estimate, or by the "validity" of each row.
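For concreteness, the first option can be sketched in a few lines of numpy using the toy data above (no intercept term; zeros imputed for the missing predictors). Note that imputing zeros implicitly asserts the missing predictor had no effect on those rows, which can bias the estimates if that assumption is wrong:

```python
import numpy as np

# The three data sets from the question, stacked into one design
# matrix with columns ordered (X1, X2, X3); 0.0 is imputed wherever
# a predictor is absent from a data set.
X = np.array([
    # data set 1: X1, X2, X3 all observed
    [ 1,  -1, 7],
    [ 4,  -5, 2],
    [ 2,  -3, 4],
    # data set 2: X1 missing -> imputed with 0
    [ 0, -10, 1],
    [ 0,  -2, 6],
    [ 0,  -4, 4],
    # data set 3: X2 missing -> imputed with 0
    [11,   0, 0],
    [ 9,   0, 0],
    [ 7,   0, 1],
    [ 5,   0, 1],
], dtype=float)
y = np.array([0, 3, 2, 5, 1, 2, 8, 7, 6, 5], dtype=float)

# One global ordinary least squares fit on the concatenated data.
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # one coefficient estimate per predictor (X1, X2, X3)
```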
My data sets are not created equal: some appear to contain less usable information than others, and they seem to throw off the coefficient estimates in a global regression model, so perhaps some data sets should be removed. I also have a rough idea of how "valid" each row is, because the predictor values ($X_1$, $X_2$, and $X_3$) were themselves estimated with multiple linear regression in an earlier step of my algorithm. For each row I have the `f_value` and `f_pvalue` of the regression that produced it, as well as the `t_values` and `p_values` of each predictor present in that regression. For example, the second data set would include the following data:

| $Y_2$ | $f_{value}$ | $f_{pvalue}$ | $t_{X_2}$ | $t_{X_3}$ | $p_{X_2}$ | $p_{X_3}$ | $X_2$ | $X_3$ |
|---|---|---|---|---|---|---|---|---|
| 5 | ... | ... | ... | ... | ... | ... | -10 | 1 |
| 1 | ... | ... | ... | ... | ... | ... | -2 | 6 |
| 2 | ... | ... | ... | ... | ... | ... | -4 | 4 |
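The second option can be sketched as weighted least squares within each data set, followed by inverse-variance (fixed-effect) pooling of each predictor's coefficient across the data sets that contain it. The sketch below uses synthetic data and hypothetical row weights (standing in for whatever validity score you derive from the `f_pvalues`), since the toy tables above happen to be fit exactly and would give zero residual variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def wls(X, y, w):
    """Weighted least squares with per-row weights w.
    Returns the coefficients and their estimated variances."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    resid = y - X @ beta
    dof = max(len(y) - X.shape[1], 1)
    sigma2 = (w * resid**2).sum() / dof
    var = sigma2 * np.diag(np.linalg.inv((X * w[:, None]).T @ X))
    return beta, var

def pool(estimates, variances):
    """Fixed-effect inverse-variance pooling of independent estimates."""
    w = 1.0 / np.asarray(variances, dtype=float)
    b = np.asarray(estimates, dtype=float)
    return float((w * b).sum() / w.sum()), float(1.0 / w.sum())

# Synthetic example: two data sets that both contain X3 (true
# coefficient 1.0), plus made-up "validity" row weights.
n = 30
X_a = rng.normal(size=(n, 2))                   # columns: X2, X3
y_a = X_a @ np.array([-0.5, 1.0]) + rng.normal(scale=0.5, size=n)
w_a = rng.uniform(0.5, 1.0, size=n)

X_b = rng.normal(size=(n, 2))                   # columns: X1, X3
y_b = X_b @ np.array([0.8, 1.0]) + rng.normal(scale=0.5, size=n)
w_b = rng.uniform(0.5, 1.0, size=n)

beta_a, var_a = wls(X_a, y_a, w_a)
beta_b, var_b = wls(X_b, y_b, w_b)

# Pool the two estimates of the X3 coefficient (column 1 in both fits).
beta_x3, var_x3 = pool([beta_a[1], beta_b[1]], [var_a[1], var_b[1]])
print(beta_x3, var_x3)
```

A nice property of the pooled estimate is that its variance is always smaller than that of either individual estimate, so each data set that contains a predictor contributes information in proportion to how precisely it pins that predictor down.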
I am a bit lost as to which method is the most statistically sound, since I have so much data I can potentially use. I have tried imputing zeros in a global multiple linear regression without much success, but maybe different imputation values would work. I can assess the success of a method, since I have a control study where I know the true values of the coefficients.
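Given the control study, comparing candidate methods can be as simple as the root-mean-square error of each method's coefficient vector against the known truth (the numbers below are made up purely for illustration):

```python
import numpy as np

# Hypothetical true coefficients from the control study, and two
# candidate estimates (e.g. global-imputation vs. pooled per-set fits).
beta_true   = np.array([0.80, -0.50, 1.00])
beta_global = np.array([0.74, -0.62, 1.10])
beta_pooled = np.array([0.79, -0.48, 1.03])

def rmse(est, true):
    """Root-mean-square error of a coefficient vector."""
    return float(np.sqrt(np.mean((est - true) ** 2)))

print(rmse(beta_global, beta_true))
print(rmse(beta_pooled, beta_true))
```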
I am not expecting to find the magic answer here on Stack Exchange; I am more looking for ideas and tips, especially with respect to which statistical quantities are useful. For instance, I have been told that `p_values` don't convey much information on their own, since they depend heavily on the number of rows in a data set (and my data sets vary considerably in size).