Can linearly dependent rows be removed in linear regression?


Concerning multiple linear regression: if we have $n$ features and $m$ observations, with $m > n$ (design matrix $A_{m \times n}$), is it possible to exclude some linearly dependent observations (rows) without compromising the model? Why?

Is it a problem to have linearly dependent observations?
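To make the row/column distinction concrete, here is a minimal NumPy sketch (the matrix values are made up for illustration). A design matrix can have linearly dependent rows while its columns stay independent, so the normal-equations matrix $A^TA$ remains full rank and the least-squares fit is still well defined:

```python
import numpy as np

# Design matrix with m = 5 observations (rows) and n = 3 features (columns).
A = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],   # linearly dependent on row 0 (a scalar multiple)
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 2.0],
    [3.0, 1.0, 0.0],
])

# Row rank equals column rank; the 3 columns here are independent,
# so A (and hence the Gram matrix A^T A) has full column rank even
# though two rows are linearly dependent.
print(np.linalg.matrix_rank(A))        # 3
print(np.linalg.matrix_rank(A.T @ A))  # 3
```

With full column rank, $A^TA$ is invertible and the coefficient estimates exist; the dependent rows affect only how much weight each observation carries, not whether the model can be fit.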

Best answer:

No. Linearly dependent *columns* mean you simply recorded the same information twice (one variable is a linear transformation of another), so one of them can be dropped. Identical *rows*, in contrast, are repeated observations. For example, when observing a periodic phenomenon you may record the same observation once every period; discarding the repeats is the same as discarding the main "signal" in the data. A more technical aspect of repeated observations is that, under the alternative hypothesis, the model-significance measures ($p$-value and $F_{stat}$) are monotonically non-increasing and non-decreasing functions (respectively) of the number of repetitions. Namely, $$ F_{stat} = \frac{\sum_{i=1}^{n_1} ( \hat{y}_i - \bar{y})^2/p }{\sum_{i=1}^{n_1} ( \hat{y}_i - y_i)^2/(n_1-p-1)} = \frac{MSReg(n_1)}{MSE(n_1)}, $$
hence, if for $n_2 > n_1$ the extra $n_2 - n_1$ observations are duplicates of existing ones, $$ MSReg(n_2) =\sum_{i=1}^{n_2} ( \hat{y}_i - \bar{y})^2/p \ge \sum_{i=1}^{n_1} ( \hat{y}_i - \bar{y})^2/p = MSReg(n_1) , $$ while the residual variance estimate is essentially unchanged (the residual sum of squares and the degrees of freedom grow together), $$ MSE(n_2) = \hat{\sigma}^2_{n_2} \approx \sum_{i=1}^{n_1} ( \hat{y}_i - y_i)^2/(n_1-p-1) = \hat{\sigma}^2_{n_1} = MSE(n_1). $$
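The monotonicity above can be checked numerically. A minimal sketch (synthetic data; the `f_statistic` helper is written here for illustration, not a library function) that duplicates every observation and recomputes the overall $F$ statistic:

```python
import numpy as np

def f_statistic(X, y):
    """Overall-regression F statistic: MSReg / MSE for p predictors plus intercept."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])           # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # OLS fit
    yhat = Xd @ beta
    ms_reg = np.sum((yhat - y.mean()) ** 2) / p
    mse = np.sum((yhat - y) ** 2) / (n - p - 1)
    return ms_reg / mse

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(size=30)

f1 = f_statistic(X, y)
# Duplicate every observation: same fitted coefficients, same "signal",
# but more apparent data, so the F statistic grows.
f2 = f_statistic(np.vstack([X, X]), np.concatenate([y, y]))
print(f1, f2)  # f2 > f1
```

Duplicating the sample leaves the coefficient estimates unchanged while both sums of squares double; since the error degrees of freedom more than double ($n_1-p-1 \to 2n_1-p-1$), the $F$ statistic strictly increases, matching the inequality above.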