Concerning multiple linear regression, if we have $n$ features and $m$ inputs with $m > n$ (so the design matrix is $A_{m \times n}$), is it possible to exclude some linearly dependent inputs without compromising the model? Why?
Is it a problem to have dependent inputs?
No. Unlike linearly dependent columns, which mean that you simply recorded the same information (a linearly transformed variable), identical rows are identical observations. E.g., you may observe some periodic phenomenon, in which case the same observations recur regularly every period. Discarding them is the same as discarding the main "signal" in the data. A more technical aspect of identical observations is that, under the alternative hypothesis, the model significance measures ($p$-value and $F_{stat}$) are monotonically non-increasing and non-decreasing functions (respectively) of the number of repetitions. Namely, with $n_1$ observations and $p$ predictors, $$ F_{stat} = \frac{\sum_{i=1}^{n_1} ( \hat{y}_i - \bar{y})^2/p }{\sum_{i=1}^{n_1} ( \hat{y}_i - y_i)^2/(n_1-p-1)} = \frac{MSReg(n_1)}{MSE(n_1)}, $$
hence, if for some $N \in \mathbb{N}$ we have $n_2 > n_1 > N$, where the extra $n_2 - n_1$ rows are duplicates of existing observations, then $$ MSReg(n_2) =\sum_{i=1}^{n_2} ( \hat{y}_i - \bar{y})^2/p \ge \sum_{i=1}^{n_1} ( \hat{y}_i - \bar{y})^2/p = MSReg(n_1) , $$ while $$ MSE(n_2) = \hat{\sigma}^2_{n_2} \approx \sum_{i=1}^{n_1} ( \hat{y}_i - y_i)^2/(n_1-p-1) = \hat{\sigma}^2_{n_1} = MSE(n_1), $$ so the ratio $F_{stat} = MSReg/MSE$ does not decrease (approximately) as duplicates are added.
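To see the effect numerically, here is a minimal sketch in Python with NumPy. It is not part of the original argument: the simulated data, the coefficients, and the helper name `f_statistic` are all assumptions chosen for illustration. It stacks the whole data set $k$ times (a special case of adding duplicates) and recomputes $MSReg$, $MSE$ and $F_{stat}$ from the formulas above.

```python
import numpy as np

def f_statistic(X, y):
    """Return (F, MSReg, MSE) for an OLS fit with intercept.

    X has shape (n, p): n observations (rows), p predictors (columns).
    Hypothetical helper written for this illustration only.
    """
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit
    y_hat = A @ beta
    ms_reg = np.sum((y_hat - y.mean()) ** 2) / p
    mse = np.sum((y_hat - y) ** 2) / (n - p - 1)
    return ms_reg / mse, ms_reg, mse

# Simulated data (assumed values, for illustration only).
rng = np.random.default_rng(0)
n, p = 30, 2
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(size=n)

results = {}
for k in (1, 2, 4):
    Xk = np.tile(X, (k, 1))   # stack the same rows k times
    yk = np.tile(y, k)        # and the same responses
    results[k] = f_statistic(Xk, yk)
    F, ms_reg, mse = results[k]
    print(f"k={k}: MSReg={ms_reg:.3f}  MSE={mse:.3f}  F={F:.3f}")
```

In this setup the fitted coefficients do not change, so $MSReg$ scales exactly with $k$ while $MSE$ stays roughly constant, and $F_{stat}$ grows by the factor $(k n - p - 1)/(n - p - 1)$. The model looks more significant even though no new information was added.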