Linear regression model assumptions


Linear regression models have to follow two key assumptions: (1) the error terms are iid, each following a normal distribution with zero mean and variance $\sigma^2$; (2) the matrix $X$ has to be non-random and of full column rank. However, I am confused: why can we assume that the error terms are normally distributed? Also, does the second assumption imply that all the explanatory variables are independent of one another? Thanks

**Best answer**

I write the assumptions out using the acronym LINE to make them simple and easy to remember.

  1. Linearity. We assume that each independent variable $X_i$ is linearly related to the dependent variable $Y$. If this condition is not met, linear regression is considered to be inappropriate. We may transform our data to ensure that $X_i$ and $Y$ are approximately linearly related.
  2. Independence. We assume that our observations are independent of one another. (It is sometimes written that the $y_i$ are independent of one another, or that the $\varepsilon_i$ are independent of one another.)
  3. Normality. We assume that the error terms are Normally distributed. While it is possible to fit a linear regression model to data where the errors are not Normally distributed (much like it's possible to fit a linear regression model to data that clearly do not follow a linear trend), this is inadvisable and linear regression is seen as inappropriate. There are other regression models (generalized linear models is the term that is typically used) that are appropriate in cases where the error terms are not Normally distributed.
  4. Equality of Variances. We assume that the data points are homoscedastic - that is, equally scattered about the linear regression line, regardless of the value of the $X_i$ variables. If the data are heteroscedastic - the variance is not equal for all values of $X_i$, then our analyses that rely on Normality are inappropriate. It is possible to use robust standard errors (also known as Huber-White standard errors) to correct for this if this assumption doesn't hold.
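
As a quick illustration of the Normality and equal-variance checks, here is a minimal sketch (synthetic data and NumPy are my assumptions, not part of the original discussion): fit ordinary least squares and inspect the residuals.

```python
import numpy as np

# Minimal sketch (synthetic data): fit ordinary least squares and
# inspect the residuals for the Normality and equal-variance checks.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, n)  # errors iid Normal(0, 1.5^2)

X = np.column_stack([np.ones(n), x])       # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Rough diagnostics: the residual mean should be ~0, and the residual
# variance should be similar for low and high x (homoscedasticity).
lo = residuals[x < 5].var()
hi = residuals[x >= 5].var()
print(beta)      # close to [2, 3]
print(lo, hi)    # similar magnitudes
```

In practice one would also plot the residuals (a Q-Q plot for Normality, residuals versus fitted values for equal variance) rather than rely on summary numbers alone.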

While it makes sense for $X$ to be of full rank, this does not necessarily need to be the case. There are numerous benefits to $X$ being of full rank, and it allows for maximum interpretability. However, one can conduct inference on parameters without $X$ being of full rank. There are also methods (e.g., PCA) that are designed to take non-independent IVs and project them so that they are independent in your analysis.
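
The PCA idea mentioned above can be sketched as follows (the synthetic, deliberately correlated explanatory variables are an assumption of this example): projecting onto the principal components produces new variables that are uncorrelated by construction.

```python
import numpy as np

# Sketch of the PCA idea (synthetic, deliberately correlated IVs):
# project onto principal components, which are uncorrelated by construction.
rng = np.random.default_rng(1)
z = rng.normal(size=(500, 1))
X = np.column_stack([z + 0.1 * rng.normal(size=(500, 1)),
                     2 * z + 0.1 * rng.normal(size=(500, 1))])

Xc = X - X.mean(axis=0)                 # center each column
cov = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvectors = PC directions
scores = Xc @ eigvecs                   # data expressed in the PC basis

print(np.corrcoef(X.T)[0, 1])  # original columns: correlation near 1
print(np.cov(scores.T)[0, 1])  # PC scores: covariance essentially 0
```

The trade-off is interpretability: the principal components are linear combinations of the original IVs, so their regression coefficients no longer describe the original variables directly.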

To your question above, we can assume that the error terms are iid $Normal(0,\sigma^2)$, but only if this assumption makes sense. If you know from the subject material or from your data that the assumptions of independence, Normality, or equality of variances are violated, then perhaps a linear regression model is not appropriate. (While not encapsulated in your question, the linearity assumption is also very important.) In this case, I would suggest looking into ways to transform your data to ensure the conditions are met, or researching different types of models (e.g., generalized linear models) that are designed to account for data that do not follow the four LINE assumptions mentioned above.

**Another answer**
  1. Linear regression does not assume any particular distribution for the errors in order to compute the least-squares fit; they can be drawn from any distribution, i.i.d. or not. (The Normality assumption matters when you want exact confidence intervals and hypothesis tests for the coefficients.)
  2. $X$ has to be deterministic and of full column rank - yes.
  3. Also, does the second assumption imply that all the explanatory variables are independent of one another? - Please elaborate, as I don't understand the question.
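
The full-column-rank condition from point 2 is easy to check numerically; here is a small sketch (the toy matrices are assumed for illustration):

```python
import numpy as np

# If one column of X is a linear combination of the others, X'X is
# singular and the OLS estimate beta = (X'X)^{-1} X'y is not uniquely
# defined. matrix_rank detects this.
X_good = np.array([[1., 1.],
                   [1., 2.],
                   [1., 3.]])
X_bad = np.array([[1., 2.],
                  [2., 4.],
                  [3., 6.]])  # second column = 2 * first column

print(np.linalg.matrix_rank(X_good))  # 2 (full column rank)
print(np.linalg.matrix_rank(X_bad))   # 1 (rank deficient)
```

Note that full column rank rules out exact linear dependence among the columns of $X$; it does not require the explanatory variables to be uncorrelated, which may be what the original question was asking.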