Multicollinearity: Linear Regression and Adding a Squared Term


I'm confused with the provided answer to this problem:

A researcher estimates a regression (with intercept) of log earnings on age, years of education, and work experience: $$ \ln EARN=\beta_{1}+\beta_{2} AGE+\beta_{3} EDUC+\beta_{4} EXP+\varepsilon $$ Since the data do not include work experience, the researcher proposes to use potential work experience, defined as: $$ EXP=AGE-EDUC-6 $$

Estimating this first model, Stata reports an error due to multicollinearity. However, if we use the square of work experience instead, will Stata still report an error? Why (not)?

Answer: Because $$ EXP^{2}=AGE^{2}+EDUC^{2}+36-2\,AGE \cdot EDUC-12\,AGE+12\,EDUC $$ substitution into $\ln EARN=\beta_{1}+\beta_{2} AGE+\beta_{3} EDUC+\beta_{4} EXP^{2}+\varepsilon$ gives

the regression model $$ \begin{array}{l}\ln EARN=\beta_{1}+36 \beta_{4}+\left(\beta_{2}-12 \beta_{4}\right) AGE+\left(\beta_{3}+12 \beta_{4}\right) EDUC+ \\ \beta_{4} AGE^{2}+\beta_{4} EDUC^{2}-2 \beta_{4}\, EDUC \cdot AGE+\varepsilon\end{array} $$ Now we can recover $\beta_4$ as the coefficient on $AGE^2$ (or on $EDUC^2$, which provides a testable implication: both coefficients must equal $\beta_4$). Therefore we can also recover $\beta_{1}, \beta_{2}, \beta_{3}$. If you substitute $EXP$ itself into the first model, you will see that the parameters cannot be recovered.

What does he explicitly mean by 'recovering' the betas and what does that have to do with multicollinearity? What does he mean that you cannot recover it when substituting into the first regression model?

Accepted answer:

For sanity's sake, I will call $y:= \ln(EARN)$, $x_1:=AGE$, $x_2:=EDUC$, $x_3:=EXP$. We are presented with the linear model $$y=b_0+b_1x_1+b_2x_2+b_3x_3+\varepsilon$$ And in matrix form $$\mathbf{y}=\mathbf{X}\mathbf{b}+\mathbf{e}$$ Where $\mathbf{X}=[\mathbf{1},\mathbf{x}_1,\mathbf{x}_2,\mathbf{x}_3]$ and $\mathbf{b}=[b_0,b_1,b_2,b_3]^T$. The proposal to substitute $\mathbf{x}_3$ with the naive $\hat{\mathbf{x}}_3:=\mathbf{x}_1-\mathbf{x}_2-6$ creates a column that is a linear combination of the other ones, making the linear regression infeasible: $\mathbf{X}^T\mathbf{X}$ is singular, so the OLS normal equations have no unique solution.
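As a sanity check, here is a minimal numeric sketch of that rank deficiency (the variable ranges and sample size are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(25, 60, n)            # hypothetical AGE values
educ = rng.uniform(8, 20, n)            # hypothetical EDUC values
exp_ = age - educ - 6                   # EXP is an exact linear combination

# Design matrix X = [1, x1, x2, x3]: four columns, but only rank 3.
X = np.column_stack([np.ones(n), age, educ, exp_])
print(np.linalg.matrix_rank(X))         # 3: one column is redundant
print(np.linalg.matrix_rank(X.T @ X))   # 3: X'X is singular, OLS breaks down
```

Any regression routine that needs to invert $\mathbf{X}^T\mathbf{X}$ (or that checks the rank first, as Stata does) will refuse to estimate all four coefficients.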

The alternative $\hat{x}^2_3$ instead expands to $$\hat{x}^2_3=x_1^2-2x_1x_2-12x_1+x_2^2+12x_2+36$$

So the regression model (substituting $\hat{x}^2_3$ for $x_3$) becomes $$y=\underbrace{b_0+36b_3}_{\alpha_0}+\underbrace{(b_1-12b_3)}_{\alpha_1}x_1+\underbrace{(b_2+12b_3)}_{\alpha_2}x_2+\underbrace{b_3}_{\alpha_3}x_1^2+\underbrace{b_3}_{\alpha_4} x_2^2+\underbrace{(-2b_3)}_{\alpha_5}x_1x_2+\varepsilon$$

and in matrix form

$$\mathbf{y}=\mathbf{H}\boldsymbol{\alpha}+\mathbf{e}$$

where $\mathbf{H}=[\mathbf{1},\mathbf{x}_1,\mathbf{x}_2,\mathbf{x}_1^2,\mathbf{x}_2^2,\mathbf{x}_1\mathbf{x}_2]$ (the squares and products are element-wise) and $\boldsymbol{\alpha}=[\alpha_0,\alpha_1,\ldots,\alpha_5]^T$. No column of $\mathbf{H}$ is a linear combination of the others, so this regression can be estimated, and we can 'recover' the original parameters by solving the system for the unknowns $b_0,b_1,b_2,b_3$:

$$\alpha_3 \textrm{ (or $\alpha_4$) } =b_3$$ $$\alpha_0=b_0+36b_3$$ $$\alpha_1=b_1-12b_3$$ $$\alpha_2=b_2+12b_3$$

(That $\alpha_3$, $\alpha_4$ and $-\alpha_5/2$ must all equal $b_3$ is a testable restriction.)
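To see the recovery work numerically, here is a small simulation (the parameter values, variable ranges and noise level are invented): fit the six-column model by least squares, then back out $b_0,\dots,b_3$ from the estimated coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
b0, b1, b2, b3 = 1.0, 0.05, 0.10, 0.002   # "true" parameters (made up)
x1 = rng.uniform(25, 60, n)               # AGE
x2 = rng.uniform(8, 20, n)                # EDUC
y = b0 + b1*x1 + b2*x2 + b3*(x1 - x2 - 6)**2 + rng.normal(0, 0.1, n)

# H = [1, x1, x2, x1^2, x2^2, x1*x2] is full rank, so OLS is feasible.
H = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1*x2])
alpha, *_ = np.linalg.lstsq(H, y, rcond=None)

# Recover the original parameters from the estimated coefficients.
b3_hat = alpha[3]              # coefficient on x1^2 (that on x2^2 should agree)
b0_hat = alpha[0] - 36*b3_hat
b1_hat = alpha[1] + 12*b3_hat
b2_hat = alpha[2] - 12*b3_hat
print(b0_hat, b1_hat, b2_hat, b3_hat)   # estimates near the true values, up to noise
```

The agreement between the coefficients on $x_1^2$ and $x_2^2$ is the testable implication mentioned above: if the squared-experience specification is right, the two estimates should be statistically indistinguishable.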


As for substituting $EXP$ itself into the first model, you get

$$y=b_0+b_1x_1+b_2x_2+b_3(x_1-x_2-6)+\varepsilon=$$

$$=\underbrace{b_0-6b_3}_{\beta_0}+\underbrace{(b_1+b_3)}_{\beta_1}x_1+\underbrace{(b_2-b_3)}_{\beta_2}x_2+\varepsilon$$

If you estimate the $\beta_k$'s, you have only three equations in the four unknowns $b_0,b_1,b_2,b_3$: the system is underdetermined, so you cannot solve it to recover the original parameters.
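The failure can be phrased as a rank condition. The identifiable coefficients are a linear map of the original parameters, and in the first model that map's matrix has rank 3 with 4 unknowns, so no unique solution exists. A small check (the matrix just transcribes the three equations above):

```python
import numpy as np

# beta0 = b0 - 6*b3, beta1 = b1 + b3, beta2 = b2 - b3,
# i.e. beta = A @ b with b = (b0, b1, b2, b3):
A = np.array([[1, 0, 0, -6],
              [0, 1, 0,  1],
              [0, 0, 1, -1]])

# Three equations, four unknowns: rank 3 < 4, so b is not identified.
print(np.linalg.matrix_rank(A), A.shape[1])   # 3 4
```

By contrast, the analogous map from $(b_0,b_1,b_2,b_3)$ to the six coefficients of the squared-experience model has rank 4, which is exactly why the parameters are recoverable there.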