Linear regression problem

Let's say researchers observe $\{E_i, D_i\}_{i=1}^n$, where $E_i$ represents a person's years of education and $D_i = \begin{cases} 0 & \text{if neither parent has a college degree} \\ 1 & \text{if at least one parent has a college degree} \end{cases}$ The researchers estimate the linear regression

$$E = B_1 + B_2D + \epsilon$$

and find that

$$\begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix} = \begin{bmatrix} 10.5 \\ 4.3 \end{bmatrix}$$ and

$$\begin{bmatrix} \widehat{\mathrm{Var}}(b_1) \\ \widehat{\mathrm{Var}}(b_2) \end{bmatrix} = \begin{bmatrix} 3.8 \\ 1 \end{bmatrix}$$

(i) What is the estimate of the expected number of years of education for a person who had at least one parent attend college?

(ii) Assume $\bar{D_n}= \frac{1}{n}\sum_i^n{D_i} = 0.56$. What is the average of $E$ in this sample?

(iii) Using a normal approximation, determine a $90$% confidence interval for $B_2$.

(iv) Can a $95$% confidence interval be found for $B_2$ using a t-approximation? If yes, find it. If not, explain why not.

My attempt:

(i) From the results, the fitted line is $\hat{E} = 10.5 + 4.3 D$. If $D = 1$, then $\hat{E} = 10.5 + 4.3 = 14.8$ years.
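As a quick numerical check (a minimal sketch; `predict` is just an illustrative helper name, not anything from the problem):

```python
# Part (i): evaluate the fitted line E-hat = b1 + b2*D at D = 1.
b1, b2 = 10.5, 4.3

def predict(d):
    # Fitted years of education for parental-degree indicator d (0 or 1)
    return b1 + b2 * d

print(predict(1))
```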

(ii) Since the OLS residuals average to zero, the fitted line passes through the sample means, so $\bar{E} = b_1 + b_2\bar{D}_n = 10.5 + (4.3 \times 0.56) = 12.908$
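The same arithmetic as a sketch:

```python
# Part (ii): the OLS fit passes through the sample means,
# so E-bar = b1 + b2 * D-bar.
b1, b2, d_bar = 10.5, 4.3, 0.56
e_bar = b1 + b2 * d_bar
print(round(e_bar, 3))  # 12.908
```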

(iii) With the normal approximation, the $90$% confidence interval is (I think) given by

$$C.I. = \left[b_2 - z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(b_2)},\; b_2 + z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(b_2)}\right]$$

We can easily find that $z_{0.05} = 1.645$. Then,

$$C.I. = [4.3 - 1.645 \sqrt{1}, 4.3 + 1.645 \sqrt{1}]$$

$$C.I. = [2.655, 5.945]$$
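As a sketch of the computation (plugging in $z_{0.05} = 1.645$):

```python
# Part (iii): 90% normal-approximation CI for B2.
b2, var_b2 = 4.3, 1.0
z = 1.645            # z_{0.05}: upper 5% point of N(0, 1)
se = var_b2 ** 0.5   # estimated standard error of b2
ci = (round(b2 - z * se, 3), round(b2 + z * se, 3))
print(ci)  # (2.655, 5.945)
```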

(iv) A t-approximation cannot be used here because the sample size $n$ is not given, so the degrees of freedom ($n - 2$ for this two-parameter regression) cannot be determined.

Is this correct? For (i), I was not sure whether I'm supposed to compute the expectation of $E$ or to do what I showed above. I'm not quite sure of the other solutions either.


Yes, everything is correct. Note that in linear regression, the predictions $\hat{y}_i$ are called "expected" values given the inputs because they are estimates of the conditional mean of $y$: we are "expecting" our estimates to represent the true value. Hope this clears up your confusion.

If you study linear regression from a machine learning perspective, you will see that the optimal predictor $f^*(x) = \hat{y}$ under the squared loss is in fact the conditional mean, $f^*(x) = \mathbb{E}[y \mid x]$.
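To illustrate (a sketch on synthetic data, not the data from the problem): with a single binary regressor, the OLS fit reproduces the two conditional means exactly, i.e. $b_1 = \bar{E}_{D=0}$ and $b_1 + b_2 = \bar{E}_{D=1}$.

```python
# Synthetic illustration: OLS with one binary regressor recovers the
# conditional means E[y | D=0] and E[y | D=1], i.e. f*(x) = E(y | x).
import random

random.seed(0)
D = [1 if random.random() < 0.56 else 0 for _ in range(1000)]
E = [10.5 + 4.3 * d + random.gauss(0, 2) for d in D]

n = len(D)
d_bar = sum(D) / n
e_bar = sum(E) / n

# Closed-form OLS slope and intercept
b2 = sum((d - d_bar) * (e - e_bar) for d, e in zip(D, E)) \
     / sum((d - d_bar) ** 2 for d in D)
b1 = e_bar - b2 * d_bar

# Conditional (group) means
mean0 = sum(e for e, d in zip(E, D) if d == 0) / D.count(0)
mean1 = sum(e for e, d in zip(E, D) if d == 1) / D.count(1)

print(abs(b1 - mean0), abs(b1 + b2 - mean1))  # both ~0 (floating-point noise)
```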