Significance level for a hypothesis test for a linear regression


Consider the linear regression model $Y_i=a+b\cdot x_i+\epsilon_i$, $i=1,2,3,4,5$, where $a,b\in\mathbb{R}$ are unknown, $x_1=x_2=1$, $x_3=3$, $x_4=x_5=5$, and the $\epsilon_i$ are iid normal with mean $0$ and variance $9$. Consider the hypothesis $H_0:b=0$ against the alternative $H_1:b\neq 0$, with critical region $\{|\hat{b}|>c\}$, where $\hat{b}$ is the maximum likelihood estimator and $c$ is chosen so that the significance level of the test equals $0.05$. Calculate $c$.

Could you help me with this exercise? It is taken from the actuarial exam organized in my country. I thought I was able to solve it, but my answer $c=0.7528$ is wrong; the correct answer is $c=1.47$.

Edited: the exercise seems very easy, but I'm sure that my method of solution is wrong, as I've seen a similar exercise and my method gives the wrong answer there as well. That's why I've decided to start a bounty; however, I do not know how (I can only see "question eligible for bounty in 59 minutes", not "start a bounty").

There are 2 best solutions below


I am having trouble making sense of this problem. In your notation, the usual regression model is $$Y_i = a + bx_i + \epsilon_i,$$ where $\epsilon_i$ are distributed $Norm(0, \sigma_\epsilon^2)$, for $i = 1, \dots, n.$

A 95% confidence interval for the slope is $$ \hat b \pm t^* s_\epsilon \sqrt{1/S_{xx}},$$ where $t^* = 3.182$ cuts off area .025 from the upper tail of Student's t distribution with $n - 2 = 3$ degrees of freedom, $$s_{\epsilon}^2 = \left[\sum_i (Y_i - \hat Y_i)^2\right]/(n-2),$$ $S_{xx} = \sum_i (x_i - \bar x)^2,$ and $\hat Y_i = \hat a + \hat bx_i$ are predicted values from the regression line. Here, $s_\epsilon^2$ estimates $\sigma_\epsilon^2.$

For this model, you would reject the null hypothesis if 0 is not contained in this interval. That is, you would reject if $|\hat b| > t^* s_\epsilon \sqrt{1/S_{xx}} = c.$

The data are sufficiently simple that the computation could be done on a calculator, but some results from Minitab statistical software are shown below for verification. In particular, $\hat b = 1.20,$ $s_{\epsilon} = 0.730297,$ and $s_{\epsilon}\sqrt{1/S_{xx}} = 0.7303\sqrt{1/10} = 0.2309.$

 The regression equation is
 y = - 0.600 + 1.20 x

 Predictor     Coef  SE Coef      T      P
 Constant   -0.6000   0.7659  -0.78  0.491
 x           1.2000   0.2309   5.20  0.014

 S = 0.730297   

The resulting value of $c = 3.182(0.2309) = 0.735 $ seems close (perhaps even within rounding error) to your value 0.7528.
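As a quick check of this arithmetic, the critical value can be reproduced numerically (a sketch using SciPy; the standard error $0.2309$ is taken from the Minitab output above):

```python
from scipy.stats import t

# Two-sided 5% critical value of Student's t with n - 2 = 3 degrees of freedom
t_star = t.ppf(0.975, df=3)

se_b = 0.2309                         # SE of the slope, from the Minitab output
c = t_star * se_b
print(round(t_star, 3), round(c, 3))  # 3.182 0.735
```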

For this usual regression model, $c = 1.47$ cannot be correct. It would imply that the null hypothesis $H_0: b = 0$ is not rejected, because $\hat b = 1.20 < 1.47.$ Minitab gives a P-value of 0.014, indicating rejection. Also, a look at the regression line through the data shows pretty clearly that a zero slope is absurd.

However, yours is not a standard model, because you are given that $\sigma_\epsilon = \sqrt{9} = 3,$ which is nowhere near the above estimate $s_\epsilon \approx 0.73.$ The only way this value of $\sigma_\epsilon$ could be taken seriously would be to claim that prior experience completely overrides the current data. And in that case, what sense does it make to use the current data to estimate $b?$

I am not sure about the distribution theory of estimating $b$ when $\sigma_\epsilon$ is known. I have experimented with several possibilities that seem reasonable, but none of them gives $c = 1.47.$


Since $\epsilon_i \sim N(0,9)$, it follows that (see the derivation below if you do not understand why)

\begin{equation} \hat{b}-b \sim N\left(0,\frac{9}{\sum_i{(x_i-\bar{x})^2}}\right), \end{equation}

and, noticing that given the numbers above $\sum_i{(x_i-\bar{x})^2}=16$, this leads to

$$ \sqrt{\frac{16}{9}} (\hat{b}-b) \sim N(0,1).$$

Finally, letting $\Phi$ denote the cumulative distribution function of the $N(0,1)$ distribution, we can calculate

$$ c = \frac{3}{4}\Phi^{-1}\left(0.975\right)=1.47$$
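This arithmetic is easy to verify numerically (a sketch using only the Python standard library):

```python
from statistics import NormalDist

x = [1, 1, 3, 5, 5]
xbar = sum(x) / len(x)                    # 3
Sxx = sum((xi - xbar) ** 2 for xi in x)   # 16

sd_bhat = (9 / Sxx) ** 0.5                # sd of b-hat is 3/4 when sigma^2 = 9
c = sd_bhat * NormalDist().inv_cdf(0.975)
print(round(c, 2))  # 1.47
```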


Derivation of the first equation

The OLS estimate of $b$ is given by

\begin{align} \hat{b} =& \; \frac{\sum_i{(x_i-\bar{x})Y_i}}{\sum_i{(x_i-\bar{x})^2}} \\ =& \; \frac{\sum_i{(x_i-\bar{x}) (a + b x_i + \epsilon_i)}}{\sum_i{(x_i-\bar{x})^2}} \\ =& \; \frac{\sum_i{(x_i-\bar{x}) (a +b\bar{x}+ b (x_i-\bar{x}) + \epsilon_i)}}{\sum_i{(x_i-\bar{x})^2}} \\ =& \; \underbrace{\frac{\sum_i{(x_i-\bar{x})}}{\sum_i{(x_i-\bar{x})^2}}}_{0}(a +b\bar{x})+\underbrace{\frac{\sum_i{(x_i-\bar{x})^2}}{\sum_i{(x_i-\bar{x})^2}}}_{1}b+\frac{\sum_i{(x_i-\bar{x})\epsilon_i}}{\sum_i{(x_i-\bar{x})^2}} \\ =& \; b + \frac{\sum_i{(x_i-\bar{x})\epsilon_i}}{\sum_i{(x_i-\bar{x})^2}}. \end{align}

Since $E[\epsilon_i]=0$, it follows straightforwardly that $\hat{b}$ is an unbiased estimator of $b$, that is,

$$E\left[\hat{b}-b\right] = 0.$$

Moreover, notice that

\begin{align} E\left[\left(\hat{b}-b\right)^2\right] =& \; E\left[\left(\frac{\sum_i{(x_i-\bar{x})\epsilon_i}}{\sum_i{(x_i-\bar{x})^2}}\right)^2\right]\\ =& \; E\left[\frac{\sum_i{(x_i-\bar{x})^2\epsilon_i^2} + \sum_{i\neq j}{(x_i-\bar{x})(x_j-\bar{x})\epsilon_i\epsilon_j}}{\left(\sum_i{(x_i-\bar{x})^2}\right)^2}\right]\\ =& \; \frac{\sum_i{(x_i-\bar{x})^2E[\epsilon_i^2]} + \sum_{i\neq j}{(x_i-\bar{x})(x_j-\bar{x})E[\epsilon_i\epsilon_j]}}{\left(\sum_i{(x_i-\bar{x})^2}\right)^2}\\ =& \; \frac{\sum_i{(x_i-\bar{x})^2}}{\left(\sum_i{(x_i-\bar{x})^2}\right)^2}\underbrace{E[\epsilon_i^2]}_{9}+\frac{\sum_{i\neq j}{(x_i-\bar{x})(x_j-\bar{x})}}{\left(\sum_i{(x_i-\bar{x})^2}\right)^2}\underbrace{E[\epsilon_i\epsilon_j]}_{0}\\ =& \; \frac{9}{\sum_i{(x_i-\bar{x})^2}}.\\ \end{align}
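This variance can also be checked by simulation (a sketch using NumPy; the true intercept and slope below are arbitrary choices, since the sampling distribution of $\hat{b}-b$ does not depend on them):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1, 1, 3, 5, 5], dtype=float)
a, b, sigma = -0.6, 1.2, 3.0              # arbitrary true parameters; sigma^2 = 9
Sxx = ((x - x.mean()) ** 2).sum()         # 16

n_sims = 200_000
eps = rng.normal(0.0, sigma, size=(n_sims, len(x)))
Y = a + b * x + eps

# OLS slope for each simulated sample: sum_i (x_i - xbar) Y_i / Sxx
bhat = ((x - x.mean()) * Y).sum(axis=1) / Sxx

print(bhat.mean())  # close to b = 1.2 (unbiasedness)
print(bhat.var())   # close to 9 / 16 = 0.5625
```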


Obs.: Student's t-test should replace the above procedure only when the variance of the statistic (in this case $\hat{b}$) is not known and must be estimated from the data.