I am trying to prove the well-known formula for simple linear regression
$$SS_{TOTAL}=SS_{MODEL}+SS_{ERROR},$$
i.e.
$$\sum_{i=1}^n (y_i - \bar{y})^2 =\sum_{i=1}^n (\hat{y}_i-\bar{y})^2+ \sum_{i=1}^n(\hat{y}_i - y_i)^2, $$
and I'm having more trouble than I care to admit. I go down the following road:
\begin{align*}
\sum_{i=1}^n (\hat{y}_i-\bar{y})^2+ \sum_{i=1}^n(\hat{y}_i - y_i)^2 &= \sum_{i=1}^{n}(\hat{y}_i^2-2\hat{y}_i\bar{y}+\bar{y}^2) +\sum_{i=1}^{n}(\hat{y}_i^2-2y_i\hat{y}_i+y_i^2)\\
&=\sum_{i=1}^n (\hat{y}_i^2)-2\bar{y}^2n + \bar{y}^2n + \sum_{i=1}^n(\hat{y}_i^2) - \sum_{i=1}^n(2y_i\hat{y}_i)+\sum_{i=1}^n(y_i^2)\\
&=\sum_{i=1}^n (y_i^2)-2\bar{y}^2n + \bar{y}^2n + \sum_{i=1}^n(\hat{y}_i^2) - \sum_{i=1}^n(2y_i\hat{y}_i)+\sum_{i=1}^n(\hat{y}_i^2)\\
&=\sum_{i=1}^n (y_i^2-2y_i\bar{y} + \bar{y}^2) + 2\sum_{i=1}^n(\hat{y}_i^2 - y_i\hat{y}_i)\\
&=SS_{TOTAL}+2\sum_{i=1}^n(\hat{y}_i^2 - y_i\hat{y}_i)
\end{align*}
but I can't see any reason why the right-hand term must be zero. Any help redirecting this ship would be greatly appreciated.
Proof for Simple Linear Regression: What am I doing wrong?
Asked by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail). There are 2 best solutions below.
The linear regression line of $y$ on $x$ is of the form
$$\hat y=\bar y+a(x-\bar x),$$
where $a=rs_y/s_x$, $r$ being the correlation coefficient between $x$ and $y$, and $s_y$ and $s_x$ denoting the standard deviations of $y$ and $x$ respectively.
So for the $i$th observation we have $$\hat y_i=\bar y+a(x_i-\bar x)\quad,\,i=1,2,\ldots,n$$
Summing over all observations and using $\sum_{i=1}^n(x_i-\bar x)=0$, $$\sum_{i=1}^n\hat y_i=n\bar y,\quad\text{ i.e. }\quad\overline{\hat y}=\overline y$$
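As a quick numerical sanity check of this fact (not part of the argument, just an illustration on simulated data, fitting the line with NumPy's `polyfit`):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)

# Fit the least-squares line y_hat = b0 + b1*x
# (np.polyfit returns coefficients from highest degree down)
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# The mean of the fitted values equals the mean of y
print(np.isclose(y_hat.mean(), y.mean()))  # True
```

This holds because $\hat b_0 = \bar y - \hat b_1\bar x$, so averaging the fitted values recovers $\bar y$ exactly.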
Now we can do the following simple algebra:
\begin{align} y_i&=\hat y_i+(y_i-\hat y_i) \\\implies y_i-\bar y&=(\hat y_i-\overline{\hat y})+(y_i-\hat y_i) \end{align}
Squaring both sides and summing over $i$, $$\sum_{i=1}^n(y_i-\bar y)^2=\sum_{i=1}^n(\hat y_i-\overline{\hat y})^2+\sum_{i=1}^n(y_i-\hat y_i)^2+2\sum_{i=1}^n(\hat y_i-\overline{\hat y})(y_i-\hat y_i)$$
Now show that the product term vanishes:
\begin{align} \sum_{i=1}^n(\hat y_i-\overline{\hat y})(y_i-\hat y_i)&=\sum_{i=1}^n(a(x_i-\bar x))\left((y_i-\bar y)-a(x_i-\bar x)\right) \\&=a\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)-a^2\sum_{i=1}^n(x_i-\bar x)^2 \\&=a\left(n\operatorname{Cov}(x,y)-na\operatorname{Var}(x)\right) \\&=a(nrs_xs_y-nrs_xs_y) \\&=0 \end{align}
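The whole decomposition can be verified numerically; the sketch below (simulated data, line fitted with NumPy's `polyfit`) checks that the cross term vanishes and that $SS_{TOTAL}=SS_{MODEL}+SS_{ERROR}$ up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.5 - 0.7 * x + rng.normal(size=100)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Cross term from the expansion above
cross = np.sum((y_hat - y_hat.mean()) * (y - y_hat))

ss_total = np.sum((y - y.mean()) ** 2)
ss_model = np.sum((y_hat - y.mean()) ** 2)
ss_error = np.sum((y - y_hat) ** 2)

print(np.isclose(cross, 0.0, atol=1e-8))            # True
print(np.isclose(ss_total, ss_model + ss_error))    # True
```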
Writing $\hat{\varepsilon}_i=\hat{y}_i - y_i$ for the residuals,
\begin{align*} \sum_{i=1}^n(\hat{y}_i^2 - y_i\hat{y}_i)&=\sum_{i=1}^n[\hat{y}_i(\hat{y}_i - y_i)]\\ &=\sum_{i=1}^n[(\hat{\beta}_0+\hat{\beta}_1x_{i1}+\cdots +\hat{\beta}_{p}x_{ip})\hat{\varepsilon}_i]\\ &=0 \end{align*}
where the last line follows from the OLS normal equations: when the model includes an intercept, the residuals sum to zero ($\sum_{i=1}^n\hat{\varepsilon}_i=0$) and are orthogonal to each regressor ($\sum_{i=1}^n x_{ij}\hat{\varepsilon}_i=0$ for every $j$), so every one of the $p+1$ sums obtained by distributing the product vanishes.
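This orthogonality can be checked numerically in the general multiple-regression setting; a sketch on simulated data (design matrix with an intercept column, coefficients via NumPy's `lstsq`):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 80, 3
# Design matrix: intercept column plus p regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
resid = y_hat - y  # the answer's sign convention for the residuals

# Normal equations: residuals orthogonal to every column of X,
# hence sum(y_hat * (y_hat - y)) = 0 up to floating-point error
print(np.isclose(X.T @ resid, 0.0, atol=1e-8).all())       # True
print(np.isclose(np.sum(y_hat * resid), 0.0, atol=1e-8))   # True
```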
By the way, you can find a different proof of the result on Wikipedia. (It starts from the LHS of your first equation rather than the RHS.)