I'm learning about the OLS estimator and I'm having difficulty with the $R^2$ decomposition. First, here is the notation used in my lecture notes:
$X_{i} \equiv\left(\begin{array}{c}{X_{i 1}} \\ {X_{i 2}} \\ {\vdots} \\ {X_{i K}}\end{array}\right)$ of size $K \times 1$, $\beta \equiv\left(\begin{array}{c}{\beta_{1}} \\ {\beta_{2}} \\ {\vdots} \\ {\beta_{K}}\end{array}\right)$ of size $K \times 1$, and $\epsilon \equiv\left(\begin{array}{c}{\epsilon_{1}} \\ {\epsilon_{2}} \\ {\vdots} \\ {\epsilon_{n}}\end{array}\right)$ of size $n \times 1$.
$X \equiv\left(\begin{array}{cccc}{X_{11}} & {X_{12}} & {\cdots} & {X_{1 K}} \\ {X_{21}} & {X_{22}} & {\cdots} & {X_{2 K}} \\ {\vdots} & {\vdots} & {\vdots} & {\vdots} \\ {X_{n 1}} & {X_{n 2}} & {\cdots} & {X_{n K}}\end{array}\right)$ of size $n \times K$, and $Y \equiv\left(\begin{array}{c}{Y_{1}} \\ {Y_{2}} \\ {\vdots} \\ {Y_{n}}\end{array}\right)$ of size $n \times 1$
My model is $Y_{i}=X_{i}^{\prime} \beta+\epsilon_{i}$ for $i=1, \ldots, n$ or equivalently $Y=X \beta+\epsilon$. From FOC, we have $$\hat{\beta} =\left(X^{\prime} X\right)^{-1} X^{\prime} Y =\left(\sum_{i=1}^{n} X_{i} X_{i}^{\prime}\right)^{-1}\left(\sum_{i=1}^{n} X_{i} Y_{i}\right) =\left(\frac{1}{n} \sum_{i=1}^{n} X_{i} X_{i}^{\prime}\right)^{-1}\left(\frac{1}{n} \sum_{i=1}^{n} X_{i} Y_{i}\right)$$
Let $P_X = X\left(X^{\prime} X\right)^{-1} X^{\prime}$. Then the OLS fitted values $\hat{Y} \equiv X \hat{\beta}= P_XY$ and $\hat \epsilon \equiv Y - \hat Y$.
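As a quick numerical sanity check of these definitions (a sketch using NumPy; the data and variable names are mine, not from the lecture notes), one can verify that $\hat Y = X\hat\beta = P_X Y$ and that $P_X$ is idempotent:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 3

# Design matrix with an intercept column (X_{i1} = 1) and random regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([2.0, -1.0, 0.5])
Y = X @ beta + rng.normal(size=n)

# OLS estimate from the normal equations: beta_hat = (X'X)^{-1} X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Projection ("hat") matrix P_X and the fitted values / residuals.
P_X = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = X @ beta_hat
resid = Y - Y_hat

# X @ beta_hat and P_X @ Y are the same vector, and P_X is idempotent.
assert np.allclose(Y_hat, P_X @ Y)
assert np.allclose(P_X @ P_X, P_X)
```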
Then I have an exercise:
If there is intercept, i.e. $X_{i1} = 1$ for all $i=1,\ldots,n$, then $$\sum_{i=1}^{n}\left(Y_{i}-\overline{Y}\right)^{2} - \sum_{i=1}^{n} \hat{\epsilon}_{i}^{2}=\sum_{i=1}^{n}\left(\hat{Y}_{i}-\overline{Y}\right)^{2}$$ where $$\overline Y = \frac{1}{n} \sum_{i=1}^{n} Y_{i}$$
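Before proving it, the claimed identity can be checked numerically (a NumPy sketch under my own simulated data; `TSS`, `RSS`, `ESS` are just labels for the three sums in the exercise):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Model with an intercept: the first column of X is all ones.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
resid = Y - Y_hat
Y_bar = Y.mean()

TSS = np.sum((Y - Y_bar) ** 2)      # total sum of squares
RSS = np.sum(resid ** 2)            # residual sum of squares
ESS = np.sum((Y_hat - Y_bar) ** 2)  # explained sum of squares

# The identity TSS - RSS = ESS holds (up to floating-point error).
assert np.isclose(TSS - RSS, ESS)
```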
My attempt:
We have $$\begin{aligned} \sum_{i=1}^{n} (Y_{i}-\overline{Y})^{2} - \sum_{i=1}^{n} \hat{\epsilon}_{i}^{2} &= \sum_{i=1}^{n}(Y^2_{i}-2Y_i\overline{Y} +\overline{Y}^2) - \sum_{i=1}^{n} {(Y_i - \hat Y_i)}^{2}\\ &= \sum_{i=1}^{n} (Y^2_{i}-2Y_i\overline{Y} +\overline{Y}^2) - \sum_{i=1}^{n} (Y^{2}_i -2Y_i \hat Y_i+ \hat Y^2_i) \\&= \sum_{i=1}^{n} Y_i^2 -2\overline Y \sum_{i=1}^{n} Y_i +n \overline Y^2 - \sum_{i=1}^{n} Y_i^2 +2\sum_{i=1}^{n} Y_i \hat Y_i - \sum_{i=1}^{n} \hat Y_i^2 \\ &= -2n\overline Y^2 + n \overline Y^2 +2\sum_{i=1}^{n} Y_i \hat Y_i - \sum_{i=1}^{n} \hat Y_i^2 \\ &=-n\overline Y^2 +2\sum_{i=1}^{n} Y_i \hat Y_i - \sum_{i=1}^{n} \hat Y_i^2\end{aligned}$$
After that, I'm stuck on how to use the fact that the model has an intercept. Could you please help me finish the proof? Thank you so much!
You can start by showing the orthogonal decomposition, i.e., $$ \sum ( Y_i - \bar{Y})^2 = \sum(\hat{Y}_i - \bar{Y})^2 + \sum \hat{\epsilon}_i^2 $$ and then just rearrange the equation. So, start with \begin{align} \sum ( Y_i - \bar{Y})^2& = \sum ( Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2\\ &= \sum ( \hat{Y}_i - \bar{Y})^2 + \sum ( Y_i - \hat{Y}_i)^2 + 2\sum(\hat{Y}_i - \bar{Y})(Y_i-\hat{Y}_i) \end{align} where $$ \sum(\hat{Y}_i - \bar{Y})(Y_i-\hat{Y}_i) = \sum (X_i'\hat{\beta}-\bar{Y})\hat{\epsilon}_i=\hat{\beta}'\sum X_i\hat{\epsilon}_i-\bar{Y}\sum\hat{\epsilon}_i=0-0=0. $$ The two zeroes come from the first-order conditions of the least-squares problem. Algebraically, the column space of $X$ is orthogonal to the residuals $\hat{\epsilon}$, so $\sum_i X_i\hat{\epsilon}_i = X'\hat{\epsilon} = 0$, and $\sum \hat{\epsilon}_i = 0$ is the component of this condition corresponding to the intercept (i.e., the partial derivative w.r.t. the intercept coefficient, $\beta_1$ in your notation since $X_{i1}=1$). Namely, the objective is $\min_\beta \|Y-X\beta\|^2$, so taking the derivative w.r.t. $\beta_1$ and setting it to zero at $\hat\beta$ gives $$ -2\sum_{i=1}^n \Big(Y_i - \hat{\beta}_1 - \sum_{j=2}^K\hat{\beta}_j X_{ij}\Big) = -2 \sum_{i=1}^n\hat{\epsilon}_i = 0. $$
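The role of the intercept can also be seen numerically (a NumPy sketch on simulated data of my own; `ols_resid` is a hypothetical helper, not from the notes): with an intercept column the residuals sum to zero, while without one, $X'\hat\epsilon = 0$ still holds but $\sum_i \hat\epsilon_i$ is generally nonzero, so the decomposition can fail.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
Z = rng.normal(size=(n, 2))
Y = 1.0 + Z @ np.array([2.0, -0.5]) + rng.normal(size=n)

def ols_resid(X, Y):
    """Residuals from the OLS fit of Y on the columns of X."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    return Y - X @ beta_hat

# With an intercept column the residuals sum to zero...
X_with = np.column_stack([np.ones(n), Z])
e_with = ols_resid(X_with, Y)
assert np.isclose(e_with.sum(), 0.0, atol=1e-8)
# ...and X' e_hat = 0: residuals are orthogonal to the column space of X.
assert np.allclose(X_with.T @ e_with, 0.0, atol=1e-8)

# Without the intercept, Z' e_hat = 0 still holds, but the residuals
# need not sum to zero.
e_without = ols_resid(Z, Y)
assert np.allclose(Z.T @ e_without, 0.0, atol=1e-8)
assert abs(e_without.sum()) > 1e-6
```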