I stumbled upon the following formula for the coefficient of determination:
$$1-R_{y(x_1,x_2...x_n)}^2=\left(1-\rho_{y,x_1}^2\right)\left(1-\rho_{y,x_2(x_1)}^2\right)\left(1-\rho_{y,x_3(x_1,x_2)}^2\right)\,\cdots\,\left(1-\rho_{y,x_n(x_1,x_2...x_{n-1})}^2\right).$$
where $R_{y(x_1,x_2,...,x_n)}$ is the coefficient of determination associated with the multiple linear regression of $y$ on $\{x_1,x_2,...,x_n\}$ and $\rho_{y,x_p(x_1,x_2,...,x_{p-1})}$ is the partial correlation between $y$ and $x_p$ controlling for $x_1,x_2,...,x_{p-1}$. Although this intuitively makes sense, would anyone have a proof of this formula?
I had a go by starting with the regression model $$y=\boldsymbol{\beta}^T\mathbf{x}+\epsilon,$$ where $\boldsymbol{\beta}$ is the vector of regression coefficients and $\epsilon$ is the error term. Then $$1-R_{y(x_1,x_2,...,x_n)}^2=E[\epsilon^2]/\sigma_y^2,$$ where $\sigma_y^2$ is the variance of $y$. One can then try to look at $$1-R_{y(x_1,x_2,...,x_n)}^2=\frac{1}{\sigma_y^2}E\left[\left(y-\boldsymbol{\beta}^T\mathbf{x}\right)^2\right]$$ by rewriting $\boldsymbol{\beta}$ in terms of the correlations and standard deviations between the explanatory variables $\mathbf{x}$ and the dependent variable $y$. However, this seems very long-winded, especially as the result would then have to be refactored in terms of the partial correlations appearing in the final formula. So I was wondering if anyone knows a better, perhaps recursive/inductive, approach, starting with one-variable regression and adding variables one at a time.
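In case it helps, the identity is easy to check numerically. Below is a quick NumPy sketch (the helper names `r2` and `partial_corr` are mine); the partial correlations are computed in the usual way, as correlations between OLS residuals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y depends linearly on three correlated predictors plus noise.
n, p = 20_000, 3
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))  # correlated columns
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

def r2(y, X):
    """Coefficient of determination of an OLS fit (with intercept) of y on X."""
    X1 = np.column_stack([np.ones(len(y)), X])
    resid = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
    return 1.0 - resid.var() / y.var()

def partial_corr(y, x, Z):
    """Partial correlation of y and x controlling for the columns of Z."""
    if Z.shape[1] == 0:
        return np.corrcoef(y, x)[0, 1]  # no controls: ordinary correlation
    Z1 = np.column_stack([np.ones(len(y)), Z])
    ry = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
    rx = x - Z1 @ np.linalg.lstsq(Z1, x, rcond=None)[0]
    return np.corrcoef(ry, rx)[0, 1]

# Left side: 1 - R^2 from the full regression.
lhs = 1.0 - r2(y, X)
# Right side: product of (1 - partial correlation^2), adding one variable at a time.
rhs = np.prod([1.0 - partial_corr(y, X[:, j], X[:, :j]) ** 2 for j in range(p)])
print(lhs, rhs)
```

The two printed numbers agree to machine precision, since the identity holds exactly for the in-sample OLS quantities, not just in expectation.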
Thanks for the help!
I think I managed a proof by induction; let me know what you think.
Statement to prove for all positive integers $n$: $$1-R_{y(x_1,x_2...x_n)}^2=\left(1-\rho_{y,x_1}^2\right)\left(1-\rho_{y,x_2(x_1)}^2\right)\left(1-\rho_{y,x_3(x_1,x_2)}^2\right)\,\cdots\,\left(1-\rho_{y,x_n(x_1,x_2...x_{n-1})}^2\right).$$
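To make the induction self-contained, here is the base case spelled out (assuming centred variables, so the intercept can be dropped). For $n=1$, the simple regression $y=\beta x_1+\epsilon_1$ has least-squares coefficient $\beta=\operatorname{Cov}(y,x_1)/\sigma_{x_1}^2=\rho_{y,x_1}\sigma_y/\sigma_{x_1}$, so
$$E[\epsilon_1^2]=\sigma_y^2-2\beta\operatorname{Cov}(y,x_1)+\beta^2\sigma_{x_1}^2=\sigma_y^2-\beta^2\sigma_{x_1}^2=\sigma_y^2\left(1-\rho_{y,x_1}^2\right),$$
and therefore $1-R_{y(x_1)}^2=E[\epsilon_1^2]/\sigma_y^2=1-\rho_{y,x_1}^2$, as required.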
For the induction step, assume the statement holds for $n=k$. Write $\{x\}_k$ for the set $\{x_1,x_2,...,x_k\}$ and $\beta_{a,b(S)}$ for the partial regression coefficient of $a$ on $b$ controlling for the variables in $S$. Starting from the regression of $y$ on all $k+1$ variables, add and subtract the fitted regression of $x_{k+1}$ on the first $k$ variables:
$$y=\sum_i^k\left(\beta_{y,x_i(\{x\}_{k+1}\setminus \{x_i\})}x_i\right)+\beta_{y,x_{k+1}(\{x\}_k)}\left(x_{k+1}-\sum_i^k\beta_{x_{k+1}x_i(\{x\}_k\setminus \{x_i\})}x_i\right)+\beta_{y,x_{k+1}(\{x\}_k)}\sum_i^k\beta_{x_{k+1}x_i(\{x\}_k\setminus \{x_i\})}x_i+\epsilon_{k+1}$$
$$y-\sum_i^k\left(\beta_{y,x_i(\{x\}_{k+1}\setminus \{x_i\})}x_i\right)-\beta_{y,x_{k+1}(\{x\}_k)}\sum_i^k\beta_{x_{k+1}x_i(\{x\}_k\setminus \{x_i\})}x_i=\beta_{y,x_{k+1}(\{x\}_k)}\left(x_{k+1}-\sum_i^k\beta_{x_{k+1}x_i(\{x\}_k\setminus \{x_i\})}x_i\right)+\epsilon_{k+1}$$
$$\epsilon_k=\beta_{y,x_{k+1}(\{x\}_k)}\left(x_{k+1}-\sum_i^k\beta_{x_{k+1}x_i(\{x\}_k\setminus \{x_i\})}x_i\right)+\epsilon_{k+1}$$
where we have used
$$\sum_i^k\left(\beta_{y,x_i(\{x\}_{k+1}\setminus \{x_i\})}x_i\right)+\beta_{y,x_{k+1}(\{x\}_k)}\sum_i^k\beta_{x_{k+1}x_i(\{x\}_k\setminus x_i)}x_i=\sum_i^k\left(\beta_{y,x_i(\{x\}_{k}\setminus \{x_i\})}x_i\right)$$
This is an intuitive result, but it could presumably be proved by induction as well. We have essentially collapsed the $n=k+1$ regression model into a one-variable regression of $\epsilon_k$ on the residual of $x_{k+1}$ after regressing it on $x_1,x_2,...,x_k$. Applying the base-case ($n=1$) argument to this one-variable regression, we see that
$$E[\epsilon_{k+1}^2]=E[\epsilon_k^2]\left(1-\rho_{y,x_{k+1}(x_1,x_2,...,x_k)}^2\right)$$
Using the relation between the determination coefficient and the error terms we get:
$$1-R_{y(x_1,x_2...x_{k+1})}^2=\left(1-R_{y(x_1,x_2...x_{k})}^2\right)\left(1-\rho_{y,x_{k+1}(x_1,x_2,...,x_k)}^2\right)$$
Then, applying the induction hypothesis to the first factor on the RHS, we get the desired result:
$$1-R_{y(x_1,x_2...x_{k+1})}^2=\left(1-\rho_{y,x_1}^2\right)\left(1-\rho_{y,x_2(x_1)}^2\right)\left(1-\rho_{y,x_3(x_1,x_2)}^2\right)\,\cdots\,\left(1-\rho_{y,x_{k+1}(x_1,x_2,...,x_k)}^2\right)$$
Hence, if the statement is true for $n=k$, it is true for $n=k+1$; since it is true for $n=1$, it is true for all positive integers $n$.
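As a numerical sanity check of the key recursion $E[\epsilon_{k+1}^2]=E[\epsilon_k^2]\left(1-\rho_{y,x_{k+1}(x_1,...,x_k)}^2\right)$, here is a quick NumPy sketch for $k=2$ (the helper `ols_resid` is mine; the partial correlation is computed as the correlation between the two sets of OLS residuals):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Correlated regressors x1, x2, x3; y depends linearly on all three plus noise.
X = rng.standard_normal((n, 3))
X[:, 2] += 0.5 * X[:, 0]  # make x3 correlated with x1
y = X @ np.array([1.0, 0.5, -1.0]) + rng.standard_normal(n)

def ols_resid(t, Z):
    """Residual of an OLS fit (with intercept) of t on the columns of Z."""
    Z1 = np.column_stack([np.ones(len(t)), Z])
    return t - Z1 @ np.linalg.lstsq(Z1, t, rcond=None)[0]

eps_k = ols_resid(y, X[:, :2])    # epsilon_k: error from regressing y on x1, x2
eps_k1 = ols_resid(y, X[:, :3])   # epsilon_{k+1}: error after adding x3
# Partial correlation of y and x3 controlling for x1, x2 (residual on residual).
rho = np.corrcoef(eps_k, ols_resid(X[:, 2], X[:, :2]))[0, 1]
print(np.mean(eps_k1**2), np.mean(eps_k**2) * (1.0 - rho**2))
```

The two printed mean squared errors coincide, which is exactly the recursion used in the induction step (for the in-sample OLS quantities it is an algebraic identity, essentially the Frisch–Waugh–Lovell theorem).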