If the coefficient of determination is a measure based on variance, then what about standard deviation instead?


I originally posted this question: https://stats.stackexchange.com/questions/435489 but haven't got an answer. As I'm interested in the Mathematical reasoning, I thought I would post here as well.

I'm examining the Coefficient of Determination ($r^2$) for linear correlation/regression. The unexplained variation, $1-r^2=\frac{Var(y-\hat{y})}{Var(y)}$, is a ratio of variances. My question is: why use Variance and not Standard Deviation here? Variance is a somewhat abstract measure since its value has no specific meaning, whereas Standard Deviation is a more concrete measure of how a given variable varies.

Since $Var(x)=\sigma_x^2$, then $$\frac{\sigma_{y-\hat{y}}}{\sigma_y}=\sqrt{1-r^2}$$ should be the unexplained Standard Deviation. However, $\sqrt{1-r^2}$ is always larger than $1-r^2$ whenever $0<|r|<1$ (and doesn't look as good). In fact, as a function of $r$, the explained SD would be $1-\sqrt{1-r^2}$, which is a quarter circle, compared to the explained Variance $r^2$, which is a parabola.

Example: if $r=0.9$ then $r^2=0.81$ and so there is a 19% unexplained variance. However, $\sqrt{0.19}\approx0.436$, meaning there is a whopping 43.6% unexplained Standard Deviation (or there is only 56.4% explained by the model), which seems quite bad given the original correlation was quite good.
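The arithmetic in this example can be checked with a minimal Python sketch. The function name `unexplained_fractions` and the extra values of $r$ are illustrative choices, not part of the original question:

```python
import math

def unexplained_fractions(r):
    """Return (unexplained variance fraction, unexplained SD fraction) for correlation r."""
    var_frac = 1.0 - r ** 2        # 1 - r^2: ratio of variances
    sd_frac = math.sqrt(var_frac)  # sqrt(1 - r^2): ratio of standard deviations
    return var_frac, sd_frac

# Hypothetical sample correlations, including the r = 0.9 case from the example.
for r in (0.5, 0.9, 0.99):
    var_frac, sd_frac = unexplained_fractions(r)
    print(f"r={r}: unexplained variance={var_frac:.3f}, unexplained SD={sd_frac:.3f}")
```

For any $r$ strictly between $0$ and $1$ in absolute value, the second number is larger than the first, since the square root of a number in $(0,1)$ exceeds the number itself.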

So, I guess this explains why we don't use Standard Deviation in place of Variance, but then that just leads to the question of why doesn't a ratio of Standard Deviation work well here? It seems a reasonable thing to calculate, but clearly it doesn't give a reasonable measure of how well the data fits a linear model.

There are 1 best solutions below

You say

if $r=0.9$ then $r^2=0.81$ and so there is a 19% unexplained variance. However, $\sqrt{0.19}\approx0.436$, meaning there is a whopping $43.6\%$ unexplained Standard Deviation (or there is only $56.4\%$ explained by the model)

but since $\sqrt{0.81}=0.9$, you might equally have said that $90\%$ of the standard deviation is explained by the model. That too would have been wrong, and it illustrates why we don't use standard deviation in place of variance: the explained and unexplained parts would not add up to $100\%$.

As to why the unsquared numbers do not add up when the squares do, consider a general case with $a,b,c$ all positive and $$a^2+b^2=c^2,$$ where in your example $a^2=0.81, b^2=0.19, c^2=1$. You then have $$(a+b)^2=a^2+2ab+b^2 = c^2+2ab > c^2$$ and so $$a+b > c$$ (in your example $a+b\approx 0.9+0.436 = 1.336 >1$).
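To see the same effect on data, here is a minimal sketch in Python using only the standard library. The simulated data, the seed, the slope of 2, and the helper name `fit_and_residuals` are all assumptions made for illustration. It fits an ordinary least-squares line with intercept and compares how variances and standard deviations combine:

```python
import random
import statistics

def fit_and_residuals(x, y):
    """Ordinary least-squares line with intercept; returns (fitted values, residuals)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, resid

# Hypothetical data: y depends linearly on x plus independent noise.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(1000)]
y = [2 * xi + rng.gauss(0, 1) for xi in x]

fitted, resid = fit_and_residuals(x, y)

var_y = statistics.pvariance(y)
var_fit = statistics.pvariance(fitted)   # plays the role of a^2 (explained)
var_res = statistics.pvariance(resid)    # plays the role of b^2 (unexplained)
sd_y, sd_fit, sd_res = (v ** 0.5 for v in (var_y, var_fit, var_res))

print("variances:", var_fit + var_res, "vs", var_y)   # equal up to rounding
print("std devs: ", sd_fit + sd_res, "vs", sd_y)      # the sum exceeds sd_y
```

The variances add because the least-squares residuals are uncorrelated with the fitted values, which is exactly the $a^2+b^2=c^2$ situation above; the standard deviations then behave like the sides of a right triangle, so their sum overshoots.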