Coefficient of determination comes out greater than 1


Hello, I would like to try something. I have two vectors:

a = [-1,4,9,16,-25,36,49,64,81,100]
b = [80,60,12,52,74,325,146,17,745,54]

And I would like to look at the coefficient of determination, taking $a$ as the model and $b$ as the data, for instance.

If I use this formula:

$$\frac{\sum_{i=0}^{9}(a(i)-\bar{b})^2}{\sum_{i=0}^{9}(b(i)-\bar{b})^2} $$

I get 0.36, so the fit is not good, but at least the value is plausible, because the coefficient of determination must be between 0 and 1.

But now in the opposite case, taking $a$ as the data and $b$ as the model, and using the formula:

$$\frac{\sum_{i=0}^{9}(b(i)-\bar{a})^2}{\sum_{i=0}^{9}(a(i)-\bar{a})^2} $$

I get 42.85, which is really strange, because the coefficient of determination must be between 0 and 1. Basically, if it is between 0.95 and 1 the model is good; otherwise it is not. But in my case the coefficient of determination is higher than 1, so there is a problem.

Thank you very much for your help!

PS: of course the abscissas are the same!
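For reference, the two ratios above can be reproduced numerically. Here is a quick Python sketch (used purely as a calculator; the variable names are just for illustration). Note that nothing in this ratio forces the numerator to be smaller than the denominator, so it is not bounded above by 1 — it is not the standard definition of $R^2$.

```python
# Quick numerical check of the two ratios from the question.
a = [-1, 4, 9, 16, -25, 36, 49, 64, 81, 100]
b = [80, 60, 12, 52, 74, 325, 146, 17, 745, 54]

mean_a = sum(a) / len(a)   # 33.3
mean_b = sum(b) / len(b)   # 156.5

# First ratio: a as "model", b as "data"
r1 = sum((ai - mean_b) ** 2 for ai in a) / sum((bi - mean_b) ** 2 for bi in b)
# Second ratio: b as "model", a as "data"
r2 = sum((bi - mean_a) ** 2 for bi in b) / sum((ai - mean_a) ** 2 for ai in a)

print(round(r1, 2), round(r2, 2))  # 0.36 42.85
```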

1 Answer

I find the use of 'model' and 'data' to be confusing, and I do not understand the notation in your two displayed formulas.

However, the coefficient of determination is the square of the correlation $r.$ And the correlation between $a$ and $b$ is the same as the correlation between $b$ and $a.$
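To make the symmetry concrete, here is a small Python sketch computing $r$ directly from the definition $r = S_{xy}/\sqrt{S_{xx}\,S_{yy}}$ (the helper function name is just for illustration); swapping the arguments changes nothing, since every term in the formula is symmetric in $x$ and $y$:

```python
from math import sqrt

a = [-1, 4, 9, 16, -25, 36, 49, 64, 81, 100]
b = [80, 60, 12, 52, 74, 325, 146, 17, 745, 54]

def pearson(x, y):
    """Sample Pearson correlation, computed from the definition."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

r = pearson(a, b)
print(round(r, 6), round(pearson(b, a), 6))  # 0.406116 0.406116
print(round(r ** 2, 7))                      # 0.1649302
```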

Specifically for your data, the correlation is $r_{a,b} = r_{b,a} = 0.406$ and the coefficient of determination is $r^2 = 0.1649.$ My scratchwork in R statistical software is shown below (maybe check for typos in the data). Also, you should check your textbook for the formula used to compute $r$ (the notation will likely be in terms of $x$ and $y,$ not $a$ and $b$).

b = c(80,60,12,52,74,325,146,17,745,54)
a = c(-1,4,9,16,-25,36,49,64,81,100)
cor(a,b)
## 0.406116
cor(b,a)
## 0.406116
cor(a,b)^2
## 0.1649302

Note: If you are talking about the regression of $a$ on $b$ vs. the regression of $b$ on $a$, then the regression lines will not be the same. Each will have its own estimated slope and intercept. However, the $r^2$ values (sometimes denoted R-SQ in regression printout) will be the same for both regressions.
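The claim that both regressions share the same $r^2$ can also be checked from the least-squares slope formulas (a Python sketch, not the R session below): the slope of $b$ on $a$ is $S_{xy}/S_{xx}$, the slope of $a$ on $b$ is $S_{xy}/S_{yy}$, and the product of the two slopes is exactly $S_{xy}^2/(S_{xx}S_{yy}) = r^2$:

```python
a = [-1, 4, 9, 16, -25, 36, 49, 64, 81, 100]
b = [80, 60, 12, 52, 74, 325, 146, 17, 745, 54]

mx, my = sum(a) / len(a), sum(b) / len(b)
sxy = sum((ai - mx) * (bi - my) for ai, bi in zip(a, b))
sxx = sum((ai - mx) ** 2 for ai in a)
syy = sum((bi - my) ** 2 for bi in b)

slope_b_on_a = sxy / sxx  # slope when regressing b on a
slope_a_on_b = sxy / syy  # slope when regressing a on b
r_squared = slope_b_on_a * slope_a_on_b  # product of the two slopes is r^2

print(round(slope_b_on_a, 3), round(slope_a_on_b, 5))  # 2.304 0.07158
print(round(r_squared, 4))                             # 0.1649
```

The two slopes match the `Estimate` column of the two R regression summaries, while the product recovers the shared `Multiple R-squared` value.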

Here is output for both regressions from R statistical software: First, regression of $a$ on $b$ (attempting to predict a-values from b-values).

summary(lm(a ~ b))

Call:
lm(formula = a ~ b)

Residuals:
   Min     1Q Median     3Q    Max 
-52.40 -20.28  -9.59  13.73  74.04 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.09845   15.10290   1.463    0.182
b            0.07158    0.05694   1.257    0.244

Residual standard error: 38.56 on 8 degrees of freedom
Multiple R-squared: 0.1649,     Adjusted R-squared: 0.06055 
F-statistic:  1.58 on 1 and 8 DF,  p-value: 0.2442 

Notice that 'Multiple R-squared' = $r^2 = 0.1649,$ as above. ('Adjusted R-squared' is something else.) Now for regression of $b$ on $a$.

summary(lm(b ~ a))

Call:
lm(formula = b ~ a)

Residuals:
    Min      1Q  Median      3Q     Max 
-256.20  -82.54  -37.83   39.51  478.59 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   79.767     92.267   0.865    0.412
a              2.304      1.833   1.257    0.244

Residual standard error: 218.8 on 8 degrees of freedom
Multiple R-squared: 0.1649,     Adjusted R-squared: 0.06055 
F-statistic:  1.58 on 1 and 8 DF,  p-value: 0.2442 

Once again, as claimed: $r^2 = 0.1649.$