Does the R-Squared value determine the variance in the model for all data points?


I understand that the R-squared value measures how much of the variability in the response variable is explained by the regression model (I think).

I am confused about whether the R-squared value also gives the proportion of variance explained between any two data points.

enter image description here

For example, if we take the two points marked above, will the bit marked x account for 70% of the variability (assuming R-squared is 0.7), and will the blue parts be the residual 30%? Or is this not correct?

I am confused because my teacher said that the R-squared value gives the variance between two points accounted for by the model, but wouldn't this be impossible for every possible pair of points?


There is 1 answer below


Consider a simple linear regression model $$Y_i = \beta_0 + \beta_1x_i + e_i,$$ for $i = 1, 2, \dots, n,$ with $e_i \sim \mathsf{Norm}(0, \sigma_e).$ Let $S_Y^2$ be the sample variance of the $Y_i$ (divisor $n-1$) and $S_{Y|x}^2$ be the variance of the residuals $r_i = Y_i - \hat Y_i$ (divisor $n-2$).

Then $$S_{Y|x}^2 = \frac{n-1}{n-2}S_Y^2(1 - r^2),$$ where the coefficient of determination $r^2$ is often printed in computer results as R-sq.
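The identity can be checked numerically. Below is a minimal sketch in plain Python; the simulated data, the seed, and the helper names `fit_line`, `r_squared`, and `residual_variance` are all assumptions made for illustration, not part of the original answer.

```python
import random
import statistics


def fit_line(x, y):
    """Return the least-squares (intercept, slope) for simple linear regression."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(x, y):
    """Squared sample correlation between x and y."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def residual_variance(x, y):
    """Variance of the residuals about the fitted line, with divisor n - 2."""
    b0, b1 = fit_line(x, y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return sse / (len(x) - 2)


# Hypothetical data: true line 2 + 0.5 x with normal errors (sigma_e = 1).
random.seed(1)
n = 50
x = [random.uniform(0, 10) for _ in range(n)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1.0) for xi in x]

r2 = r_squared(x, y)
lhs = residual_variance(x, y)                             # S_{Y|x}^2
rhs = (n - 1) / (n - 2) * statistics.variance(y) * (1 - r2)

print(f"r^2 = {r2:.4f}")
print(f"S_Y|x^2 = {lhs:.6f}, formula = {rhs:.6f}")
```

The two printed values should agree up to floating-point rounding, whatever seed or sample size is used.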

If $r^2 \approx 0,$ then $S_{Y|x}^2 \approx S_Y^2,$ so the regression is of little value in 'explaining' the variability of the $Y_i$: the scatter about the regression line is about as large as the overall scatter of the $Y_i.$

By contrast, if $r^2 \approx 1,$ then $S_{Y|x}^2 \approx 0,$ so that most of the $(x_i, Y_i)$ lie on or very near the regression line. This is the basis of the rough statement that "$r^2$ is the proportion of the variance among the $Y_i$'s that is explained by regression on $x.$"

The absolute difference between the two values in your graph depends mainly on the slope $\beta_1$ of the true line and on $S_Y^2.$ As long as $\beta_1 \ne 0,$ the slope depends on the units used, and not directly on the correlation. So $r^2$ is a summary over all $n$ points together, not a statement about any particular pair of points.
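To illustrate the point about units, here is a small sketch in plain Python. The data are simulated and the metre/centimetre interpretation is purely hypothetical; `slope_and_r2` is an illustrative helper name. Rescaling $x$ changes the fitted slope by the same factor but leaves $r^2$ unchanged.

```python
import random


def slope_and_r2(x, y):
    """Return (least-squares slope, squared correlation) for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# Hypothetical data: x measured in metres, y = 3 + 4 x plus normal noise.
random.seed(2)
x_m = [random.uniform(1, 2) for _ in range(40)]
y = [3.0 + 4.0 * xi + random.gauss(0, 0.5) for xi in x_m]

# The same x values expressed in centimetres.
x_cm = [100 * xi for xi in x_m]

slope_m, r2_m = slope_and_r2(x_m, y)
slope_cm, r2_cm = slope_and_r2(x_cm, y)

print(f"slope per metre      = {slope_m:.4f}, r^2 = {r2_m:.4f}")
print(f"slope per centimetre = {slope_cm:.6f}, r^2 = {r2_cm:.4f}")
```

The vertical gap between two fitted values is the slope times the gap in $x,$ so it follows the slope and the chosen units, while $r^2$ stays the same under the change of units.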