Is it possible to obtain a value for errors in measured quantities from R squared regression value?


I conducted an experiment that I could not repeat; however, it gave me a series of 15 points $(x_i,y_i)$. I applied a linear regression to these data and obtained a high value of $R^2=0.9991$.

Although I was not able to repeat my measurements, I know that they were subject to some random error, and I was wondering whether it is possible to work backwards from the $R^2$ value to an estimate of the error in each data point.

Unfortunately I am not very experienced when it comes to statistics, and what I have read about this so far has been very confusing.


Suppose your simple linear regression model is $$Y_i = \beta_0 + \beta_1x_i + e_i,$$ where the $e_i$ are iid $\mathrm{Norm}(0, \sigma_e)$ (normal with standard deviation $\sigma_e$). Suppose also that you have $n$ data pairs $(x_i, Y_i),$ which you use to find estimates $\hat \beta_0$ of $\beta_0,\,$ $\hat \beta_1$ of $\beta_1,$ and $s_e^2$ of $\sigma_e^2.$ Then the regression line passes through the $n$ points $(x_i, \hat Y_i),$ where $\hat Y_i = \hat \beta_0 + \hat \beta_1 x_i.$

A 95% confidence interval for a fitted value $\hat Y_i$ corresponding to $x_i$ is of the form $$\hat Y_i \pm t^*s_e\sqrt{\frac{1}{n} + \frac{(x_i - \bar x)^2}{(n-1)s_x^2}},$$ where $\bar x$ and $s_x^2$ are the sample mean and variance, respectively, of the $x_i,\,$ $t^*$ cuts area .025 from the upper tail of Student's t distribution with $n-2$ degrees of freedom, and $s_e$ is the above-mentioned estimate of $\sigma_e$ from the regression model. The corresponding variance estimate can be found as $$s_e^2 = \frac{\sum_{i=1}^n(Y_i - \hat Y_i)^2}{n-2} = \frac{n-1}{n-2}s_Y^2(1 - r^2).$$
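To make the computation concrete, here is a minimal sketch in Python. The data are synthetic stand-ins for your 15 measured points (the true line and noise level are invented for illustration); NumPy and SciPy are assumed to be available. It computes $s_e^2$ both directly from the residuals and via the $r^2$ shortcut above, then the 95% CI half-widths for the fitted values.

```python
import numpy as np
from scipy import stats

# Synthetic data standing in for the 15 measured points (hypothetical)
rng = np.random.default_rng(0)
n = 15
x = np.linspace(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.1, n)

# Least-squares fit: polyfit with degree 1 returns (slope, intercept)
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# s_e^2 directly from the residuals ...
se2_direct = np.sum((y - y_hat) ** 2) / (n - 2)

# ... and from the shortcut s_e^2 = (n-1)/(n-2) * s_Y^2 * (1 - r^2)
r = np.corrcoef(x, y)[0, 1]
se2_from_r = (n - 1) / (n - 2) * np.var(y, ddof=1) * (1 - r**2)

# 95% CI half-width for each fitted value
t_star = stats.t.ppf(0.975, df=n - 2)
half_width = t_star * np.sqrt(se2_direct) * np.sqrt(
    1 / n + (x - x.mean()) ** 2 / ((n - 1) * np.var(x, ddof=1))
)
```

The two routes to $s_e^2$ agree (up to floating-point rounding), which is the point of the algebraic identity: if you kept full precision for $s_Y^2$ and $r^2$, you could recover $s_e^2$ without refitting.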

If you did the regression procedure using statistical software, then $s_e$ or $s_e^2$ may have been part of the regression printout. In the right-hand expression above, $s_Y^2$ is the sample variance of the $Y_i$ and $r^2$ (possibly R-sq or '$R^2$' in the regression printout) is the square of the correlation $r$ between the $x_i$ and $Y_i.$ [If you use that expression for computation, you must not round anything off before you get the result $s_e^2.$]

So you can use the $R^2$ value you mentioned to get confidence intervals for your fitted points, provided you have the original data available so that the other quantities in the formula can also be computed.

Notice that errors become larger as the $x_i$s get farther from their average $\bar x.$ This is because the regression line must pass through the point $(\bar x, \bar Y),$ and any error in estimating the slope $\beta_1$ of the regression line is exaggerated as one moves away from this 'center of the data cloud'.
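This widening can be checked numerically. The sketch below (hypothetical $x$ values) isolates the square-root factor of the CI half-width and evaluates it at increasing distances from $\bar x$:

```python
import numpy as np

# Hypothetical design points
n = 15
x = np.linspace(0, 10, n)
sx2 = np.var(x, ddof=1)


def ci_factor(x0):
    """Square-root factor of the CI half-width at the point x0."""
    return np.sqrt(1 / n + (x0 - x.mean()) ** 2 / ((n - 1) * sx2))


# Smallest at the mean, growing as x0 moves away from it
factors = [ci_factor(x0) for x0 in (x.mean(), x.mean() + 2, x.mean() + 5)]
```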

Notes: (1) The formula for the confidence interval (CI) above is fine if you are talking about errors in the $\hat Y_i$s from the original data. If you are trying to predict the value $Y_{n+1}$ corresponding to an additional $x_{n+1},$ not used in finding the regression line, then the prediction interval (PI) is somewhat wider: $$\hat Y_{n+1} \pm t^*s_e\sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar x)^2}{(n-1)s_x^2}}.$$
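A sketch of the prediction interval, continuing with synthetic data (the new point $x_{n+1}=12$ is hypothetical); note the extra $1$ under the square root, which keeps the PI wider than the CI at the same $x$:

```python
import numpy as np
from scipy import stats

# Synthetic data standing in for the original 15 points (hypothetical)
rng = np.random.default_rng(1)
n = 15
x = np.linspace(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.1, n)

b1, b0 = np.polyfit(x, y, 1)
se2 = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)
t_star = stats.t.ppf(0.975, df=n - 2)

# 95% prediction interval at a new x value not used in the fit
x_new = 12.0
pi_half = t_star * np.sqrt(se2) * np.sqrt(
    1 + 1 / n + (x_new - x.mean()) ** 2 / ((n - 1) * np.var(x, ddof=1))
)
lo, hi = b0 + b1 * x_new - pi_half, b0 + b1 * x_new + pi_half
```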

(2) Almost all of the formulas I have given here are standard ones found in most elementary statistics texts, except that I have tried to avoid unnecessary terminology and symbols. I am not surprised that you had difficulty finding just what you needed in your own browsing; I hope it helps to have just the formulas and symbols you actually need for your task. If you need further assistance, please edit your question to include two separate lists, your 15 x-values and your 15 Y-values, each on its own row with values separated by commas. Then leave me a Comment, and I will try to answer within a day or two.