Regression line from experimental data with uncertainties: calculating the $\sigma$'s of the regression-line coefficients


I have $10$ values with corresponding uncertainties ($\sigma$), i.e. $y \pm \sigma_y$; the $x$ value is the timestamp of the measurement, so there is no $\sigma_x$. I can easily calculate the regression line, but I also need to calculate the errors of the coefficients $A, B$ of the regression line $Ax + B = y$. I have some formulas I usually use, but they are based on estimating the variance of the regression line from the experimental data, like

$$\sigma_y = \sqrt{ \frac{\sum {(y - Ax - B)^2}}{n-2}}$$ $$ \sigma_A = \sigma_y\sqrt{\frac{n}{\Delta}} $$ $$\sigma_B = \sigma_y \sqrt{\frac{\sum{x^2}}{\Delta}}$$ $$\Delta = n\sum{x^2}-(\sum{x})^2$$

These formulas use only the values of the experimental data ($y$) but ignore the errors of the experimental data, and I do need to propagate the $\sigma_y$. How can I do that? Thank you.
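In code, the formulas above amount to the following minimal sketch (the data below are made up for illustration; $\sigma_A$ is taken as the slope uncertainty and $\sigma_B$ as the intercept uncertainty, matching the model $Ax + B = y$):

```python
import numpy as np

# Hypothetical data: 10 measurements y_i at timestamps x_i
x = np.arange(10.0)
y = np.array([1.1, 3.2, 4.8, 7.1, 9.0, 10.8, 13.2, 14.9, 17.1, 19.0])

n = len(x)
Delta = n * np.sum(x**2) - np.sum(x)**2

# Ordinary least-squares fit y = A*x + B
A = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / Delta
B = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / Delta

# Scatter-based estimate of sigma_y from the residuals,
# then the coefficient uncertainties from the quoted formulas
sigma_y = np.sqrt(np.sum((y - A * x - B)**2) / (n - 2))
sigma_A = sigma_y * np.sqrt(n / Delta)              # slope uncertainty
sigma_B = sigma_y * np.sqrt(np.sum(x**2) / Delta)   # intercept uncertainty
```

Note that this $\sigma_y$ is derived from the scatter of the residuals, not from the per-point uncertainties quoted with the measurements, which is exactly the issue.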

You may be trying to use simple linear regression for unintended purposes. I will try to describe how simple linear regression handles 'random error'.

Model: From the equations you give, I believe you are using the following regression model:

$$Y_i = \alpha x_i + \beta + e_i,$$ for $i = 1, 2, \dots, n,$ where $x_i$ are known constants, $e_i$ are independent 'errors' distributed as $\mathsf{Norm}(0, \sigma),$ and $\alpha, \beta,$ and $\sigma$ are unknown parameters to be estimated. Notice that $\sigma$ is the same for all values of $i.$ One may think of this equation as a way to express values of $Y$ in terms of a linear relationship with corresponding $x$'s.
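This model can be simulated directly; the sketch below draws the $e_i$ as iid normal errors with a common $\sigma$ (the parameter values are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters, for illustration only
alpha, beta, sigma = 2.0, 1.0, 0.5

x = np.arange(10.0)                       # known constants (e.g. timestamps)
e = rng.normal(0.0, sigma, size=x.size)   # iid Norm(0, sigma) errors
Y = alpha * x + beta + e                  # simulated responses
```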

Parameter estimates: The slope $\alpha$ is estimated by $$\hat \alpha = A = S_{xY}/S_{xx},$$ the intercept $\beta$ is estimated by $$\hat \beta = B = \bar Y - A\bar x,$$ and the error variance $\sigma^2$ is estimated by $$\hat \sigma^2 = S_{Y|x}^2 = \frac{1}{n-2}\sum_{i=1}^n (Y_i - \hat Y_i)^2,$$ where $\bar x = \frac 1 n \sum_{i=1}^n x_i,\,$ $\bar Y = \frac 1 n \sum_{i=1}^n Y_i,\,$ $S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2,\,$ $S_{xY} = \sum_{i=1}^n (x_i - \bar x)(Y_i - \bar Y),\,$ and $\hat Y_i = Ax_i + B.$
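These estimates translate directly into code; the data below are hypothetical, used only to exercise the formulas:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.arange(10.0)
Y = np.array([1.1, 3.2, 4.8, 7.1, 9.0, 10.8, 13.2, 14.9, 17.1, 19.0])
n = len(x)

xbar, Ybar = x.mean(), Y.mean()
S_xx = np.sum((x - xbar)**2)
S_xY = np.sum((x - xbar) * (Y - Ybar))

A = S_xY / S_xx            # slope estimate
B = Ybar - A * xbar        # intercept estimate
Yhat = A * x + B           # fitted values
sigma2_hat = np.sum((Y - Yhat)**2) / (n - 2)   # error-variance estimate
```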

Strictly speaking, the $x_i$'s are regarded as being observed without appreciable error. Typically they may be times (in hours, days, years, etc.), dosages, temperatures, and so on. The $Y_i$'s (perhaps earnings, tumor shrinkage, octane) are regarded as being mainly a linear function of the $x_i$'s, with errors representing some combination of measurement errors in $Y$ and other unexplained effects.

Estimation: As shown above, the random quantities $A, B,$ and $\hat \sigma$ are estimates of the unknown constants $\alpha, \beta,$ and $\sigma.$ One can find confidence intervals for $\alpha$ and $\beta$ in terms of the standard errors of $A$ and $B$ (as in your Question), using Student's t distribution with $n - 2$ degrees of freedom. And one can find a confidence interval for $\sigma$ using the fact that $(n-2)\hat\sigma^2/\sigma^2 \sim \mathsf{Chisq}(df=n-2).$
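A minimal sketch of those $t$-based confidence intervals follows, with the same hypothetical data as before; the value $t^* = 2.306$ is the 97.5th percentile of Student's t with $8$ degrees of freedom, hard-coded here to avoid a SciPy dependency:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.arange(10.0)
Y = np.array([1.1, 3.2, 4.8, 7.1, 9.0, 10.8, 13.2, 14.9, 17.1, 19.0])
n = len(x)

xbar = x.mean()
S_xx = np.sum((x - xbar)**2)
A = np.sum((x - xbar) * (Y - Y.mean())) / S_xx   # slope estimate
B = Y.mean() - A * xbar                          # intercept estimate
sigma_hat = np.sqrt(np.sum((Y - A * x - B)**2) / (n - 2))

# Standard errors of the slope and intercept
se_A = sigma_hat / np.sqrt(S_xx)
se_B = sigma_hat * np.sqrt(1.0 / n + xbar**2 / S_xx)

# 95% confidence intervals via Student's t with n - 2 = 8 df
t_star = 2.306   # t_{0.975, 8}, from a t table
ci_alpha = (A - t_star * se_A, A + t_star * se_A)
ci_beta = (B - t_star * se_B, B + t_star * se_B)
```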

Typically, various 'diagnostic procedures' are used to assess whether the assumptions of linearity, independence, normal distribution, and constant variability across values of $i$ are reasonable. (If $A$ is not significantly different from $0,$ the data do not show a useful linear relationship for predicting $Y$'s from $x$'s.)

Modeling variability: In case the $x_i$ are subject to minor measurement errors, one may feel comfortable supposing that these errors are somehow adequately expressed as a component of $\sigma^2$ and that one may still use the regression line $Y = Ax + B$ to predict values of $Y$ corresponding to newly observed values of $x.$ Most elementary texts that cover 'simple linear regression' have a formula for 'prediction intervals' for values of $Y$ predicted in this way:

$$Y_{n+1} \pm t^*S_{Y|x}\sqrt{1 + \frac 1 n + \frac{(x_{n+1} - \bar x)^2}{S_{xx}}},$$

where $t^*$ is a percentage point of Student's t distribution with $n-2$ degrees of freedom that reflects the desired level of confidence.

Under the radical in the margin for prediction error in the formula above: the $1$ represents the additional variability in a new observation not used to determine the regression line, the $\frac 1 n$ represents the component of error due to estimation error in the $Y$-intercept $B,$ and the third term represents the component of error due to estimation error in the slope $A.$ Notice that the effect of an error in slope is magnified if the new $x_{n+1}$ is far from the mean $\bar x$ of the previously observed $x_i$ (relative to the spread of those $x_i$). [The regression line $Y = Ax + B$ passes through the 'center of gravity' $(\bar x, \bar Y)$ of the data cloud, where there is no error due to estimating $\alpha$ by $A$.]
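The prediction interval can be sketched as below (same hypothetical data as before; $t^* = 2.306$ is again the 97.5th percentile for $n - 2 = 8$ degrees of freedom, giving a 95% interval):

```python
import numpy as np

# Hypothetical data; we predict Y at a new timestamp x_new
x = np.arange(10.0)
Y = np.array([1.1, 3.2, 4.8, 7.1, 9.0, 10.8, 13.2, 14.9, 17.1, 19.0])
n = len(x)

xbar = x.mean()
S_xx = np.sum((x - xbar)**2)
A = np.sum((x - xbar) * (Y - Y.mean())) / S_xx
B = Y.mean() - A * xbar
s = np.sqrt(np.sum((Y - A * x - B)**2) / (n - 2))   # S_{Y|x}

x_new = 12.0                 # beyond the observed range, so the
Y_pred = A * x_new + B       # slope-error term widens the interval
t_star = 2.306               # t_{0.975, 8}, from a t table
margin = t_star * s * np.sqrt(1 + 1.0 / n + (x_new - xbar)**2 / S_xx)
interval = (Y_pred - margin, Y_pred + margin)
```

Because $x_{n+1} = 12$ lies outside the observed $x$ range, the third term under the radical is appreciable, illustrating the magnification of slope error far from $\bar x$.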

Variability not accounted for: By contrast, if errors in measuring $x$ are appreciable, or if the other assumptions mentioned above do not hold, then a different kind of regression model should be used.

The bottom line is that the error term $e_i$ is used to express imprecision in measuring $Y$ or in the assumed relationship of $Y$ to $x.$ If this is not an appropriate way to account for error, there is no use arguing with or trying to 'adjust' the simple linear regression model. One simply needs to use a model that does adequately express the variability of interest.