How to quantify how much my data resembles a linear relationship?

556 Views Asked by At

I have a bunch of data points in Excel for different test subjects that are each represented by unique colors in the following graph:

Graph of different subjects each colored uniquely

I wish to quantify in Excel what seems obvious to my eyes, that the orange data set is linear, whereas most of the other datasets appear to be non-linear, perhaps quadratic in form. It appears that the Coefficient of determination may do what I want, but that seems more like a statistical method, perhaps there is another method better suited to the data shown here that appears more of a quadratic form versus a linear form. How can I quantify this relationship that seems so obvious to one's eye when viewing the data points?

1

There are 1 best solutions below

2
On BEST ANSWER

CORRELATION. The simplest thing for the orange data would be to compute the (ordinary Pearson) sample correlation $r$ of x and Y. Roughly speaking, it measures the linear component of the association. You can probably do that quickly with software. A value very near 1 indicates an almost perfect fit to a line. The coefficient of determination is the square $r^2$ of the correlation. Even if the association were negative, not here, the coefficient of determination would be positive. For any data, $-1 \le r \le 1$ and $0 \le r^2 \le 1.$ (There are statistical tests whether the underlying population correlation is 0, if you want to get into that.)

For data points of other colors you can find the Spearman correlation $r_s.$ Roughly speaking, it measures the regularity with which y increases when x increases (without regard to linearity). For most, if not all, of the colors, it looks as if the Spearman correlation will be very nearly 1.

LINEAR REGRESSION. Another approach for the orange data is to fit a regression line through the data. The fit should be almost exact (the 3rd and/or 4th points may miss by a bit). The model is $$Y_i = \beta_0 + \beta_1 x_i + e_i.$$ The regression procedure will give you estimates of the y-intercept $\beta_0$ (perhaps slightly negative) and the slope $\beta_1$. The model assumes the random error $e_i$ is normally distributed with mean 0. The estimate of its variance will be very small.

QUADRATIC REGRESSION. For the other colors, you could do a quadratic regression. For each color separately, the model is $$Y_i = \beta_0 + \beta_1 X_i + \beta_2 x_i^2 + e_i.$$ Again here, you will get estimates for the $\beta$'s. I imagine you will find that the fit to this model will be very good, and that the fit will be much worse if you remove the quadratic term $\beta_2 x_i^2.$

Notes: (1) It is difficult to tell from your graph, but it looks as if the orange line and the other curves might logically go through the point $(0,0).$ If this is a natural constraint for your data, then you may want to remove the term $\beta_0$ from both regression models.

(2) I have just had a look at Wikipedia 'correlation coefficient', It has links to 'Pearson' and 'Spearman', with formulas for both types of sample correlation. (It is NOT necessary to do a regression first.) Also, some nice scatterplots of different kinds of data and some useful intuitive comments.

(3) This site is not a statistical consulting service, but if you can give me the data for, say, the orange line and the green curve, I could do the two regressions and explain essential features of the output. If you are free to do so, it would be helpful if you could briefly give the story behind the data and the units of measurement for x and y.