Use Pearson Correlation to Assess Nonlinear Association?

I have two datasets $\{x_1, x_2, ..., x_n\}$ and $\{y_1, y_2, ..., y_n\}$. Each pair $(x_i, y_i)$ represents two properties of a system. Here is the visualization:

[Figure: plot of the pairs $(x_i, y_i)$]

The Pearson correlation coefficient of the two datasets is 0.66, which is not low.

I also sorted $\{y_1, y_2, ..., y_n\}$ and rearranged the corresponding elements of $\{x_1, x_2, ..., x_n\}$ accordingly, so that each pair $(x_i, y_i)$ stays together. Here is the figure:

[Figure: the data sorted by $y$, with each $x_i$ kept alongside its $y_i$]

We can see that $\{y_1, y_2, ..., y_n\}$ follows a power-law distribution, and the relationship between $\{x_1, x_2, ..., x_n\}$ and $\{y_1, y_2, ..., y_n\}$ is clearly visible.

My question is: is Pearson's Correlation Coefficient suitable for describing the relationship between the two datasets in this case? Are there any other quantitative ways to describe it?

There is 1 answer below.

Pearson's correlation $r$ measures the linear component of association.

Spearman's correlation uses ranks instead of the values. Roughly speaking, it measures the degree to which the two variables rise (or fall) together, regardless of whether the relationship is linear.

Here are fake data in which $y = x + e,$ where $e \sim \mathsf{Norm}(\mu=0, \sigma=4).$ Thus the relationship is basically linear, but with some noise. Pearson's and Spearman's correlations are both about 0.97.

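x = 1:50
e = rnorm(50, mean = 0, sd = 4)   # no seed is set, so values vary from run to run
y = x + e
cor(x, y)                         # Pearson (the default method)
cor(x, y, method = "spearman")    # Spearman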
## 0.9706131  # Pearson
## 0.9732053  # Spearman

However, if we consider the nonlinear relationship between $x$ and $y^3,$ Pearson's correlation is only about 0.84, whereas Spearman's correlation is unchanged: cubing is a strictly increasing transformation, so it leaves the ranks of $y$ as they were.

cor(x, y^3);  cor(x, y^3, method = "spearman")
## 0.8389872  # Pearson
## 0.9732053  # Spearman

Here are plots of the two relationships.

[Figure: plots of $y$ against $x$ and of $y^3$ against $x$]

You believe you have an exponential relationship in your data. That would be farther from linear than the cubic relationship illustrated here. Bottom line: I think you should consider using Spearman's correlation.
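
As a rough illustration of that point, here is a minimal sketch with made-up data. The curve `exp(x2 / 5)` and the names `x2`, `y_exp`, `r_pearson` and `r_spearman` are assumptions chosen for the example, not the asker's data. For a noise-free, strictly increasing curve, Spearman's correlation equals 1 (up to floating point), while Pearson's is lower because the curve is far from a straight line.

```r
# Hypothetical, noise-free exponential curve (not the asker's data)
x2    = 1:50
y_exp = exp(x2 / 5)

r_pearson  = cor(x2, y_exp)                       # measures the linear component only
r_spearman = cor(x2, y_exp, method = "spearman")  # Pearson's correlation of the ranks

r_pearson
r_spearman   # 1 up to floating point, since y_exp is strictly increasing in x2
```

With real, noisy data Spearman's correlation will fall below 1, but it should not be affected by the curvature itself.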

You can read specific information about Spearman's correlation online or in a basic statistics text. Briefly, it is Pearson's correlation of ranks.

rx = rank(x);  ry = rank(y);  cor(rx, ry)
## 0.9732053
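
In symbols, the rank computation above can be written as follows. The notation $R_i$, $S_i$ and $d_i$ is introduced here for illustration: $R_i$ is the rank of $x_i$ among the $x$ values and $S_i$ is the rank of $y_i$ among the $y$ values.

$$
r_s = \frac{\sum_{i=1}^{n} (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_{i=1}^{n} (R_i - \bar{R})^2 \, \sum_{i=1}^{n} (S_i - \bar{S})^2}}
$$

When there are no ties, this reduces to the familiar shortcut formula, with $d_i = R_i - S_i$:

$$
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
$$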