Pearson's R and Correlation formula


I'm trying to make sense out of Pearson's $R$ and Pearson's correlation coefficient. I'm not sure I really see a difference.

Let me just clear out any confusion, for me Pearson's $R$ is:

$$ R = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}} $$

And Pearson's correlation coefficient is:

$$ C = \frac{n\sum xy - \sum x \sum y}{\sqrt{(n\sum x^2 - (\sum x)^2)(n\sum y^2 - (\sum y)^2)}} $$
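To see concretely how the two formulas behave, here is a minimal sketch (the data values are made up for illustration) that computes both on the same raw data. Since all the values are positive, the uncentered $R$ comes out larger than $C$:

```python
import math

def pearson_R(x, y):
    """Uncentered 'R': correlation about the origin (no mean subtraction)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_C(x, y):
    """Standard Pearson correlation coefficient (the formula for C above)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
print(pearson_R(x, y))  # close to 1, inflated because all values are positive
print(pearson_C(x, y))  # the usual correlation, smaller than R here
```

On this data $R \approx 0.992$ while $C \approx 0.965$, which illustrates why the two can "give really different results" on raw (uncentered) data.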

The way I see it is that $C$ gives us something between $-1$ and $1$. Then we can test its significance, in other words whether the coefficient is meaningfully different from zero and whether we should keep it.

But my problem is that I don't see a conceptual difference from Pearson's $R$. $R$ is also between $-1$ and $1$, and it is also an indicator of the relationship between the two variables, yet it gives really different results than the coefficient. Are they just two unrelated ways of measuring correlation? I am confused that the two give such different results, and I am sure I am missing something about when to use $R$ versus $C$, or something in the bigger picture.

Thanks for any help in advance.

There are 2 answers below.

Answer 1:
If your data $x_1, \dots, x_n$ have mean $0$ (and the $y$ values do as well), then $R$ and $C$ are equal. The coefficient you define as $C$ is what is standardly called Pearson's $r$ (a.k.a. the correlation coefficient), and it is the one you should use in general; in the special case where $x$ and $y$ both have mean $0$ you can use the simpler formula $R$, since $C = R$ there.
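This claim is easy to check numerically: center both series by subtracting their means, and the simple formula $R$ applied to the centered data reproduces $C$ exactly (sample data here is made up for illustration):

```python
import math

def R(x, y):
    """Uncentered correlation about the origin."""
    return sum(a * b for a, b in zip(x, y)) / math.sqrt(
        sum(a * a for a in x) * sum(b * b for b in y))

def C(x, y):
    """Standard Pearson correlation coefficient."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    num = n * sum(a * b for a, b in zip(x, y)) - sx * sy
    den = math.sqrt((n * sum(a * a for a in x) - sx ** 2) *
                    (n * sum(b * b for b in y) - sy ** 2))
    return num / den

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]

# Subtract the means so both series have mean 0.
mx, my = sum(x) / len(x), sum(y) / len(y)
xc = [a - mx for a in x]
yc = [b - my for b in y]

print(C(x, y))    # standard Pearson correlation on the raw data
print(R(xc, yc))  # same value: R on centered data equals C
print(C(xc, yc))  # centering does not change C at all
```

Note that $C$ is unchanged by centering (the formula subtracts the means internally), while $R$ only agrees with it once the means are removed.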

Answer 2:

If the average of the $x$ values and the average of the $y$ values are both $0$, then $R$ and $C$ are both the same.

The one you call $C$ is the one I've always seen called $R$.

One circumstance in which it makes sense to use $R$ rather than $C$ is when the $x$- and $y$-values are both small samples from a large population and you KNOW that the population mean is $0$. In that case, a positive value of $R$ tells you that any bias in your sampling process that causes one variable to deviate from the population average (e.g. you get above-average or below-average $x$-values) would also cause the other variable to deviate in the SAME direction: if one of them is above average then so is the other, and if one is below average then so is the other. A negative $R$ value would mean that they tend to deviate in opposite directions (if one is above average then the other is below average).