Pearson's correlation formula - intuition behind the definition of the formula.


$$ r = \frac{ \sum z_x z_y }{n-1}\,, $$

where $$z_x = \frac{x_i - \bar{x}}{\sigma_x}$$ and

$$z_y = \frac{y_i - \bar{y}}{\sigma_y}$$

I came across the above formula for correlation when reading a statistics textbook. I have an intuitive understanding of what correlation is and why it is a defined statistic/parameter. What I don't understand is why the above formula for calculating the correlation coefficient is defined this way.

Isn't the correlation coefficient meant to measure correlation on the scale $ -1 \leq r \leq 1 $, where $r$ is the correlation coefficient? How does the above formula scale the value so that, for every possible distribution of two quantitative variables ($x$ and $y$), it always satisfies $ -1 \leq r \leq 1 $? Couldn't you have z-scores of 2 and 3, which when multiplied together give 6, causing the numerator to be greater than the denominator?

Also, is this the only way to define the correlation coefficient? I have seen other formulas for it in different textbooks and got confused as to why there is more than one definition.


There are 2 best solutions below


You are going to have to tell us how $z_x$ and $z_y$ are defined. The usual definition of the correlation coefficient is the covariance divided by the square root of the product of the variances. The $n-1$ term in your expression is an indication that unbiased sample variance estimators are being used.

So you should have $\rho_{x,y} = \dfrac{\sigma_{x,y}}{\sqrt{\sigma^2_x \sigma^2_y}}$ but since you are using a sample, you instead have $r_{x,y} = \dfrac{s_{x,y}}{\sqrt{s^2_x s^2_y}}$, and it seems you are using

  • $s_{x,y}=\frac{1}{n-1}\sum_i (x_i-\bar{x})(y_i-\bar{y})$
  • $s^2_{x}=\frac{1}{n-1}\sum_j (x_j-\bar{x})^2$
  • $s^2_{y}=\frac{1}{n-1}\sum_k (y_k-\bar{y})^2$
  • $z_{x_i}=\dfrac{(x_i-\bar{x})}{\sqrt{\frac{1}{n-1}\sum_j (x_j-\bar{x})^2 }}$
  • $z_{y_i}=\dfrac{(y_i-\bar{y})}{\sqrt{\frac{1}{n-1}\sum_k (y_k-\bar{y})^2 }}$

So

$$\begin{aligned}
r_{x,y} &= \dfrac{s_{x,y}}{\sqrt{s^2_x s^2_y}} \\
&= \dfrac{\frac{1}{n-1}\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n-1}\sum_j (x_j-\bar{x})^2 \, \frac{1}{n-1}\sum_k (y_k-\bar{y})^2}} \\
&= \frac1{n-1} \sum_i \frac{(x_i-\bar{x})}{\sqrt{\frac{1}{n-1}\sum_j (x_j-\bar{x})^2 }} \, \frac{(y_i-\bar{y})}{\sqrt{\frac{1}{n-1}\sum_k (y_k-\bar{y})^2 }} \\
&= \frac1{n-1} \sum_i z_{x_i} z_{y_i}
\end{aligned}$$
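The equality of the two forms can also be checked numerically. The following is a minimal sketch, assuming the $n-1$ sample standard deviation is used for the z-scores; the function names and the small data set are made up for illustration and do not come from the textbook in the question.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_std(values):
    # Sample standard deviation with the n-1 denominator (assumed convention).
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def r_from_z_scores(xs, ys):
    # r = (1/(n-1)) * sum of products of z-scores.
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_std(xs), sample_std(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

def r_from_covariance(xs, ys):
    # r = sample covariance / (s_x * s_y).
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (sample_std(xs) * sample_std(ys))

# Illustrative data only.
xs = [1.0, 2.0, 4.0, 7.0, 11.0]
ys = [2.0, 1.0, 5.0, 6.0, 12.0]

r_z = r_from_z_scores(xs, ys)
r_cov = r_from_covariance(xs, ys)
print(r_z, r_cov)  # the two values should agree up to rounding
```

If the population standard deviation (denominator $n$) were used for the z-scores instead, the sum would have to be divided by $n$ rather than $n-1$ for the two forms to match.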


As another answer has shown, for $n > 1,$

$$ r = \frac1{n-1} \sum_i z_{x_i} z_{y_i} = \frac{\frac1{n-1}\sum_i (x_i-\bar x)(y_i-\bar y)} {\sqrt{\frac1{n-1}\sum_j (x_j-\bar x)^2 \frac1{n-1}\sum_k (y_k-\bar y)^2}}, $$ that is, $$ r = \frac{\sum_i (x_i-\bar x)(y_i-\bar y)} {\sqrt{\sum_j (x_j-\bar x)^2 \sum_k (y_k-\bar y)^2}}. \tag1 $$

Now define $a_i = x_i-\bar x$ and $b_i = y_i-\bar y$ for $i = 1,\ldots,n.$ Then $$ r = \frac{\sum_i a_i b_i}{\sqrt{\sum_j a_j^2 \sum_k b_k^2}}. $$

The statement that $-1 \leq r \leq 1$ is equivalent to $r^2 \leq 1,$ which is equivalent to $$ \left(\sum_i a_i b_i \right)^2 \leq \sum_j a_j^2 \sum_k b_k^2, $$ which is the Cauchy-Schwarz inequality. You can find proofs of this inequality in Proof of Cauchy–Schwarz inequality and Proof of Cauchy-Schwarz Inequality.
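To connect this with the worry about a single product of z-scores exceeding the denominator: with the $n-1$ sample standard deviation, the squared z-scores of each variable sum to exactly $n-1$, so one large z-score uses up most of that budget and the remaining terms must be small. The sketch below is only an illustration under that assumption; the helper names and the data are hypothetical. It shows one product of z-scores of about 8.1 while $r$ still does not exceed 1, and then spot-checks the bound on random samples.

```python
import math
import random

def centered(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def pearson_r(xs, ys):
    # Equation (1): sum(a_i * b_i) / sqrt(sum(a_j^2) * sum(b_k^2)).
    a, b = centered(xs), centered(ys)
    num = sum(ai * bi for ai, bi in zip(a, b))
    den = math.sqrt(sum(ai * ai for ai in a) * sum(bi * bi for bi in b))
    return num / den

def z_scores(values):
    # z-scores using the n-1 sample standard deviation (assumed convention).
    a = centered(values)
    s = math.sqrt(sum(ai * ai for ai in a) / (len(values) - 1))
    return [ai / s for ai in a]

# Made-up data with one outlier: nine zeros and one large value.
xs = [0.0] * 9 + [10.0]
ys = [0.0] * 9 + [20.0]

zx, zy = z_scores(xs), z_scores(ys)
largest_product = max(p * q for p, q in zip(zx, zy))  # well above 1
sum_sq_zx = sum(p * p for p in zx)                     # equals n - 1 = 9
r_outlier = pearson_r(xs, ys)                          # still at most 1

# Spot-check |r| <= 1 on random samples (a check, not a proof).
rng = random.Random(0)
worst = 0.0
for _ in range(1000):
    n = rng.randint(2, 20)
    us = [rng.gauss(0.0, 1.0) for _ in range(n)]
    vs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    worst = max(worst, abs(pearson_r(us, vs)))

print(largest_product, sum_sq_zx, r_outlier, worst)
```

The random check only illustrates the inequality; the Cauchy-Schwarz argument above is what guarantees it for every data set.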

I cannot take credit for this answer, however, because it is just a simple adaptation of this answer to the question "How can I simply prove that the pearson correlation coefficient is between -1 and 1?" The only difference I can see between that question and this one is that the other question gives the Pearson coefficient in the form of Equation $(1)$.