How can I simply prove that the Pearson correlation coefficient is between -1 and 1?

For building a recommendation system, I also use the Pearson correlation coefficient. This is the definition:

$r(x, y)=\frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2 \cdot \sum_{i=1}^n (y_i-\bar{y})^2}}$

Here all $x_i$ and $y_i$ are real numbers, and $\bar{x}$ and $\bar{y}$ are their respective means.

Now for coding, it is important to take care of all potential outcomes. For example, if the denominator is zero, you will have to filter that or throw an exception.

I came up with some arguments, one of them being that if all values of $x_i$ and/or $y_i$ were equal to the average of $x$ and/or $y$, then the denominator would be zero.
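The zero-denominator case described above can be sketched in Python as follows. The function name `pearson`, the choice to return `None` rather than raise an exception, the exact comparison with zero, and the final clamp to $[-1, 1]$ (to absorb floating-point rounding) are all assumptions made for illustration, not part of the original question.

```python
import math


def pearson(xs, ys):
    """Return Pearson's r for two equal-length sequences, or None if undefined.

    Hypothetical helper: returning None for the undefined case is an
    illustrative choice; raising an exception would work equally well.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        return None

    mean_x = math.fsum(xs) / n
    mean_y = math.fsum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]

    sxx = math.fsum(d * d for d in dx)  # sum of squared deviations of x
    syy = math.fsum(d * d for d in dy)  # sum of squared deviations of y

    # The denominator is zero exactly when every x_i equals the mean of x,
    # or every y_i equals the mean of y.
    if sxx == 0.0 or syy == 0.0:
        return None

    r = math.fsum(a * b for a, b in zip(dx, dy)) / math.sqrt(sxx * syy)

    # Mathematically r already lies in [-1, 1]; the clamp only guards
    # against rounding error pushing it slightly outside.
    return max(-1.0, min(1.0, r))
```

With floating-point input, an exactly constant sequence may not yield an exactly zero sum of squares, so a small tolerance in place of `== 0.0` may be preferable in practice.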

But how can I prove that the coefficient is either undefined (zero denominator) or in between -1 and 1? What is the best approach?

There are 4 best solutions below

3
On BEST ANSWER

First of all, Pearson's correlation coefficient is bounded between -1 and 1, not 0 and 1. Its absolute value is bounded between 0 and 1, and that will be useful later.

Pearson's correlation coefficient is simply this ratio:

$$\rho = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$$

Both of the variances are non-negative by definition, so the denominator is $\ge 0$. The only way a singularity can occur is if at least one of the variables has 0 variance, in which case the coefficient is undefined.

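If two random variables are independent, then their covariance is 0, i.e. they are uncorrelated. (The converse does not hold in general: uncorrelated variables need not be independent.) So 0 is attained, and it is a valid lower bound for the absolute value of the expression.

This can be shown like so, writing $\bar{X}$ for $E[X]$ and $\bar{Y}$ for $E[Y]$: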

$$Cov(X,Y) = E[(X-\bar{X})(Y-\bar{Y})] = E[XY] - E[X]E[Y]$$

If two random variables are independent, then $E[XY]=E[X]E[Y]$, and

$$Cov(X,Y) = E[XY] - E[X]E[Y] = E[X]E[Y] - E[X]E[Y] = 0.$$
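Note that the implication runs one way only: zero covariance does not by itself imply independence. As a minimal sketch, assume a hypothetical example in which $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$. Exact fractions show that the covariance is 0 even though $Y$ is completely determined by $X$.

```python
from fractions import Fraction

# Assumed toy distribution: X is uniform on {-1, 0, 1} and Y = X**2,
# so Y is a deterministic function of X (clearly dependent).
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in support)            # E[X]  = 0
e_y = sum(p * x**2 for x in support)         # E[Y]  = 2/3
e_xy = sum(p * x * x**2 for x in support)    # E[XY] = E[X^3] = 0
cov = e_xy - e_x * e_y

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0).
p_joint = p        # X = 0 forces Y = 0, so the joint probability is 1/3
p_product = p * p  # P(X=0) * P(Y=0) = 1/3 * 1/3 = 1/9

print(cov, p_joint == p_product)  # prints: 0 False
```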

Now for the upper bound. Here we apply the Cauchy-Schwarz inequality.

$$|Cov(X,Y)|^2 \le Var(X)Var(Y)$$

$$\therefore |Cov(X,Y)| \le \sqrt{Var(X)Var(Y)}$$

Plug this result from the Cauchy-Schwarz inequality into the formula for $\rho$, and we get:

$$|\rho| = \left|\frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}\right| \le \frac{\sqrt{Var(X)Var(Y)}}{\sqrt{Var(X)Var(Y)}} = 1$$

Thus the absolute value of the correlation is bounded below by 0 and above by 1, which is the same as saying $-1 \le \rho \le 1$.
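For the sample formula in the question, the Cauchy-Schwarz step can also be checked directly. The following is a standard argument, sketched with notation introduced here for illustration: $a_i = x_i - \bar{x}$, $b_i = y_i - \bar{y}$, and an auxiliary real number $t$. Assume $\sum_i a_i^2 > 0$, since otherwise the coefficient is undefined. For every real $t$,

$$0 \le \sum_{i=1}^n (t a_i + b_i)^2 = t^2 \sum_{i=1}^n a_i^2 + 2t \sum_{i=1}^n a_i b_i + \sum_{i=1}^n b_i^2.$$

A quadratic in $t$ that is never negative cannot have two distinct real roots, so its discriminant is at most 0:

$$\left(2\sum_{i=1}^n a_i b_i\right)^2 - 4 \sum_{i=1}^n a_i^2 \sum_{i=1}^n b_i^2 \le 0 \quad\Longrightarrow\quad \left|\sum_{i=1}^n a_i b_i\right| \le \sqrt{\sum_{i=1}^n a_i^2 \cdot \sum_{i=1}^n b_i^2}.$$

Dividing by the right-hand side, which is the denominator of $r(x, y)$, gives $|r(x, y)| \le 1$.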

4
On

Here's a standalone proof that should be much easier to understand and justify than quoting Cauchy-Schwarz. It's taken from Section 7.4 of Sheldon Ross's A First Course in Probability, 10th edition, which I highly recommend.

Let $X$ and $Y$ be random variables with respective variances $\mathrm{Var}(X) = \sigma_x^2$ and $\mathrm{Var}(Y) = \sigma_y^2$, both positive so that the correlation is defined. We then have

$$ 0 \leq \mathrm{Var} \left( \frac{X}{\sigma_x} \pm \frac{Y}{\sigma_y} \right) = 2 \pm 2 \mathrm{Corr} (X,Y). $$

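Taking the plus sign gives $\mathrm{Corr}(X,Y) \geq -1$ and taking the minus sign gives $\mathrm{Corr}(X,Y) \leq 1$, so it immediately follows that $-1 \leq \mathrm{Corr} (X,Y) \leq 1$.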

0
On

Saw this while looking around. To complete the proof above, expand the variance of the sum: $$ 0\leq Var\left(\frac{X}{\sigma_x}\pm\frac{Y}{\sigma_y}\right) = Var\left(\frac{X}{\sigma_x}\right) + Var\left(\frac{Y}{\sigma_y}\right) \pm 2Cov\left(\frac{X}{\sigma_x},\frac{Y}{\sigma_y}\right) $$

where, since $Var(aX)=a^2Var(X)$ (and likewise for $Y$), $$Var\left(\frac{X}{\sigma_x}\right) = \frac{1}{\sigma_x^2}Var(X)=\frac{\sigma_x^2}{\sigma_x^2}=1$$

and, since $Cov(aX,bY)=abCov(X,Y)$, $$Cov\left(\frac{X}{\sigma_x},\frac{Y}{\sigma_y}\right) = \frac{1}{\sigma_x\sigma_y}Cov(X,Y) = Corr(X,Y)$$

Hence $$ Var\left(\frac{X}{\sigma_x}\pm\frac{Y}{\sigma_y}\right) = 1+1\pm2Corr(X,Y)$$
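The identity can be checked numerically on a small data set, treating the data as an empirical distribution (so variances divide by $n$). This is only a sketch: the helper names `pvar`, `pcov`, and `variance_identity`, and the sample values, are made up for illustration.

```python
import math


def pvar(v):
    """Population variance (divides by n)."""
    m = math.fsum(v) / len(v)
    return math.fsum((a - m) ** 2 for a in v) / len(v)


def pcov(u, v):
    """Population covariance (divides by n)."""
    mu = math.fsum(u) / len(u)
    mv = math.fsum(v) / len(v)
    return math.fsum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def variance_identity(xs, ys):
    """Return [(lhs, rhs)] for the '+' and '-' cases of
    Var(X/sx +/- Y/sy) = 2 +/- 2 Corr(X, Y).
    Assumes both sequences have non-zero variance."""
    sx, sy = math.sqrt(pvar(xs)), math.sqrt(pvar(ys))
    corr = pcov(xs, ys) / (sx * sy)
    pairs = []
    for sign in (+1, -1):
        z = [x / sx + sign * y / sy for x, y in zip(xs, ys)]
        pairs.append((pvar(z), 2 + sign * 2 * corr))
    return pairs


# Illustrative data only.
for lhs, rhs in variance_identity([1.0, 2.0, 4.0, 7.0], [3.0, 1.0, 5.0, 6.0]):
    print(lhs, rhs)  # the two numbers agree up to rounding, and lhs >= 0
```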

1
On

Let $\vec{a} = (x_1-\bar{x}, x_2-\bar{x}, \ldots, x_n-\bar{x})$ and $\vec{b}=(y_1-\bar{y},\ldots,y_n-\bar{y})$. Then your formula is just $\frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}$, which is defined whenever both vectors are non-zero. Since $\vec{a} \cdot \vec{b} = \|\vec{a}\| \|\vec{b}\| \cos(\theta)$, your formula reduces to $\cos(\theta)$, where $\theta$ is the angle between $\vec{a}$ and $\vec{b}$ in $n$-dimensional Euclidean space. Because $-1 \le \cos(\theta) \le 1$, the coefficient lies between -1 and 1.
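To make the geometric picture concrete, here is a rough sketch that centers two small sequences and computes the cosine of the angle between the resulting vectors. The helper names `centered` and `cosine` and the sample data are assumptions for illustration.

```python
import math


def centered(v):
    """Subtract the mean from every entry."""
    m = math.fsum(v) / len(v)
    return [a - m for a in v]


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = math.fsum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(math.fsum(p * p for p in a))
    norm_b = math.sqrt(math.fsum(q * q for q in b))
    return dot / (norm_a * norm_b)


# Illustrative data only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]

a, b = centered(xs), centered(ys)
r = cosine(a, b)                          # equals the Pearson coefficient
theta = math.degrees(math.acos(max(-1.0, min(1.0, r))))
print(r, theta)                           # r is about 0.6, theta about 53 degrees
```

Here the centered vectors are $(-1.5, -0.5, 0.5, 1.5)$ and $(-0.5, -1.5, 1.5, 0.5)$, with dot product 3 and squared norms 5 each, so the cosine is $3/5$.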