Meaning of denominator in correlation?

I can't quite grasp the meaning of the denominator in the correlation coefficient.

$$\frac{\sum(X - \bar X)(Y-\bar Y)}{\sqrt {\sum (X-\bar X)^2\sum(Y-\bar Y)^2}}$$

What exactly am I dividing by, and why?

I understood dividing by the standard deviation when computing a $Z$-score: that gives the distance from the mean in units of standard deviations. But what does this denominator give?

The covariance measured in... what, the standard deviation of $X$ times the standard deviation of $Y$?

That would explain where the $n$'s in the denominators (of the covariance as well as of the standard deviations) have gone, but what does that mean?

There are 2 best solutions below

Do you know the scalar/inner product of vectors on $\mathbb{R}^3$ or $\mathbb{R}^n$?

$\hat{u} \cdot \hat{v} = \|\hat{u}\|\,\|\hat{v}\|\cos\theta$, or

$$\cos\theta = \frac{\hat{u} \cdot \hat{v}}{\|\hat{u}\|\|\hat{v}\|} = \frac{\hat{u}}{\|\hat{u}\|}\cdot\frac{\hat{v}}{\|\hat{v}\|} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2 }\sqrt{\sum_i v_i^2}}$$

Correlation is then analogous to finding the angle between two vectors, where the vectors are the mean-centered data, $u_i = X_i - \bar X$ and $v_i = Y_i - \bar Y$. The denominator normalizes the vectors, so that we are taking the scalar/inner product of two vectors of unit length.
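As an illustrative sketch (plain Python; the helper name `correlation` is my own), this computes $r$ literally as the cosine of the angle between the mean-centered data vectors:

```python
import math

def correlation(xs, ys):
    """Pearson correlation as the cosine of the angle between
    the mean-centered data vectors u and v."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    u = [x - mx for x in xs]  # centered X
    v = [y - my for y in ys]  # centered Y
    dot = sum(a * b for a, b in zip(u, v))      # numerator
    norm_u = math.sqrt(sum(a * a for a in u))   # ||u||
    norm_v = math.sqrt(sum(b * b for b in v))   # ||v||
    return dot / (norm_u * norm_v)              # cos(theta)

# A perfectly linear relationship gives cos(theta) = 1 (angle 0):
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```

A perfectly decreasing relationship gives $\cos\theta = -1$ (the centered vectors point in opposite directions), and unrelated data gives a value near $0$ (nearly orthogonal vectors).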


A way to better understand the denominator is to note that the correlation coefficient is (up to sign) the square root of the coefficient of determination $r^2=\frac{ss^2_{xy}}{ss_{xx}\, ss_{yy}}$. Here, $ss_{xy}=\sum_i (X_i-\bar X)(Y_i-\bar Y)$ is the sum of cross-products (the $xy$ covariance up to a factor of $n$, which cancels in the ratio), while $ss_{xx}=\sum_i (X_i-\bar X)^2$ and $ss_{yy}=\sum_i (Y_i-\bar Y)^2$ are the sums of squares (likewise the $x$ and $y$ variances up to a factor of $n$).

Writing it this way, we can interpret $r^2$ as the squared $xy$ covariance (i.e., a measure of how much $x$ and $y$ "change together", that is, show a similar behaviour) "normalized" by the two individual variances. Normalization is important for at least two reasons. First, it yields an index of correlation that is insensitive to linear transformations: rescaling or shifting either variable leaves $r$ unchanged. Second, it provides an index that reflects the actual strength of the relationship between the two variables more reliably than the non-normalized covariance.
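The invariance under linear transformations can be checked numerically. A sketch in plain Python (`pearson_r` is a hypothetical helper name, and the data are made up):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: cross-products normalized by sums of squares."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # ss_xy
    sxx = sum((x - mx) ** 2 for x in xs)                    # ss_xx
    syy = sum((y - my) ** 2 for y in ys)                    # ss_yy
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
# Apply an arbitrary linear transformation to x: r is unchanged,
# while the raw covariance would be scaled by the factor 100.
x_transformed = [100 * v + 7 for v in x]
print(pearson_r(x, y))              # 0.8
print(pearson_r(x_transformed, y))  # 0.8
```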

To better explain this last point: imagine two random variables with a relatively low dispersion of data, so that the deviations $X_i - \bar X$ and $Y_i-\bar Y$ are, on average, small. The resulting covariance would then probably be small as well, even if the two variables are closely related and show a "similar behaviour" (i.e., the signs of the paired deviations $X_i - \bar X$ and $Y_i-\bar Y$ consistently agree or consistently disagree). So, if we based our judgement on the covariance alone, we could erroneously conclude that the two variables are poorly correlated.

The opposite situation also occurs: if we take two poorly related random variables with a relatively high dispersion of data, the resulting covariance can be large despite the weak relationship. Again, judging by the covariance alone, we could erroneously conclude that the two variables are strongly related.

Normalizing by the variances of the two variables overcomes this problem. In the first case, we would get a relatively high $r^2$, as a result of dividing by the small quantities $ss_{xx}$ and $ss_{yy}$: this correctly tells us that the two variables are well correlated. Similarly, in the second case, we would get a relatively small $r^2$, as a result of dividing by the large quantities $ss_{xx}$ and $ss_{yy}$, again giving the correct picture.
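The two scenarios can be illustrated with made-up numbers (a sketch in plain Python; `covariance` and `pearson_r` are my own helper names):

```python
import math

def covariance(xs, ys):
    """Raw (non-normalized) covariance, ss_xy / n."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson_r(xs, ys):
    """Covariance normalized by the two dispersions."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Case 1: tiny dispersion, perfect linear relationship.
# The covariance is tiny, yet r = 1.
a, b = [0.01, 0.02, 0.03, 0.04], [0.02, 0.04, 0.06, 0.08]
print(covariance(a, b), pearson_r(a, b))  # ~0.00025, 1.0

# Case 2: large dispersion, weak relationship.
# The covariance is huge, yet r is only moderate.
c, d = [0, 100, 200, 300], [200, 0, 100, 300]
print(covariance(c, d), pearson_r(c, d))  # ~5000, 0.4
```

Ranking the pairs by raw covariance would get the comparison exactly backwards; the normalized $r$ gets it right.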

Lastly, the use of $r$ instead of $r^2$ to assess correlation somewhat resembles the use of the SD instead of the variance to describe the variability of data. Remember, however, that a key property of the SD is that, unlike the variance, it is expressed in the same units as the data. This difference does not hold for $r$ and $r^2$, since both are dimensionless.