Why do we remove the mean of the data when calculating the correlation coefficient of bivariate data in statistics? DotProduct/ProductOfLengthOfVectors should always give a coefficient between -1 and 1 anyway. What does removing the mean achieve?
Correlation coefficient calculation
Asked by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail). There are 2 best solutions below.
Suppose we have finite samples $\{x_1,x_2,\ldots,x_n\}$ and $\{y_1,y_2,\ldots,y_n\}$ from two distributions with sample means:
$$\bar{X}=\frac1{n}\sum_{i=1}^{n}x_i, \qquad \bar{Y}=\frac1{n}\sum_{i=1}^{n}y_i,$$
and sample variances
$$S_X^2=\frac1{n}\sum_{i=1}^{n}(x_i-\bar{X})^2, \qquad S_Y^2=\frac1{n}\sum_{i=1}^{n}(y_i-\bar{Y})^2.$$
Normally we subtract means to calculate the correlation as
$$\rho=\frac{\frac1{n}\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})}{S_XS_Y}. $$
The Cauchy-Schwarz inequality can be applied to show that $|\rho| \leq 1.$
However, if we do not subtract means, then the sample correlation estimate will also fall between $-1$ and $1$ -- as long as we are consistent in not subtracting the means for the estimates of variance. In this case, the Cauchy-Schwarz inequality shows that
$$\left|\sum_{i=1}^{n}x_iy_i\right|\leq \sqrt{\sum_{i=1}^{n}x_i^2}\sqrt{\sum_{i=1}^{n}y_i^2},$$
and therefore
$$-1\leq\frac{\sum_{i=1}^{n}x_iy_i}{\sqrt{\sum_{i=1}^{n}x_i^2}\sqrt{\sum_{i=1}^{n}y_i^2}}\leq 1.$$
Either way, as long as the treatment of the means is consistent, the estimated correlation will fall between $-1$ and $1$. However, subtracting the means matters for what is being estimated: without it, $\frac1{n}\sum_{i=1}^{n}x_i^2$ and $\frac1{n}\sum_{i=1}^{n}y_i^2$ estimate the second moments $E[X^2]$ and $E[Y^2]$ rather than the variances.
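To make the two versions concrete, here is a minimal sketch in plain Python. The function names `pearson` and `uncentered`, the seed, and the simulated data are illustrative assumptions, not part of the answer above. The $\frac1{n}$ factors cancel in the ratio, so they are omitted.

```python
import math
import random


def uncentered(xs, ys):
    """Dot product divided by the product of vector lengths (no mean removal)."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    return dot / (norm_x * norm_y)


def pearson(xs, ys):
    """Same ratio, computed after subtracting each sample mean."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return uncentered([x - mean_x for x in xs], [y - mean_y for y in ys])


# Hypothetical data: x has a mean far from zero, y depends on x plus noise.
random.seed(0)
xs = [random.gauss(10.0, 2.0) for _ in range(200)]
ys = [0.5 * x + random.gauss(0.0, 1.0) for x in xs]

r_centered = pearson(xs, ys)
r_uncentered = uncentered(xs, ys)
print(r_centered, r_uncentered)  # both lie in [-1, 1], but they differ
```

Both numbers are bounded by Cauchy-Schwarz, but when the means are far from zero the uncentered ratio is dominated by the means rather than by how the two samples vary together.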
Consider the following datapoints $(X,Y)$: $(102,98)$, $(101,99)$, $(100,100)$, $(99,101)$, $(98,102)$.
Clearly, $Y = 200 - X$ for all these datapoints, so $Y$ is negatively related to $X$.
The correlation (when we do subtract the mean of the data first) is $-1$. If we compute the correlation without removing the mean, we get $\dfrac{49990}{50010} \approx 0.9996$.
We still get a number between $-1$ and $1$, but the result suggests a nearly perfect positive relation when $X$ and $Y$ are in fact negatively related. This is why we subtract out the mean first.
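The arithmetic for these five datapoints can be checked with a short script. This is only a sketch; the helper name `cosine` is an assumption chosen for illustration.

```python
import math

points = [(102, 98), (101, 99), (100, 100), (99, 101), (98, 102)]
xs = [p[0] for p in points]
ys = [p[1] for p in points]


def cosine(us, vs):
    """Dot product over the product of Euclidean lengths."""
    dot = sum(u * v for u, v in zip(us, vs))
    return dot / math.sqrt(sum(u * u for u in us) * sum(v * v for v in vs))


mean_x = sum(xs) / len(xs)  # 100.0
mean_y = sum(ys) / len(ys)  # 100.0

# Centered: deviations are (2, 1, 0, -1, -2) and (-2, -1, 0, 1, 2).
centered = cosine([x - mean_x for x in xs], [y - mean_y for y in ys])

# Uncentered: sum of x*y is 49990, sums of squares are both 50010.
raw = cosine(xs, ys)

print(centered)  # approximately -1.0
print(raw)       # approximately 0.9996
```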
EDIT: To better answer your question, by defining $\rho(X,Y) = \tfrac{\vec{X}-\mu_X}{||\vec{X}-\mu_X||} \cdot \tfrac{\vec{Y}-\mu_Y}{||\vec{Y}-\mu_Y||}$, the correlation is scale and shift invariant. This means that replacing $X$ with $aX+b$ and $Y$ with $cY+d$, where $a, c > 0$, will not change the correlation. So, if you measure your data on a different scale (like measuring temperature data using Celsius instead of Kelvin), the correlation does not change. If you do not subtract out the mean, the coefficient is still scale invariant but no longer shift invariant.
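A small numerical check of the invariance claim, as a sketch only: the temperature readings and the second variable `ys` are made-up values, and the function names are assumptions. Kelvin is a pure shift of Celsius ($b = 273.15$), while Fahrenheit combines a positive scale and a shift ($a = 1.8$, $b = 32$).

```python
import math


def uncentered(xs, ys):
    """Dot product over the product of vector lengths, means not removed."""
    dot = sum(x * y for x, y in zip(xs, ys))
    return dot / math.sqrt(sum(x * x for x in xs) * sum(y * y for y in ys))


def pearson(xs, ys):
    """Mean-centered version of the same ratio."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return uncentered([x - mean_x for x in xs], [y - mean_y for y in ys])


# Hypothetical measurements.
celsius = [-5.0, 0.0, 5.0, 10.0, 15.0]
ys = [3.0, 1.0, 4.0, 1.0, 5.0]

kelvin = [c + 273.15 for c in celsius]          # shift only
fahrenheit = [1.8 * c + 32.0 for c in celsius]  # positive scale and shift

# Centered coefficient: the same in all three units (up to rounding).
print(pearson(celsius, ys), pearson(kelvin, ys), pearson(fahrenheit, ys))

# Uncentered coefficient: changes as soon as the data are shifted.
print(uncentered(celsius, ys), uncentered(kelvin, ys), uncentered(fahrenheit, ys))
```

The restriction to positive scale factors matters: a negative $a$ or $c$ would flip the sign of the centered coefficient.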