Correlation coefficient calculation

Question

1.7k Views Asked by Bumbble Comm At 27 Mar 2026 - 7:47

Why do we remove the mean of the data when calculating the correlation coefficient of bivariate data in statistics? DotProduct/ProductOfLengthOfVectors should always give a coefficient between -1 and 1 anyway. What does removing the mean achieve?

**Bumbble Comm** · Answer 1 · 2014-06-19 22:53:30

Consider the following datapoints $(X,Y)$: $(102,98)$, $(101,99)$, $(100,100)$, $(99,101)$, $(98,102)$.

Clearly, $Y = 200 - X$ for all these datapoints, so $Y$ is negatively related to $X$.

The correlation (when we do subtract the mean of the data first) is $-1$. If we compute the correlation without removing the mean, we get $\dfrac{49990}{50010} \approx 0.9996$.

We still get a number between $-1$ and $1$, but the result doesn't quite tell us that $X$ and $Y$ are negatively related. This is why we subtract out the mean first.
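The two numbers can be checked directly. Below is a minimal sketch in plain Python using the five datapoints above; the helper names `cosine` and `center` are made up for this illustration and do not come from any library.

```python
import math

def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def center(u):
    """Subtract the mean of u from every entry."""
    mean = sum(u) / len(u)
    return [a - mean for a in u]

x = [102, 101, 100, 99, 98]
y = [98, 99, 100, 101, 102]

uncentered = cosine(x, y)                 # no mean removal: 49990 / 50010
centered = cosine(center(x), center(y))   # usual correlation coefficient

print(f"uncentered: {uncentered:.4f}")    # uncentered: 0.9996
print(f"centered:   {centered:.4f}")      # centered:   -1.0000
```

Without centering, both vectors point almost in the same direction as $(100, 100, 100, 100, 100)$, so the angle between them is tiny and the small opposing deviations barely register.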

EDIT: To better answer your question, by defining $\rho(X,Y) = \tfrac{\vec{X}-\mu_X}{||\vec{X}-\mu_X||} \cdot \tfrac{\vec{Y}-\mu_Y}{||\vec{Y}-\mu_Y||}$, the correlation is scale and shift invariant. This means that replacing $X$ with $aX+b$ and $Y$ with $cY+d$, where $a, c > 0$, does not change the correlation (if exactly one of $a$, $c$ is negative, only the sign flips). So, if you measure your data on a different scale (like measuring temperature data using Celsius instead of Kelvin), the correlation does not change. If you do not subtract out the mean, the coefficient is still unchanged by positive rescaling, but it is no longer shift invariant.
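As a rough check of the invariance claim, the sketch below compares the centered and uncentered coefficients under a change of temperature units. The data values and the names `cosine`, `pearson`, `temps_celsius`, and `response` are hypothetical, chosen only for illustration.

```python
import math

def cosine(u, v):
    """Uncentered coefficient: dot product over product of lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(u, v):
    """Centered coefficient: subtract each mean, then take the cosine."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    return cosine([a - mu_u for a in u], [b - mu_v for b in v])

# Hypothetical measurements, made up for this example.
temps_celsius = [10.0, 15.0, 20.0, 25.0, 30.0]
response = [3.0, 4.5, 4.0, 6.5, 7.0]

# Pure shift (Kelvin) and shift plus positive scale (Fahrenheit).
temps_kelvin = [t + 273.15 for t in temps_celsius]
temps_fahrenheit = [1.8 * t + 32.0 for t in temps_celsius]

r_c = pearson(temps_celsius, response)
r_k = pearson(temps_kelvin, response)
r_f = pearson(temps_fahrenheit, response)

cos_c = cosine(temps_celsius, response)
cos_k = cosine(temps_kelvin, response)

print("centered:  ", r_c, r_k, r_f)   # agree up to rounding error
print("uncentered:", cos_c, cos_k)    # differ once the data is shifted
```

The centered values agree up to floating-point rounding, while the uncentered value changes as soon as the temperatures are shifted to Kelvin.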

**Bumbble Comm** · Answer 2 · 2014-06-20 01:33:56

Suppose we have finite samples $\{x_1,x_2,\ldots,x_n\}$ and $\{y_1,y_2,\ldots,y_n\}$ from two distributions with sample means:

$$\bar{X}=\frac1{n}\sum_{i=1}^{n}x_i, \qquad \bar{Y}=\frac1{n}\sum_{i=1}^{n}y_i,$$

and sample variances

$$S_X^2=\frac1{n}\sum_{i=1}^{n}(x_i-\bar{X})^2, \qquad S_Y^2=\frac1{n}\sum_{i=1}^{n}(y_i-\bar{Y})^2.$$

Normally we subtract means to calculate the correlation as

$$\rho=\frac{\frac1{n}\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})}{S_XS_Y}. $$

The Cauchy-Schwarz inequality can be applied to show that $|\rho| \leq 1.$

However, if we do not subtract means, then the sample correlation estimate will also fall between $-1$ and $1$ -- as long as we are consistent in not subtracting the means for the estimates of variance. In this case, the Cauchy-Schwarz inequality shows that

$$\left|\sum_{i=1}^{n}x_iy_i\right|\leq \sqrt{\sum_{i=1}^{n}x_i^2}\sqrt{\sum_{i=1}^{n}y_i^2},$$

and therefore, provided neither sum of squares is zero,

$$\left|\frac{\sum_{i=1}^{n}x_iy_i}{\sqrt{\sum_{i=1}^{n}x_i^2}\sqrt{\sum_{i=1}^{n}y_i^2}}\right|\leq 1.$$

Either way, as long as the treatment of the means is consistent, the estimated correlation will fall between $-1$ and $1$. However, the choice of subtracting the means may be relevant in terms of estimating the variances without bias.
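To see how the two versions are related, it may help to expand the centered sums. The identities below are standard algebra in the notation of this answer, not something taken from a particular reference:

$$\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})=\sum_{i=1}^{n}x_iy_i-n\bar{X}\bar{Y}, \qquad \sum_{i=1}^{n}(x_i-\bar{X})^2=\sum_{i=1}^{n}x_i^2-n\bar{X}^2.$$

So the two coefficients coincide when $\bar{X}=\bar{Y}=0$. Otherwise the uncentered version also picks up the terms $n\bar{X}\bar{Y}$, $n\bar{X}^2$ and $n\bar{Y}^2$, which depend on where the data sits relative to the origin rather than on how $x_i$ and $y_i$ vary together.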
