Correlation Coefficient

I am trying to understand the following equation for the correlation coefficient:

$r = \frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^{n}(x_i - \bar x)^2\sum_{i=1}^{n}(y_i-\bar y)^2}}$

Can someone dissect this equation and provide reasoning as to why this equation does what it does, producing $-1 \le r \le 1$ and showing the relationship between two variables? Or how this was derived?

Thank you as always.

Best answer

Let $\mathbf{z}$ be a vector in $n$-dimensional space (think in two dimensions, $n=2$, if that is easier). In terms of a given coordinate system with orthonormal unit vectors $(\mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_n)$, the vector can be expressed as the sum of its components $$ \mathbf{z} = z_1\mathbf{e}_1 + z_2\mathbf{e}_2 + \dots + z_n\mathbf{e}_n \,. $$ The magnitude (length) of the vector is $$ \lVert \mathbf{z} \rVert = \sqrt{\mathbf{z}\cdot\mathbf{z}} = \sqrt{z_1^2 + z_2^2 + \dots + z_n^2} \,. $$ Therefore, the unit-length vector in the direction of $\mathbf{z}$ (for $\mathbf{z} \ne \mathbf{0}$) is $$ \hat{\mathbf{z}} = \frac{\mathbf{z}}{\lVert \mathbf{z} \rVert} \,. $$

Now consider two such vectors $\mathbf{v}$ and $\mathbf{w}$. The unit vectors in the directions of these two vectors are $$ \hat{\mathbf{v}} = \frac{\mathbf{v}}{\lVert \mathbf{v} \rVert} \quad \text{and} \quad \hat{\mathbf{w}} = \frac{\mathbf{w}}{\lVert \mathbf{w} \rVert} \,. $$ The inner product (dot product) of these unit vectors is $$ \hat{\mathbf{v}}\cdot\hat{\mathbf{w}} = \frac{\mathbf{v} \cdot \mathbf{w}}{\lVert \mathbf{v} \rVert \lVert \mathbf{w} \rVert} = \frac{v_1w_1 +v_2 w_2 + \dots + v_n w_n}{\sqrt{v_1^2+v_2^2+\dots+v_n^2}\sqrt{w_1^2+w_2^2+\dots+w_n^2}} = \cos\theta $$ where $\theta$ is the angle between these vectors. Being a cosine, this quantity must lie between $-1$ and $+1$; in $n$ dimensions that bound is the Cauchy-Schwarz inequality $\lvert \mathbf{v}\cdot\mathbf{w} \rvert \le \lVert \mathbf{v} \rVert \lVert \mathbf{w} \rVert$.
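As a quick numerical check of the dot-product formula above, here is a minimal sketch in plain Python. The function name `cosine_between` and the sample vectors are illustrative choices, not part of the original answer.

```python
import math

def cosine_between(v, w):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    dot = sum(vi * wi for vi, wi in zip(v, w))      # v . w
    norm_v = math.sqrt(sum(vi * vi for vi in v))    # ||v||
    norm_w = math.sqrt(sum(wi * wi for wi in w))    # ||w||
    if norm_v == 0 or norm_w == 0:
        raise ValueError("the angle is undefined for a zero vector")
    return dot / (norm_v * norm_w)

print(cosine_between([1.0, 0.0], [0.0, 2.0]))    # perpendicular vectors
print(cosine_between([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine_between([1.0, 2.0], [-3.0, -6.0]))  # opposite directions
```

Up to floating-point rounding, the three calls give $0$, $+1$ and $-1$: the lengths of the vectors drop out and only the angle between them matters.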

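In your case, $$ \mathbf{v} = \mathbf{x} - \bar{x}\mathbf{1} \quad \text{and} \quad \mathbf{w} = \mathbf{y} - \bar{y}\mathbf{1} $$ where $\mathbf{1} = \mathbf{e}_1 + \mathbf{e}_2 + \dots + \mathbf{e}_n$. Therefore, $$ \hat{\mathbf{v}}\cdot\hat{\mathbf{w}} = \frac{(\mathbf{x} - \bar{x}\mathbf{1})\cdot (\mathbf{y}- \bar{y}\mathbf{1})}{\lVert \mathbf{x} - \bar{x}\mathbf{1} \rVert \lVert \mathbf{y}- \bar{y}\mathbf{1} \rVert} := r $$ which implies that $r$ must lie between $-1$ and $1$. Written out in components, the numerator is $\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)$ and the two norms are $\sqrt{\sum_{i=1}^{n}(x_i-\bar x)^2}$ and $\sqrt{\sum_{i=1}^{n}(y_i-\bar y)^2}$, which is the formula in the question. The shift centers the data: the components of $\mathbf{v}$ and $\mathbf{w}$ are the deviations of each observation from its mean, so $r$ does not change when a constant is added to all of the $x$ values or to all of the $y$ values.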

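So all the calculation does is find the cosine of the angle between the centered $x$ values and the centered $y$ values, that is, the projection of the unit vector along one onto the unit vector along the other, in an $n$-dimensional space.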
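The whole recipe (center both data vectors, then take the cosine of the angle between them) can be sketched in a few lines of plain Python. The name `pearson_r` and the sample data are made up for illustration; this is one straightforward way to write it, not a reference implementation.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient as the cosine between mean-centered vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    v = [x - x_bar for x in xs]   # v = x - x_bar * 1
    w = [y - y_bar for y in ys]   # w = y - y_bar * 1
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    if norm_v == 0 or norm_w == 0:
        raise ValueError("r is undefined when one variable is constant")
    return dot / (norm_v * norm_w)

# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(xs, ys)
print(r)

# Shifting either variable, or rescaling it by a positive factor,
# leaves the angle between the centered vectors (and hence r) unchanged.
r_shifted = pearson_r([x + 100.0 for x in xs], [3.0 * y - 7.0 for y in ys])
print(r_shifted)
```

For this data the centered vectors are $(-2,-1,0,1,2)$ and $(-2,0,1,0,1)$, so $r = 6/\sqrt{10 \cdot 6} = 6/\sqrt{60}$, and the shifted and rescaled data give the same value up to rounding.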