I am not a math geek, so I have a problem comprehending this equation:
equation for the correlation coefficient
It's basically the formula used in the CORREL function in Microsoft Excel, and I am trying to compute it manually for the sake of learning.
So let's say I have data1, which contains 3,2,4,5,6, and data2, which has 9,7,12,15,17.
So the average of data1 is 4 and the average of data2 is 12 (that much I know), but what do the X and Y mean?
Is x the sum of data1 and y the sum of data2?
And what does the Greek letter in the equation mean?
Basically I just want someone who can explain it to me in plain English.
So I hope you don't mind explaining it to me like I'm a kid :)
Thanks in advance for your help.
source: https://support.office.com/en-us/article/CORREL-function-995dcef7-0c0a-4bed-a3fb-239d7b68ca92
Little $x$ is a list of data. Little $y$ is another list of data. In other words $x = (x_1,x_2,\dotsc,x_n)$ and $y = (y_1,y_2,\dotsc,y_n)$. The large capital $\Sigma$ (sigma) signifies a summation. I think you understand that $\bar x,\bar y$ represent the averages of the respective data. Thus, for completeness, your formula should really read \begin{align*} \text{Corr}(X,Y) &= \frac{\displaystyle{\sum_{i = 1}^n(x_i-\bar x)(y_i-\bar y)}}{\displaystyle{\sqrt{\sum_{i=1}^n(x_i-\bar x)^2\times\sum_{i=1}^n(y_i-\bar y)^2}}}\\ &=\frac{(x_1-\bar x)(y_1-\bar y)+\dotsb+(x_n-\bar x)(y_n-\bar y)}{\sqrt{[(x_1-\bar x)^2+\dotsb+(x_n-\bar x)^2][(y_1-\bar y)^2+\dotsb+(y_n-\bar y)^2]}} \end{align*}
Finally, using your example (and better notation), \begin{align*} \text{Corr}(x,y)&= \frac{(3-4)(9-12)+\dotsb+(6-4)(17-12)}{\sqrt{[(3-4)^2+\dotsb+(6-4)^2][(9-12)^2+\dotsb+(17-12)^2]}}\\ &=\frac{26}{\sqrt{10\cdot68}}\\ &\approx 0.9970545 \end{align*}
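In case it helps when checking the arithmetic by hand, here are the terms hidden behind the dots, written out in full. With $\bar x = 4$ and $\bar y = 12$, the deviations are $(-1,-2,0,1,2)$ for the first list and $(-3,-5,0,3,5)$ for the second, so \begin{align*} \sum_{i=1}^5(x_i-\bar x)(y_i-\bar y) &= (-1)(-3)+(-2)(-5)+(0)(0)+(1)(3)+(2)(5) = 3+10+0+3+10 = 26\\ \sum_{i=1}^5(x_i-\bar x)^2 &= 1+4+0+1+4 = 10\\ \sum_{i=1}^5(y_i-\bar y)^2 &= 9+25+0+9+25 = 68 \end{align*}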
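In plain English: for each pair of data points, take the deviation of the $x$ value from the mean of list $x$ and multiply it by the deviation of the $y$ value from the mean of list $y$; the numerator is the sum of these products. The denominator is the square root of the sum of the squared deviations in list $x$ times the sum of the squared deviations in list $y$. What does it all mean? We're trying to measure the linear relationship between the two data lists. The result always lies between $-1$ and $1$, and a value close to $1$, as in your example, means the two lists rise and fall together almost along a straight line. You can read more here.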
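If you would like to check the hand computation with a few lines of code, here is a minimal Python sketch of the same formula. The function name `correl` is just a hypothetical label borrowed from the Excel function; it is not Excel's actual implementation, and it assumes both lists have the same length and that neither list is constant (otherwise the denominator would be zero).

```python
from math import sqrt


def correl(xs, ys):
    """Correlation coefficient of two equal-length lists (illustrative sketch)."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("both lists must be non-empty and of equal length")

    n = len(xs)
    x_bar = sum(xs) / n  # average of the first list
    y_bar = sum(ys) / n  # average of the second list

    # Numerator: sum of products of paired deviations from the means.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Denominator pieces: sums of squared deviations for each list.
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)

    # Assumption: neither list is constant, so ss_x * ss_y is not zero.
    return numerator / sqrt(ss_x * ss_y)


data1 = [3, 2, 4, 5, 6]
data2 = [9, 7, 12, 15, 17]
r = correl(data1, data2)
print(round(r, 7))  # 0.9970545
```

Each intermediate variable corresponds to one piece of the formula, so you can print `numerator`, `ss_x` and `ss_y` inside the function to compare against 26, 10 and 68 from the worked example.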