Pearson correlation


Exercise

Hey guys,

as you can see in the image, I have a table and some tasks given. I have finished the first two but I have difficulties with the third one. I understand the formula of the Pearson correlation, but I do not know how to correctly apply it to this task.

My idea was to take the average rating of each user and then take each movie from the user and calculate the correlation.

For the first two users, the numerator would be: ((5-2.4)(3-1.6)+(1-2.4)(0-1.6)+...)

Is my idea correct? Any help is appreciated.

Best answer

From Wikipedia:
Pearson's correlation coefficient when applied to a sample is [...] defined as

$$r_{xy} =\frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum ^n _{i=1}(x_i - \bar{x})^2} \sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2}}.$$
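If you want to check your hand calculations, the formula translates almost line by line into code. This is only a minimal sketch in plain Python; the function name `pearson` and the choice to return `None` when the denominator is zero are my own assumptions, not part of the exercise.

```python
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists of numbers.

    Returns None when the denominator is zero (at least one sample is
    constant), because the coefficient is not defined in that case.
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Numerator: sum of products of the deviations from the means.
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    # Denominator: product of the square roots of the summed squared deviations.
    denominator = sqrt(sum((a - mean_x) ** 2 for a in x)) * sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )

    if denominator == 0:
        return None
    return numerator / denominator
```

Call it with two users' rating columns, e.g. `pearson(ratings_t, ratings_f)`, where both lists hold the ratings in the same movie order.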

Let $T$, $F$, $A$, $L$ denote Tony, ..., Leia respectively.

Case 1. We treat the missing values as $0$s.
Then, by some calculations (feel free to check them), we find that the highest correlation (between distinct samples) is between $T$ and $F$, with $r_{TF}=r_{FT}\approx 0.601$. The smallest correlation coefficient is $r_{FL}=r_{LF}\approx -0.880$.

Case 2. We ignore all rows that include missing values.
Then we get the strange situation that $F$ and $L$ are identical samples (consisting only of the value $3$), so $r_{FL}$ is not well-defined. Among the remaining correlation coefficients, the largest is $r_{TL}=r_{LT}=0$ and the smallest is $r_{TF}=r_{FT}=r_{FA}=r_{AF}=-1$.

Seeing that the results in case 2 are degenerate, I guess that they wanted you to go with case 1 (but that depends on your TA).
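To compare the two ways of handling missing values yourself, something like the following sketch may help. The `ratings` table below is made-up illustration data, not the table from your exercise, and all names (`ratings`, `zeros_for_missing`, `drop_incomplete_rows`, `all_pairs`) are hypothetical. Missing ratings are encoded as `None`.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation; None if a sample is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = sqrt(sum((a - mean_x) ** 2 for a in x)) * sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    return None if denominator == 0 else numerator / denominator


# Hypothetical ratings (one list per user, one entry per movie).
# None marks a missing rating. This is NOT the exercise's table.
ratings = {
    "u1": [5, 1, None, 2, 4],
    "u2": [3, None, 3, 1, 5],
    "u3": [1, 4, 2, None, 2],
}


def zeros_for_missing(table):
    """Case 1: treat every missing rating as 0."""
    return {user: [0 if r is None else r for r in rs] for user, rs in table.items()}


def drop_incomplete_rows(table):
    """Case 2: keep only the movies that every user has rated."""
    users = list(table)
    n_movies = len(table[users[0]])
    keep = [
        i for i in range(n_movies) if all(table[u][i] is not None for u in users)
    ]
    return {u: [table[u][i] for i in keep] for u in users}


def all_pairs(table):
    """Correlation for every unordered pair of distinct users."""
    return {(a, b): pearson(table[a], table[b]) for a, b in combinations(table, 2)}


case1 = all_pairs(zeros_for_missing(ratings))
case2 = all_pairs(drop_incomplete_rows(ratings))
```

With this made-up table only two movies survive in case 2, and with two data points every coefficient is either $\pm 1$ or undefined, which is the same kind of degeneracy as above.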