I have posted this question to CrossValidated without luck. If anyone from this community can give some insights, I would be really grateful. Assume we have 3 annotators, each of whom has assessed the quality of 3 products on a scale from 1 to 7.
ANN PRODUCT SCORE
an1 pr1 5
an1 pr2 2
an1 pr3 3
an2 pr1 7
an2 pr2 1
an2 pr3 2
an3 pr1 3
an3 pr2 3
an3 pr3 4
We also have a computer model that makes predictions for the same products using a number of features.
pr1 0.70
pr2 0.25
pr3 0.35
There are two ways to calculate the correlation of the model's scores with the human scores.
1. First average the human scores, and then get the correlation with model's scores
PRODUCT ANN.SCORE MODEL.SCORE
pr1 (5+7+3)/3 0.70
pr2 (2+1+3)/3 0.25
pr3 (3+2+4)/3 0.35
2. Repeat the model's score for every annotator and product, as follows:
ANN PRODUCT ANN.SCORE MODEL.SCORE
an1 pr1 5 0.70
an1 pr2 2 0.25
an1 pr3 3 0.35
an2 pr1 7 0.70
an2 pr2 1 0.25
an2 pr3 2 0.35
an3 pr1 3 0.70
an3 pr2 3 0.25
an3 pr3 4 0.35
and then get the correlation.
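The two calculations described above can be sketched in a few lines of Python. This is only an illustration using the numbers from the tables; the helper `pearson` and the variable names are made up for this sketch, and Pearson's coefficient is assumed since the post does not say which correlation is meant.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Scores from the tables above: annotator -> product -> score
human = {
    "an1": {"pr1": 5, "pr2": 2, "pr3": 3},
    "an2": {"pr1": 7, "pr2": 1, "pr3": 2},
    "an3": {"pr1": 3, "pr2": 3, "pr3": 4},
}
model = {"pr1": 0.70, "pr2": 0.25, "pr3": 0.35}
products = sorted(model)

# Method 1: average the annotators per product, then correlate (3 pairs)
mean_human = [sum(human[a][p] for a in human) / len(human) for p in products]
r_averaged = pearson(mean_human, [model[p] for p in products])

# Method 2: repeat the model score for every annotator (9 pairs)
pooled_human = [human[a][p] for a in human for p in products]
pooled_model = [model[p] for a in human for p in products]
r_pooled = pearson(pooled_human, pooled_model)

print(f"method 1 (averaged): r = {r_averaged:.3f}")
print(f"method 2 (pooled):   r = {r_pooled:.3f}")
```

On this toy data the averaged version gives a coefficient of about 0.99 and the pooled version about 0.73, so the two methods really do give different answers.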
My question is, which method makes more sense from a statistical point of view? What are the actual differences between the two ways of measuring the correlation? Thank you in advance!
The first approach makes more sense from a statistical point of view. Correlation describes the relationship between two variables measured on the same set of independent observations; here the observations are the products, and the two variables are the model score and the human score. Averaging the values given by the three annotators for each product gives an estimate of the human score that can be related directly to the model score, reducing the confounding effect of between-annotator variability. Although in this case each product has the same number of observations, taking the mean of multiple measurements within each item is a classical procedure that is commonly used when calculating a "weighted" correlation, particularly when the number of observations per item differs. However, it should also be noted that a correlation coefficient calculated over so few observations, as in this case, is of rather questionable meaning.
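To make the small-sample caveat concrete, here is a rough sketch of the usual t-test for a Pearson coefficient applied to the three averaged scores. It assumes Pearson correlation and the standard test, which itself assumes roughly bivariate-normal data; neither is stated in the question. The closed-form p-value works only because n = 3 leaves one degree of freedom, where Student's t is the standard Cauchy distribution. The names `pearson`, `mean_human` and `p_two_sided` are illustrative.

```python
from math import atan, pi, sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Averaged human scores per product (pr1, pr2, pr3) and the model scores
mean_human = [(5 + 7 + 3) / 3, (2 + 1 + 3) / 3, (3 + 2 + 4) / 3]
model = [0.70, 0.25, 0.35]

r = pearson(mean_human, model)
n = len(model)

# Test statistic for H0: no correlation, with n - 2 degrees of freedom
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)

# With n = 3 there is 1 degree of freedom, where the t distribution is the
# standard Cauchy: CDF(t) = 1/2 + atan(t)/pi, hence this two-sided p-value.
# This shortcut is NOT valid for other sample sizes.
p_two_sided = 1 - 2 * atan(abs(t)) / pi

print(f"r = {r:.3f}, t = {t:.2f}, two-sided p = {p_two_sided:.3f}")
```

Even with a coefficient close to 1, the p-value here stays above the conventional 0.05 threshold (roughly 0.08), which is one way of seeing why three products cannot support a strong conclusion.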
The second approach is more controversial. In studies of correlation in repeated-measures designs, both variables of interest are usually measured repeatedly. In this case, only the human score was measured multiple times. Repeating the model score also treats the nine rows as independent observations, although only three products were scored. In addition, the resulting correlation coefficient would be confounded by the variability of the human scores.
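The confounding can be made explicit on the example data. When every product has the same number of annotators, the pooled Pearson coefficient equals the averaged one multiplied by the square root of the share of human-score variance that lies between products, so annotator disagreement pulls it toward zero. The sketch below checks this numerically and also prints one coefficient per annotator; it assumes Pearson correlation, and all names are illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# product -> scores from an1, an2, an3 (same data as in the question)
human = {"pr1": [5, 7, 3], "pr2": [2, 1, 3], "pr3": [3, 2, 4]}
model = {"pr1": 0.70, "pr2": 0.25, "pr3": 0.35}
products = sorted(model)
k = 3  # annotators per product (balanced design)

means = [sum(human[p]) / k for p in products]
grand = sum(means) / len(means)

# Split the variation of the human scores into between-product and total
ss_between = k * sum((m - grand) ** 2 for m in means)
ss_total = sum((s - grand) ** 2 for p in products for s in human[p])
share_between = ss_between / ss_total

r_averaged = pearson(means, [model[p] for p in products])
r_pooled = pearson(
    [s for p in products for s in human[p]],
    [model[p] for p in products for _ in human[p]],
)

# One coefficient per annotator, to show how much the annotators disagree
r_by_annotator = [
    pearson([human[p][i] for p in products], [model[p] for p in products])
    for i in range(k)
]

print(f"share of variance between products: {share_between:.3f}")
print(f"r_averaged * sqrt(share) = {r_averaged * sqrt(share_between):.3f}")
print(f"r_pooled                 = {r_pooled:.3f}")
print("per annotator:", [round(r, 3) for r in r_by_annotator])
```

Here two annotators track the model closely while the third is slightly negatively correlated with it, and the pooled coefficient blends that disagreement into a single number.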
Lastly, it would also be important to know whether the model scores are fixed or whether they vary as well. Some computer models have a degree of randomness in their output, in which case that variability should be treated as seriously as the variability of the human scores.