Correct way to evaluate correlation of a computer model with multiple human annotator scores

I have posted this question to CrossValidated without luck. If anyone from this community can give some insights, I would be really grateful. Assume we have 3 annotators, each of whom has assessed the quality of 3 products on a scale from 1 to 7.

ANN  PRODUCT  SCORE
an1  pr1      5
an1  pr2      2
an1  pr3      3
an2  pr1      7
an2  pr2      1
an2  pr3      2
an3  pr1      3
an3  pr2      3
an3  pr3      4

We also have a computer model that makes predictions for the same products using a number of features.

pr1  0.70
pr2  0.25
pr3  0.35

There are two ways to calculate the correlation of the model's scores with the human scores.

  1. First average the human scores, and then get the correlation with model's scores

    PRODUCT  ANN.SCORE  MODEL SCORE
    pr1      (5+7+3)/3  0.70
    pr2      (2+1+3)/3  0.25
    pr3      (3+2+4)/3  0.35
    
  2. Repeat the model's score for every annotator and product, as follows:

    ANN  PRODUCT  ANN.SCORE  MODEL SCORE
    an1  pr1      5          0.70
    an1  pr2      2          0.25
    an1  pr3      3          0.35
    an2  pr1      7          0.70
    an2  pr2      1          0.25
    an2  pr3      2          0.35
    an3  pr1      3          0.70
    an3  pr2      3          0.25
    an3  pr3      4          0.35
    

    and then get the correlation.

My question is, which method makes more sense from a statistical point of view? What are the actual differences between the two ways of measuring the correlation? Thank you in advance!

There is 1 best solution below

The first approach makes more sense from a statistical point of view. A correlation coefficient relates two variables measured once per item, here the model score and the human score for each product, and it assumes that the observations are independent of each other. Averaging the values given by the three annotators for each product gives an estimate of the human score that can be directly related to the model score, reducing the confounding effect of between-annotator variability. Although in this case each product has the same number of observations, taking the mean of multiple measurements within each item is a classical procedure, and it is also the basis of "weighted" correlation, which is commonly used when the number of observations per item differs. However, it should also be noted that the meaning of a correlation coefficient calculated over so few observations (three products in this case) is rather questionable.

The second approach is more controversial. In correlation studies with repeated-measures designs, both of the variables being correlated are usually measured in multiple observations. In this case, only the human score was measured multiple times, while the same model score is simply copied across annotators, so the nine rows are not independent observations. The resulting correlation coefficient would also be confounded by the between-annotator variability of the human scores.
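To make the difference between the two approaches concrete, here is a short derivation for the Pearson coefficient. It assumes a balanced design (every one of the $A$ annotators rates every product) and uses my own notation, not the asker's: $x_{ap}$ is the score of annotator $a$ for product $p$, $\bar{x}_p$ is the mean human score for product $p$, $\bar{x}$ is the grand mean, and $r_1$, $r_2$ are the coefficients from the first and second approach.

$$
r_2 \;=\; r_1 \,\sqrt{\frac{A \sum_{p} (\bar{x}_p - \bar{x})^2}{\sum_{a}\sum_{p} (x_{ap} - \bar{x})^2}}
$$

Repeating the model score leaves the covariance with the human scores and the model's own variance unchanged up to the common factor $A$, but the human-score sum of squares now includes the disagreement between annotators on top of the differences between products. The ratio under the square root is therefore at most 1, so $|r_2| \le |r_1|$, with equality only when all annotators agree exactly. For the data in the question the ratio is $14/26$, so the second approach shrinks the coefficient by a factor of about $0.73$.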

Lastly, it would also be important to know whether the model scores are fixed or whether they vary from run to run. Some computer models include a certain degree of variability, and in that case it should be treated as being as important as the variability of the human scores.
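As a rough illustration, the two approaches from the question can be compared numerically on the posted data. This is only a minimal sketch: it assumes the Pearson coefficient is the correlation of interest (the question does not say which one), and the helper `pearson` and the variable names are mine.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Data copied from the question.
scores = {
    "an1": {"pr1": 5, "pr2": 2, "pr3": 3},
    "an2": {"pr1": 7, "pr2": 1, "pr3": 2},
    "an3": {"pr1": 3, "pr2": 3, "pr3": 4},
}
model = {"pr1": 0.70, "pr2": 0.25, "pr3": 0.35}
products = sorted(model)

# Approach 1: average the annotators per product, then correlate (3 points).
mean_human = [sum(scores[a][p] for a in scores) / len(scores) for p in products]
r_mean = pearson(mean_human, [model[p] for p in products])

# Approach 2: one row per (annotator, product), model score repeated (9 rows).
long_human = [scores[a][p] for a in scores for p in products]
long_model = [model[p] for a in scores for p in products]
r_long = pearson(long_human, long_model)

print(f"approach 1 (averaged): r = {r_mean:.3f}")
print(f"approach 2 (repeated): r = {r_long:.3f}")
```

The second coefficient comes out lower than the first because the disagreement between annotators is counted as extra variance in the human scores. With only three products, neither number should be over-interpreted.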