I work in a field that assesses applications for benefits. We are interested in comparing outcomes between client groups to identify potential bias. We plan to anonymize applications, removing factors that might identify the client as belonging to either client group, and have decision-makers review them so that we can compare outcomes. Anonymizing applications is resource-intensive, so we intend to run the test on a small sample (20 per client group). The original idea was to have one decision-maker do the review and to perform an unpaired t-test on the results. To correct for high variability among decision-makers, it has been proposed that multiple decision-makers review the same sample. The resulting lack of independence (each application would be assessed more than once) seems to rule out both a t-test and a chi-square test.
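To make the proposed design concrete, here is a minimal sketch of the data layout it would produce. The group labels, the number of decision-makers (three), the binary approve/deny outcome, and the random placeholder values are all assumptions for illustration, not real data or a settled design.

```python
import random
from itertools import product

random.seed(0)  # fixed seed so the made-up data is reproducible

GROUPS = ("A", "B")                # hypothetical client-group labels
N_PER_GROUP = 20                   # sample size stated in the question
REVIEWERS = ("DM1", "DM2", "DM3")  # assumption: three decision-makers

# One anonymized application per (group, index) pair.
applications = [(g, i) for g in GROUPS for i in range(N_PER_GROUP)]

# Fully crossed layout: every decision-maker reviews every application.
# The 0/1 outcome (deny/approve) is random placeholder data.
rows = [
    {
        "group": g,
        "app_id": f"{g}{i:02d}",
        "reviewer": r,
        "approved": random.randint(0, 1),
    }
    for (g, i), r in product(applications, REVIEWERS)
]

n_rows = len(rows)                             # total decisions recorded
n_apps = len({row["app_id"] for row in rows})  # distinct applications
decisions_per_app = n_rows // n_apps           # repeated reviews per application

print(n_rows, n_apps, decisions_per_app)
```

With these assumed numbers there would be 120 recorded decisions but only 40 distinct applications, so rows that share an `app_id` (and rows that share a `reviewer`) are not independent observations.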
Is there a statistical test suited to such a scenario? Thank you.