Test whether the observed distribution of labels differs significantly from the background

Apologies if this is a very basic question, but I'm finding it hard to answer without a bit of help.

I have a set of labels (n=7162), each classified into one of 30 categories. This is the background distribution of labels, and it looks like this:

[Figure: background distribution of labels in classes]

Then I have a sample in which not all the labels (and classes) are observed. The number on top of each bar is the percentage of the class's background labels that were observed; a bar without a number means all the labels for that class were observed (i.e., 100%):

[Figure: observed labels per class in the sample, with the percentage observed shown above each bar]

What I would like to understand is:

  1. Is the distribution of observed labels per class significantly different from the background distribution of labels in classes? (i.e., is the sample biased towards or against some classes?)
  2. Is any class significantly under- or over-represented in the sample? For example, class 'X' contains 2275 possible labels but only 1760 (77.4%) were observed; is that significant? What about class 'F', which contains only 2 possible labels, none of which were observed?
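To make question 2 concrete, here is a minimal sketch of one framing I could imagine. It rests on an assumption of mine, not something I know to be the right test: that the 6331 observed labels are a draw without replacement from the 7162 background labels, so that the observed count in a class follows a hypergeometric distribution. The helper `hypergeom_lower_tail` is a hypothetical name, and it only gives the one-sided "under-represented" tail, with no correction for testing 30 classes at once.

```python
from math import comb


def hypergeom_lower_tail(N, K, n, k):
    """P(X <= k) when n labels are drawn without replacement from N labels,
    K of which belong to the class of interest (assumed sampling model)."""
    lo = max(0, n - (N - K))  # smallest count that is possible at all
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(lo, k + 1))
    return favourable / comb(N, n)  # exact big-integer ratio, then one float


N, n = 7162, 6331  # background labels, observed labels

# Class X: 2275 background labels, 1760 observed
p_x = hypergeom_lower_tail(N, 2275, n, 1760)
# Class F: 2 background labels, 0 observed
p_f = hypergeom_lower_tail(N, 2, n, 0)

print(f"class X: P(observed <= 1760) = {p_x:.3g}")
print(f"class F: P(observed <= 0)    = {p_f:.3g}")
```

Under that assumption, a small tail probability would suggest the class is under-represented, but I am not sure whether this model is appropriate here, which is part of what I am asking.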

This is the data used:

category,background,sample
A,7,7.0
B,383,318.0
C,53,53.0
D,19,18.0
E,26,26.0
F,2,0.0
G,234,231.0
H,94,87.0
I,4,4.0
J,76,76.0
K,180,175.0
L,177,177.0
M,10,10.0
N,553,538.0
O,1171,1082.0
P,252,210.0
Q,48,36.0
R,130,130.0
S,79,76.0
T,428,384.0
U,1,1.0
V,6,6.0
W,12,6.0
X,2275,1760.0
Y,510,504.0
Z,11,9.0
AA,207,202.0
AB,7,3.0
AC,24,24.0
AD,183,178.0

Total labels in categories: 7162

Observed labels in categories: 6331
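
For convenience, the totals and the per-class percentages shown in the second plot can be reproduced from the data with a short script. This is only a sketch using the standard library; the variable names (`background`, `sample`, `pct`) are my own.

```python
import csv
import io

DATA = """category,background,sample
A,7,7.0
B,383,318.0
C,53,53.0
D,19,18.0
E,26,26.0
F,2,0.0
G,234,231.0
H,94,87.0
I,4,4.0
J,76,76.0
K,180,175.0
L,177,177.0
M,10,10.0
N,553,538.0
O,1171,1082.0
P,252,210.0
Q,48,36.0
R,130,130.0
S,79,76.0
T,428,384.0
U,1,1.0
V,6,6.0
W,12,6.0
X,2275,1760.0
Y,510,504.0
Z,11,9.0
AA,207,202.0
AB,7,3.0
AC,24,24.0
AD,183,178.0
"""

rows = list(csv.DictReader(io.StringIO(DATA)))
background = {r["category"]: int(r["background"]) for r in rows}
# Sample counts are stored as floats ("7.0"), so convert via float first
sample = {r["category"]: int(float(r["sample"])) for r in rows}
# Percentage of each class's background labels that were observed
pct = {c: 100 * sample[c] / background[c] for c in background}

print("total background:", sum(background.values()))
print("total observed:  ", sum(sample.values()))
for c in background:
    if sample[c] < background[c]:  # classes not fully observed
        print(f"{c:>2}: {sample[c]}/{background[c]} = {pct[c]:.1f}%")
```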