Apologies if this is a very basic question, but I'm finding it hard to answer without a bit of help.
I have a set of labels (n=7162) that are classified in different categories (n=30). This is the background distribution of labels and looks like this:
Then I have a sample in which not all the labels (and classes) are observed. The number on top of each bar is the percentage of the class's background labels that were observed; a bar with no number means all the labels for that class were observed, i.e., 100%:
What I would like to understand is:
- is the distribution of observed labels per class significantly different from the background distribution of labels in classes? (ie, is the sample biased towards or against some classes)
- is any class significantly under- or over-represented in the sample? (for example, class 'X' contains 2275 possible labels, but only 1760 (77.4%) were observed; is that significant? What about class 'F', which contains only 2 possible labels, neither of which was observed?)
This is the data used:
category,background,sample
A,7,7.0
B,383,318.0
C,53,53.0
D,19,18.0
E,26,26.0
F,2,0.0
G,234,231.0
H,94,87.0
I,4,4.0
J,76,76.0
K,180,175.0
L,177,177.0
M,10,10.0
N,553,538.0
O,1171,1082.0
P,252,210.0
Q,48,36.0
R,130,130.0
S,79,76.0
T,428,384.0
U,1,1.0
V,6,6.0
W,12,6.0
X,2275,1760.0
Y,510,504.0
Z,11,9.0
AA,207,202.0
AB,7,3.0
AC,24,24.0
AD,183,178.0
Total labels in categories: 7162
Observed labels in categories: 6331
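
To make the second question concrete, here is a minimal sketch of one way it could be framed. The framing is my assumption, not necessarily the right test: treat the 6331 observed labels as a draw without replacement from the 7162 background labels, and compute hypergeometric tail probabilities for each class. The function name `hypergeom_tails` and the `results` dictionary are just illustrative names. The counts are copied from the table above, with the sample column written as integers.

```python
from math import comb

# (category, background, sample) copied from the table above
ROWS = [
    ("A", 7, 7), ("B", 383, 318), ("C", 53, 53), ("D", 19, 18),
    ("E", 26, 26), ("F", 2, 0), ("G", 234, 231), ("H", 94, 87),
    ("I", 4, 4), ("J", 76, 76), ("K", 180, 175), ("L", 177, 177),
    ("M", 10, 10), ("N", 553, 538), ("O", 1171, 1082), ("P", 252, 210),
    ("Q", 48, 36), ("R", 130, 130), ("S", 79, 76), ("T", 428, 384),
    ("U", 1, 1), ("V", 6, 6), ("W", 12, 6), ("X", 2275, 1760),
    ("Y", 510, 504), ("Z", 11, 9), ("AA", 207, 202), ("AB", 7, 3),
    ("AC", 24, 24), ("AD", 183, 178),
]

N_TOTAL = sum(bg for _, bg, _ in ROWS)      # all background labels
N_SAMPLE = sum(obs for _, _, obs in ROWS)   # all observed labels


def hypergeom_tails(N, K, n, k):
    """Return (P(X <= k), P(X >= k)) for X ~ Hypergeometric(N, K, n).

    N: population size, K: labels in the class,
    n: number of labels drawn, k: labels of the class that were drawn.
    Uses exact integer arithmetic, so it is slow but has no rounding
    issues until the final division.
    """
    lo = max(0, n - (N - K))   # smallest possible count for the class
    hi = min(K, n)             # largest possible count for the class
    denom = comb(N, n)
    lower = sum(comb(K, i) * comb(N - K, n - i) for i in range(lo, k + 1))
    upper = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, hi + 1))
    return lower / denom, upper / denom


# category -> (percent observed, P(as few or fewer), P(as many or more))
results = {}
for cat, bg, obs in ROWS:
    p_under, p_over = hypergeom_tails(N_TOTAL, bg, N_SAMPLE, obs)
    results[cat] = (100.0 * obs / bg, p_under, p_over)

for cat, (pct, p_under, p_over) in results.items():
    print(f"{cat:>2}  {pct:5.1f}%  P(<=obs)={p_under:.3g}  P(>=obs)={p_over:.3g}")
```

If this framing is reasonable, I assume the 30 per-class p-values would still need some correction for multiple comparisons, and I am not sure whether it addresses the first question (the overall distribution) at all.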

