When performing feature selection using the estimated mutual information between a class C and a feature U (both binary), we need to estimate joint probabilities such as P(C=1, U=1). This site claims that the maximum likelihood estimate of this probability is:
(# of documents where C=1 and U=1) / (# total documents)
Why is this the maximum likelihood estimate? Is it because we assume C and U are each Bernoulli, and it follows that the pair (C, U) has a categorical distribution over the four outcomes (0,0), (0,1), (1,0), (1,1)? Or is the categorical distribution of the pair (C, U) a separate assumption, made in addition to assuming C and U are each Bernoulli? I know that the MLE for a categorical distribution is (# of events in category k) / (# of total events).
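
To make the setup concrete, here is a minimal sketch of how I read that estimate: each document is treated as one draw of the pair (C, U), and the estimate for each of the four outcomes is its count divided by the number of documents. The toy data and the name `joint_estimates` are made up for illustration and are not from the site.

```python
from collections import Counter

# Hypothetical toy data: one (C, U) pair per document, both binary.
docs = [(1, 1), (1, 0), (0, 1), (1, 1), (0, 0), (0, 0), (1, 1), (0, 1)]

def joint_estimates(pairs):
    """Count-based estimate of P(C=c, U=u) for all four (c, u) outcomes."""
    n = len(pairs)
    counts = Counter(pairs)  # missing outcomes count as 0
    return {(c, u): counts[(c, u)] / n for c in (0, 1) for u in (0, 1)}

p_hat = joint_estimates(docs)
print(p_hat[(1, 1)])  # 3 of the 8 documents have C=1 and U=1, so 0.375
```

The four estimates sum to 1, which is what I would expect if the pair (C, U) is treated as a single categorical variable with four categories.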