$M$ samples are drawn from a discrete probability distribution $\bigl( p_i>0; \; \sum_{i=1}^{N} p_i = 1 \bigr)$. The number of times outcome $i$ has occurred is given by $M_i$ (so that $\sum_{i=1}^N M_i = M$). The MLE probabilities are given by $\hat{p}_i = \frac{M_i}{M}$, and the empirical (plug-in) entropy by $\hat{S}(\hat{p}) = -\sum_{i=1}^{N} \hat{p}_i \log(\hat{p}_i)$. What are the mean and the variance of $\hat{S}$?
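For concreteness, here is a small Monte Carlo sketch of what is being asked: draw multinomial samples, form $\hat{S}$, and compare its empirical mean and variance with the first-order asymptotics I am aware of (the Miller–Madow bias $-(N-1)/(2M)$ and the delta-method variance $\bigl(\sum_i p_i \log^2 p_i - S^2\bigr)/M$). The distribution and sample sizes are arbitrary illustrative choices, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)

def plug_in_entropy(counts):
    """Plug-in (MLE) entropy -sum p_hat log p_hat, skipping empty bins."""
    M = counts.sum()
    p_hat = counts[counts > 0] / M
    return -np.sum(p_hat * np.log(p_hat))

# Illustrative distribution and sample size (arbitrary choices).
p = np.array([0.5, 0.25, 0.125, 0.125])
M, trials = 1000, 20000

S_hat = np.array([plug_in_entropy(rng.multinomial(M, p))
                  for _ in range(trials)])

S_true = -np.sum(p * np.log(p))
# First-order (delta-method) variance: (sum p_i log^2 p_i - S^2) / M
var_approx = (np.sum(p * np.log(p) ** 2) - S_true ** 2) / M

print("true S                  :", S_true)
print("mean of S_hat           :", S_hat.mean())   # biased low
print("Miller-Madow bias ~     :", (len(p) - 1) / (2 * M))
print("variance of S_hat       :", S_hat.var())
print("delta-method variance   :", var_approx)
```

In this sketch the simulated mean sits below $S$ by roughly $(N-1)/(2M)$ and the simulated variance matches the delta-method value closely, which is consistent with the standard first-order expansions.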
Note: As requested in the comments, I will try to give some context; it is not part of the question, but it may be helpful. This is not an exercise from a class, so I am not sure whether nice closed-form expressions for the mean and variance of $\hat{S}$ exist. I am interested in model-free data analysis; specifically, I want to check by trial and error whether the naive sample generalized mutual information
$\hat{I}(X:Y:Z...) = -\hat{S}(XYZ...) + \hat{S}(X) + \hat{S}(Y) + \hat{S}(Z) + ...$
can be used as a coarse preliminary test $(\hat{I} > \mathrm{const}(M,N))$ of whether multidimensional experimental data has any structure at all, or is indistinguishable from a pile of independent 1D data sets. Naively, the estimator makes sense: the joint entropy equals the sum of the 1D entropies exactly when the variables are independent, so $I$ vanishes in that case and is positive in all others. Of course, the estimator will suffer from the finite amount of data, an under-explored phase space, bias, and so on. So, as a first step in this direction, I would like to work out its variance, which should be straightforward if I know the variance of $\hat{S}$.
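For the two-variable case, the kind of test I have in mind can be sketched as follows (binning, bin counts, and the Gaussian toy data are my own illustrative assumptions, not part of the question). The plug-in $\hat{I}$ is strictly positive even for independent data, which is exactly the bias that makes the threshold $\mathrm{const}(M,N)$ necessary:

```python
import numpy as np

rng = np.random.default_rng(1)

def plug_in_entropy(counts):
    """Plug-in entropy from a count array, skipping empty bins."""
    M = counts.sum()
    p = counts[counts > 0] / M
    return -np.sum(p * np.log(p))

def naive_mi(x, y, bins):
    """Plug-in I_hat = S_hat(X) + S_hat(Y) - S_hat(X,Y) from binned samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    return (plug_in_entropy(joint.sum(axis=1))    # marginal of X
            + plug_in_entropy(joint.sum(axis=0))  # marginal of Y
            - plug_in_entropy(joint.ravel()))     # joint entropy

M, bins = 2000, 10
x = rng.standard_normal(M)
y_ind = rng.standard_normal(M)            # independent of x
y_dep = x + 0.5 * rng.standard_normal(M)  # strongly dependent on x

print("I_hat, independent data:", naive_mi(x, y_ind, bins))  # small, but > 0
print("I_hat, dependent data  :", naive_mi(x, y_dep, bins))  # clearly larger
```

Even for independent samples, $\hat{I}$ comes out positive, of order $(N_x-1)(N_y-1)/(2M)$ by the same first-order bias argument as for $\hat{S}$; the dependent case sits far above that, which is why a threshold depending on $M$ and $N$ seems plausible.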