Approximate overall mean and stdev of datasets with varying length

Suppose I have a list of $n$ datasets $D_i = (d_{i1},\ldots,d_{i\ell_i})$ of which I know

  • the length $\ell_i$,
  • the mean $\mu_i = \frac{1}{\ell_i} \sum\limits_{j=1}^{\ell_i} d_{ij}$,
  • the standard deviation $\sigma_i = \sqrt{\frac{1}{\ell_i} \sum\limits_{j=1}^{\ell_i} (d_{ij} - \mu_i)^2}$,
  • the maximum $b_i$, and
  • the minimum $a_i$.

Let $D'$ be all datasets (hypothetically) merged together, so $$ D' = (d'_1, \ldots, d'_L) = (d_{11},\ldots,d_{1\ell_1}, d_{21}, \ldots,d_{2\ell_2}, \ldots, d_{n1},\ldots,d_{n\ell_n}), $$ where $L = \sum\limits_{i=1}^{n} \ell_i$.

I would like to compute the overall mean $\mu = \frac{1}{L} \sum\limits_{k=1}^{L} d'_k $ and the overall standard deviation $\sigma = \sqrt{\frac{1}{L} \sum\limits_{k=1}^{L} (d'_k - \mu)^2}$. Since I cannot access $D'$ itself, I cannot evaluate these definitions directly, and I suspect that it is also not possible in general to compute them from the summary data listed above.

However, I would still like to approximate $\mu$ and $\sigma$ as well as possible from the summary data listed above, via something like $$ \mu \approx \hat{\mu} := \frac{1}{n} \sum\limits_{i=1}^{n} \mu_i. $$

I think that if all lengths $\ell_i$ were the same, we would have $\mu = \hat{\mu}$, but in general this is not the case.
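To make the effect of unequal lengths concrete, here is a minimal Python sketch on toy data. The data and the function names `unweighted_mean` and `length_weighted_mean` are hypothetical, chosen only for illustration. It compares the plain average $\hat{\mu}$ with an average in which each $\mu_i$ is weighted by $\ell_i$, and with the mean of the merged data.

```python
from statistics import fmean

def unweighted_mean(means):
    """Plain average of the per-dataset means (the estimate mu-hat)."""
    return fmean(means)

def length_weighted_mean(lengths, means):
    """Average of the per-dataset means, each weighted by its length."""
    total = sum(lengths)
    return sum(l * m for l, m in zip(lengths, means)) / total

# Hypothetical toy data with unequal lengths (illustration only).
datasets = [[1.0, 2.0, 3.0], [10.0, 20.0]]
lengths = [len(d) for d in datasets]
means = [fmean(d) for d in datasets]
merged = [x for d in datasets for x in d]

print("unweighted mean of means:", unweighted_mean(means))
print("length-weighted mean:    ", length_weighted_mean(lengths, means))
print("mean of merged data:     ", fmean(merged))
```

Since $\sum_{j} d_{ij} = \ell_i \mu_i$, the length-weighted version reproduces the sum over the merged data divided by $L$, so on this toy data it should agree with the merged mean up to rounding, while the unweighted version does not.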

So my question is: Are the unweighted means $\hat{\mu}$ of the $\mu_i$ and $\hat{\sigma} := \frac{1}{n} \sum\limits_{i=1}^{n} \sigma_i$ of the $\sigma_i$ good approximations of $\mu$ and $\sigma$? Is there a good bound on the errors $\vert\hat{\mu}-\mu\vert$ and $\vert\hat{\sigma}-\sigma\vert$? Are there better approximations?
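As a point of comparison for the standard deviation, here is a minimal Python sketch under the population ($1/\ell_i$) definitions used above. It relies on the standard decomposition $\sigma^2 = \frac{1}{L}\sum_i \ell_i\left(\sigma_i^2 + (\mu_i - \mu)^2\right)$ with $\mu = \frac{1}{L}\sum_i \ell_i \mu_i$. The function name `pooled_mean_and_std` and the toy data are hypothetical, and the minima $a_i$ and maxima $b_i$ are not used.

```python
import math
from statistics import fmean, pstdev

def pooled_mean_and_std(lengths, means, stds):
    """Combine per-dataset summaries into an overall mean and population stdev."""
    total = sum(lengths)
    mu = sum(l * m for l, m in zip(lengths, means)) / total
    # Within-dataset variance plus squared offset of each mean from mu.
    var = sum(l * (s * s + (m - mu) ** 2)
              for l, m, s in zip(lengths, means, stds)) / total
    return mu, math.sqrt(var)

# Hypothetical toy data (illustration only).
datasets = [[1.0, 2.0, 3.0], [10.0, 20.0], [4.0, 4.0, 5.0, 7.0]]
lengths = [len(d) for d in datasets]
means = [fmean(d) for d in datasets]
stds = [pstdev(d) for d in datasets]   # population stdev, matching sigma_i
merged = [x for d in datasets for x in d]

mu, sigma = pooled_mean_and_std(lengths, means, stds)
print("combined mean: ", mu, " merged mean: ", fmean(merged))
print("combined stdev:", sigma, " merged stdev:", pstdev(merged))
print("plain average of the sigma_i:", fmean(stds))
```

On this toy data the combined values should match the statistics of the merged list up to rounding, whereas the plain average of the $\sigma_i$ ignores the spread between the $\mu_i$ and comes out noticeably smaller.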