Are the mean and variance of a set equal to the sum of its means and variances?

I'm not very strong in statistics, and I found some conflicting answers on the internet, so I'm asking here.

I have a set $A$ with 10439 samples. The values are not unique; many of them occur more than once. I want to calculate the mean and the variance of the set, but there is a limitation: I can't process the whole set at once, so I have to split it into $N$ batches. Let's say I have $N=73$ batches $A_{1}, \ldots, A_{73}$ created from the big set $A$. Do the following hold?

  1. $\operatorname{mean}(A) = \frac{\operatorname{mean}(A_{1}) + \cdots + \operatorname{mean}(A_{73})}{N}$
  2. $\operatorname{var}(A) = \frac{\operatorname{var}(A_{1}) + \cdots + \operatorname{var}(A_{73})}{N}$

If not, how can I achieve my goal using the batches?

Best answer

Statement 1 and Statement 2 are both untrue in general.

Statement 1 is not true because the batch means are not weighted by the sizes of the batches. For example, if $A_1 = \{1\}$ and $A_2 = \{2,2,2,2\}$ then $\operatorname{mean}(A_1 \cup A_2) = \frac{1+2+2+2+2}{5} = 1.8$ but $\tfrac{\operatorname{mean}(A_1) + \operatorname{mean}(A_2)}{2} = \frac{1+2}{2} = 1.5$. The correct formula is $$ \operatorname{mean}(A) = \frac{|A_1|\operatorname{mean}(A_1) + |A_2|\operatorname{mean}(A_2) + \cdots + |A_N|\operatorname{mean}(A_N)}{|A_1|+|A_2|+\cdots+|A_N|} $$ where $|A_i|$ is the number of elements in the batch $A_i$. To see this, expand $$ \operatorname{mean}(A) = \frac{1}{|A|}\sum_{a\in A}a = \frac{1}{|A|}\biggl(\frac{|A_1|}{|A_1|}\sum_{a\in A_1}a + \frac{|A_2|}{|A_2|}\sum_{a\in A_2}a + \cdots + \frac{|A_N|}{|A_N|}\sum_{a\in A_N}a\biggr) $$ and observe that $\tfrac{1}{|A_i|}\sum_{a \in A_i} a$ is exactly $\operatorname{mean}(A_i)$, while $|A| = |A_1|+|A_2|+\cdots+|A_N|$. When all batches have the same size, the weights cancel and Statement 1 does hold.
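The weighted formula can be checked with a minimal Python sketch. The function name `combined_mean` and the list-of-lists representation of the batches are assumptions made for illustration only; the sketch also assumes that every batch is non-empty and that the batches together contain each sample of $A$ exactly once.

```python
def combined_mean(batches):
    """Mean of the full data set from per-batch means weighted by batch size."""
    total_count = 0
    weighted_sum = 0.0
    for batch in batches:
        batch_mean = sum(batch) / len(batch)      # mean(A_i)
        weighted_sum += len(batch) * batch_mean   # |A_i| * mean(A_i)
        total_count += len(batch)                 # |A_1| + ... + |A_N|
    return weighted_sum / total_count


# The two batches from the example: A_1 = {1}, A_2 = {2, 2, 2, 2}
batches = [[1], [2, 2, 2, 2]]

# Weighted combination of the batch means
print(combined_mean(batches))

# Naive unweighted average of the batch means (Statement 1)
print(sum(sum(b) / len(b) for b in batches) / len(batches))
```

For these two batches the first line prints 1.8 and the second prints 1.5, matching the numbers in the example.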

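Statement 2 is untrue for a similar reason: it ignores the batch sizes, and it also ignores how far the individual batch means lie from $\operatorname{mean}(A)$. One way to compute the correct $\operatorname{var}(A)$ in batches is to compute the sum of squares $s_i = \sum_{a\in A_i}a^2$ for every batch $i=1,2,\ldots,N$. Then $$\operatorname{var}(A) = \frac{s_1+s_2+\cdots+s_N}{|A_1| + |A_2| + \cdots + |A_N|} - \operatorname{mean}(A)^2,$$ where $\operatorname{mean}(A)$ is computed as above. This is the population variance (dividing by $|A|$); for the sample variance, multiply the result by $\tfrac{|A|}{|A|-1}$.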
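A minimal Python sketch of this batch-wise computation follows. The helper names `batch_summary` and `combined_mean_and_variance` are hypothetical, chosen for illustration. Each batch is reduced to its count, its sum, and its sum of squares, so only three numbers per batch need to be kept. The result is the population variance, and the sketch assumes the batches together contain each sample exactly once. The cross-check uses Python's standard `statistics` module on the full data, which is possible here only because the example is tiny.

```python
import statistics


def batch_summary(batch):
    """Return (count, sum, sum of squares) for one batch."""
    return len(batch), sum(batch), sum(x * x for x in batch)


def combined_mean_and_variance(summaries):
    """Combine per-batch summaries into the overall mean and population variance."""
    n = sum(count for count, _, _ in summaries)         # |A_1| + ... + |A_N|
    total = sum(s for _, s, _ in summaries)             # sum of all samples
    total_sq = sum(sq for _, _, sq in summaries)        # s_1 + ... + s_N
    mean = total / n
    variance = total_sq / n - mean ** 2                 # population variance
    return mean, variance


batches = [[1], [2, 2, 2, 2]]
summaries = [batch_summary(b) for b in batches]
mean, variance = combined_mean_and_variance(summaries)

# Cross-check against a direct computation on the whole data set
all_values = [x for b in batches for x in b]
print(mean, variance)
print(statistics.mean(all_values), statistics.pvariance(all_values))
```

Both lines should agree up to floating-point rounding. One caveat of this formula is that it subtracts two numbers that can be nearly equal when the mean is large compared to the spread, which can cost precision in floating-point arithmetic.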