Can I work out the variance in batches?


I have data divided into chunks, and because of software limitations I can only calculate the variance within each chunk. But I want the variance of the whole data set, not of the individual chunks. I know the variance is not a linear operator. I would like something like an average of the chunk variances, but it has to be the same number I would get if I calculated the variance of the whole data set at once. Example: rolling a die 6 times in 3 groups of 2 rolls, I can calculate the variance of each group; from these I want to calculate the variance of the whole set of 6 rolls. Thank you for your help.

Best answer

Refer to the following answer to this question: How do I combine standard deviations of two groups?

In particular, the final formula

$$s_z^2 = \frac{(n-1) s_x^2 + (m-1) s_y^2}{n+m-1} + \frac{nm(\bar x - \bar y)^2}{(n+m)(n+m-1)}$$

illustrates how to compute the total variance of two samples, one of size $n$, sample mean $\bar x$, and sample variance $s_x^2$, and one of size $m$, sample mean $\bar y$, and sample variance $s_y^2$. Those are the quantities you need to track. Also note that the total sample mean is given by the formula $$\bar z = \frac{n \bar x + m \bar y}{n + m}.$$ These formulas readily lend themselves to an extended calculation for any number of groups:

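  1. Set $i = 1$.
  2. Compute $n_i$, $\bar x_i$, and $s_{x_i}^2$, the sample size, sample mean, and sample variance of dataset $i$. For $i = 1$, these also serve as the initial running totals $n_T$, $\bar x_T$, and $s_T^2$.
  3. Increment $i$.
  4. Repeat Step 2 for the new dataset $i$.
  5. Use the above two formulas, with the running totals in the role of $(n, \bar x, s_x^2)$ and dataset $i$ in the role of $(m, \bar y, s_y^2)$, to compute a new $n_T = n + m$, $\bar x_T$, and $s_T^2$ representing the sample size, sample mean, and sample variance of all datasets up to set $i$.
  6. If the last dataset was used to compute the result in step 5, stop. Otherwise, go to step 3.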
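The steps above can be sketched in a few lines of Python for the scenario in the question (3 groups of 2 die rolls). This is only an illustration: the function names `summarize` and `combine` and the roll values are made up here, the data are assumed to be integers, and each chunk is assumed to hold at least two values so that its sample variance is defined. Exact fractions are used so the comparison with the direct calculation is not blurred by rounding.

```python
from fractions import Fraction
from functools import reduce


def summarize(chunk):
    """Return (size, sample mean, sample variance) of one chunk of integers."""
    n = len(chunk)
    mean = Fraction(sum(chunk), n)
    var = sum((v - mean) ** 2 for v in chunk) / (n - 1)
    return n, mean, var


def combine(a, b):
    """Merge two (size, mean, variance) summaries with the two formulas above."""
    n, x, sx = a
    m, y, sy = b
    total = n + m
    new_mean = (n * x + m * y) / total
    new_var = (((n - 1) * sx + (m - 1) * sy) / (total - 1)
               + n * m * (x - y) ** 2 / (total * (total - 1)))
    return total, new_mean, new_var


# Hypothetical die rolls: 3 groups of 2 rolls each (values invented for illustration).
groups = [[1, 4], [6, 3], [2, 2]]

# Fold the per-group summaries into one running total (steps 1 to 6).
n_T, mean_T, var_T = reduce(combine, [summarize(g) for g in groups])

# Direct computation over all 6 rolls, for comparison.
direct = summarize([v for g in groups for v in g])

print(n_T, mean_T, var_T)   # 6 3 16/5
print(direct)               # same three values, computed in one pass
```

Only the three numbers per chunk are ever combined, so the raw data of earlier chunks never need to be revisited.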

Since the original poster has claimed that the formula does not work, I will furnish a numerical example to illustrate. This example will employ discrete data to match the scenario described in the question, but realizations from a continuous distribution can just as easily be provided.

Let $D_i$ represent dataset $i$. Then

$$\begin{align*} D_1 &= \{1, 1, 3, 4, 1, 5, 6, 3, 5, 5\} \\ D_2 &= \{5, 6, 2, 4, 2, 1, 1, 4, 2, 4, 4, 1, 3, 5, 6\} \\ D_3 &= \{3, 2, 6, 4, 1, 5, 2, 1, 3, 1, 5, 2, 2\} \\ D_4 &= \{5, 3, 1, 5, 1\} \end{align*}$$

Consequently, $$\begin{array}{|c|c|c|c|} \hline i & n_i & \bar x_i & s_{x_i}^2 \\ \hline 1 & 10 & \frac{17}{5} & \frac{18}{5} \\ \hline 2 & 15 & \frac{10}{3} & \frac{65}{21} \\ \hline 3 & 13 & \frac{37}{13} & \frac{73}{26} \\ \hline 4 & 5 & 3 & 4 \\ \hline \end{array}$$

We now calculate the combined sample sizes, means, and variances of datasets $1$ through $T$:

$$\begin{array}{|c|c|c|c|} \hline T & n_T & \bar x_T & s_T^2 \\ \hline 1 & 10 & \frac{17}{5} & \frac{18}{5} \\ \hline 2 & 25 & \frac{84}{25} & \frac{947}{300} \\ \hline 3 & 38 & \frac{121}{38} & \frac{4245}{1406} \\ \hline 4 & 43 & \frac{136}{43} & \frac{2749}{903} \\ \hline \end{array}$$

The last row represents the total sample size, sample mean, and sample variance for the $4$ combined datasets.
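As a cross-check that does not go through the intermediate rows, the last row can also be recovered in one shot from the per-dataset table. The sketch below relies on the standard split of the total sum of squares into a within-group part, $\sum_i (n_i - 1) s_{x_i}^2$, and a between-group part, $\sum_i n_i (\bar x_i - \bar x_T)^2$. That split is not written out in the answer, so treat it as an assumption of this example, and the variable names are illustrative.

```python
from fractions import Fraction as F

# (n_i, mean_i, variance_i) copied from the per-dataset table above.
stats = [
    (10, F(17, 5), F(18, 5)),
    (15, F(10, 3), F(65, 21)),
    (13, F(37, 13), F(73, 26)),
    (5, F(3), F(4)),
]

N = sum(n for n, _, _ in stats)
grand_mean = sum(n * m for n, m, _ in stats) / N

# Sum of squares inside each dataset, and between dataset means and the grand mean.
within = sum((n - 1) * v for n, _, v in stats)
between = sum(n * (m - grand_mean) ** 2 for n, m, _ in stats)

pooled_var = (within + between) / (N - 1)
print(N, grand_mean, pooled_var)  # 43 136/43 2749/903
```

This agrees with the last row of the table, which is expected, since applying the two-sample formula repeatedly accumulates the same within-group and between-group sums.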

Here is a sample calculation of the aggregate variance of datasets $1$ through $3$:

$$s_T^2 (T = 3) = \frac{(25 - 1)(\frac{947}{300}) + (13 - 1)(\frac{73}{26})}{25 + 13 - 1} + \frac{(25)(13)(\frac{84}{25} - \frac{37}{13})^2}{(25 + 13)(25 + 13 - 1)} = \frac{4245}{1406},$$

which matches the direct calculation based on datasets $D_1, D_2, D_3$.
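For completeness, the same two formulas carry the running totals for datasets $1$ through $3$ (now playing the role of $n = 38$, $\bar x = \frac{121}{38}$, $s_x^2 = \frac{4245}{1406}$) and $D_4$ (with $m = 5$, $\bar y = 3$, $s_y^2 = 4$) to the last row of the table:

$$\bar x_T (T = 4) = \frac{38 \cdot \frac{121}{38} + 5 \cdot 3}{38 + 5} = \frac{136}{43},$$

$$s_T^2 (T = 4) = \frac{(38 - 1)(\frac{4245}{1406}) + (5 - 1)(4)}{38 + 5 - 1} + \frac{(38)(5)(\frac{121}{38} - 3)^2}{(38 + 5)(38 + 5 - 1)} = \frac{4853}{1596} + \frac{35}{9804} = \frac{2749}{903}.$$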

Finally, Mathematica code to replicate the above computations:

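(* Raw data: the four datasets from the example above *)
d1 = {1, 1, 3, 4, 1, 5, 6, 3, 5, 5};
d2 = {5, 6, 2, 4, 2, 1, 1, 4, 2, 4, 4, 1, 3, 5, 6};
d3 = {3, 2, 6, 4, 1, 5, 2, 1, 3, 1, 5, 2, 2};
d4 = {5, 3, 1, 5, 1};

(* Summary of one dataset: {sample size, sample mean, sample variance} *)
stat[x_] := {Length[x], Mean[x], Variance[x]}
data = stat /@ {d1, d2, d3, d4}

(* Combine two summaries {size, mean, variance} using the two formulas above *)
var[{n_, x_, sx_}, {m_, y_, sy_}] := {n + m, (n x + m y)/(n + m),
     ((n - 1) sx + (m - 1) sy)/(n + m - 1) + n m (x - y)^2/((n + m) (n + m - 1))}

(* Running totals for datasets 1 through T; the seed {0, 0, 0} contributes nothing and Rest drops it *)
Rest@FoldList[var[#1, #2] &, {0, 0, 0}, data]

(* Direct computation on the pooled data, to compare with the last running total *)
stat[Join[d1, d2, d3, d4]]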

In the future, rather than simply asserting that the formula doesn't work, it would be more polite and instructive to provide your own computations showing where you are encountering problems, so that your error can be found.