How do I obtain the "final" standard deviation from a series of values containing individual (also null) SD values?


I have a list of values, say:

AVGprice  SDprice  count  year
10        3        5      1999
20        2        3      2000
30        8        20     2001
40        undef    1      2002

where AVGprice is the average of a certain price in a particular year, SDprice is the standard deviation of that price in the same year, and count is the number of occurrences.

What's the best way to calculate the final AVGprice and SDprice for the whole period 1999-2002? For the former, I'd say a weighted average will do, but I am not sure about the final standard deviation, especially because of the undefined value.

Best answer

Let's work it out for the case of splicing together two data sets of size $n_x,n_y$. This generalizes easily enough. You have the averages $\overline{x},\overline{y}$ and the standard deviations $s_x,s_y$. The overall average is easy to compute: as you noticed it is just $\frac{n_x \overline{x}+n_y \overline{y}}{n_x+n_y}$.

For the standard deviations, we may write them as

$$s_x=\left ( \frac{1}{n_x-1} \left ( \sum_{i=1}^{n_x} x_i^2 - n_x \overline{x}^2 \right ) \right )^{1/2}$$

and similarly for $s_y$. You can rearrange this to solve for $\sum_{i=1}^{n_x} x_i^2$:

$$\sum_{i=1}^{n_x} x_i^2 = (n_x-1) s_x^2 + n_x \overline{x}^2.$$

Now let's concatenate the data sets into a single one, $z$, of size $n_z=n_x+n_y$. We want to get

$$s_z=\left ( \frac{1}{n_z-1} \left ( \sum_{i=1}^{n_z} z_i^2 - n_z \overline{z}^2 \right ) \right )^{1/2}.$$

We know $\overline{z}$ and $n_z$ already. For the sum of the squares, we can combine the two sums of squares that we already have to get

$$\sum_{i=1}^{n_x+n_y} z_i^2 = \sum_{i=1}^{n_x} x_i^2+\sum_{i=1}^{n_y} y_i^2=(n_x-1)s_x^2+n_x\overline{x}^2+(n_y-1)s_y^2+n_y\overline{y}^2.$$

So:

$$s_z = \left ( \frac{1}{n_z-1} \left ( (n_x-1) s_x^2+n_x\overline{x}^2+(n_y-1)s_y^2+n_y\overline{y}^2 - n_z \overline{z}^2 \right ) \right )^{1/2}.$$
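As a rough illustration, the formula above could be coded as follows. This is only a sketch: the function name `combine_two` and the sample data are made up for the example, and it assumes the inputs are sample standard deviations (the $n-1$ denominator used throughout this answer). The result is checked against a direct computation on the concatenated raw data.

```python
import math
import statistics


def combine_two(n_x, mean_x, s_x, n_y, mean_y, s_y):
    """Pool two (count, mean, sample SD) summaries into one."""
    n_z = n_x + n_y
    mean_z = (n_x * mean_x + n_y * mean_y) / n_z
    # Recover each group's sum of squares from its summary statistics.
    sum_sq_x = (n_x - 1) * s_x**2 + n_x * mean_x**2
    sum_sq_y = (n_y - 1) * s_y**2 + n_y * mean_y**2
    var_z = (sum_sq_x + sum_sq_y - n_z * mean_z**2) / (n_z - 1)
    # Rounding can push a near-zero variance slightly negative.
    s_z = math.sqrt(max(var_z, 0.0))
    return n_z, mean_z, s_z


# Made-up raw data, used only to compare with a direct computation.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 12.0, 14.0]
n_z, mean_z, s_z = combine_two(
    len(x), statistics.mean(x), statistics.stdev(x),
    len(y), statistics.mean(y), statistics.stdev(y),
)
direct_mean = statistics.mean(x + y)
direct_sd = statistics.stdev(x + y)
print(mean_z, direct_mean)  # the two means should agree
print(s_z, direct_sd)       # the two SDs should agree up to rounding
```

Subtracting $n_z \overline{z}^2$ from a large sum of squares can lose precision when the means are large compared with the spread, so this direct transcription is best treated as a check of the algebra rather than a numerically robust routine.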

This formula can be written in a slightly more revealing (but less efficient) way:

$$s_z = \left ( \frac{1}{n_z-1} \left ( (n_x-1) s_x^2+(n_y-1)s_y^2+n_x \left ( \overline{x}^2-\overline{z}^2 \right )+n_y \left ( \overline{y}^2-\overline{z}^2 \right ) \right ) \right )^{1/2}.$$

This says that the standard deviation of the overall data set comes from variations "within" the individual data sets, as well as from discrepancies between the mean of the overall data set and the means of the individual data sets.
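The "between" part can be made more explicit. Since $n_z \overline{z} = n_x \overline{x} + n_y \overline{y}$, expanding the squares shows that

$$n_x \left ( \overline{x}^2-\overline{z}^2 \right )+n_y \left ( \overline{y}^2-\overline{z}^2 \right ) = n_x \left ( \overline{x}-\overline{z} \right )^2+n_y \left ( \overline{y}-\overline{z} \right )^2,$$

so the same result can also be written as

$$s_z = \left ( \frac{1}{n_z-1} \left ( (n_x-1) s_x^2+(n_y-1)s_y^2+n_x \left ( \overline{x}-\overline{z} \right )^2+n_y \left ( \overline{y}-\overline{z} \right )^2 \right ) \right )^{1/2}.$$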

As for the issue of one of your data sets having just one point in it, work out what it does to $\sum_i z_i^2$ and $\overline{z}$ and proceed accordingly.
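One possible way to carry this out for the table in the question is sketched below. It rests on the observation that a data set with a single point $x_1$ contributes $x_1^2 = \overline{x}^2$ to the sum of squares, and that its $(n_x-1)s_x^2$ term is multiplied by zero, so the undefined standard deviation never enters. The function name `pool_groups`, the use of `None` for the undefined value, and the assumption that SDprice is a sample standard deviation are choices made for this example, not something stated in the question.

```python
import math


def pool_groups(groups):
    """Pool (count, mean, sd) summaries; sd may be None when count == 1."""
    n_total = sum(n for n, _, _ in groups)
    mean_total = sum(n * m for n, m, _ in groups) / n_total
    sum_sq = 0.0
    for n, m, sd in groups:
        if sd is None:
            if n != 1:
                raise ValueError("SD may only be undefined for a single-point group")
            within = 0.0  # (n - 1) * s^2 vanishes when n == 1
        else:
            within = (n - 1) * sd**2
        # Sum of squares of this group, recovered from its summary.
        sum_sq += within + n * m**2
    var_total = (sum_sq - n_total * mean_total**2) / (n_total - 1)
    return n_total, mean_total, math.sqrt(max(var_total, 0.0))


# The table from the question: (count, AVGprice, SDprice) per year.
price_groups = [
    (5, 10.0, 3.0),    # 1999
    (3, 20.0, 2.0),    # 2000
    (20, 30.0, 8.0),   # 2001
    (1, 40.0, None),   # 2002, SD undefined
]
n_all, avg_all, sd_all = pool_groups(price_groups)
print(n_all, round(avg_all, 2), round(sd_all, 2))
```

Under these assumptions the pooled values for 1999-2002 come out to an average of about 25.86 and a standard deviation of about 10.63 over 29 observations.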