Iterative calculation of mean and standard deviation

14.2k Views Asked by At

I'm looking for a formula, to iteratively calculate the mean and standard deviation of a huge list of data points.

I found some examples here (formula 15 f.) and here, but both seem to be falling for my very simple testcase [10,100].

Source 1 states:

$M_1 = x_1$

$S_1 = 0$

$M_k = M_{k-1}+(x_k-M_{k-1})/k$

as well as

$S_k = S_{k-1}+(x_k-M_{k-1})*(x_k-M_k)$

with

$\sigma = \sqrt{S_n/(n-1)}$

This leads me to $M_1 = 10, S_1 = 0$ and $M_2 = 10+(100-10)/2 = 55$ but $S_2 = 0+(100-10)*(100-55) = 4050$ and therefore with $n=2$ to $\sigma \approx 63.6396$. The correct value is $45$, which I get, when I plug in $n = 3$ in the formula for $\sigma$.

Do I understand the formula wrong?

Source 2:

$M_{n+1}=M_n+x_{n+1}$

$S_{n+1}=S_n+\frac{(n*x_{n+1}−M_n)^2}{n(n+1)}$

with the mean given by

$\bar{x}_n= \frac{M_n}{n}$

and the unbiased estimate of the variance is given by

$\sigma_n^2=\frac{S_n}{n+1}$

which leads me to

$M_1 = 10, M_2 = 110, S_1 = 0$

$S_2 = 0+\frac{(2*100-10)^{2}}{2(2+1)} = 6016.6667$

however, if I plug in $n=1$ again this is correct. I feel, that my understanding of indexes is wrong, but why?

3

There are 3 best solutions below

1
On BEST ANSWER

You already noticed that using n + 1 (in your example: 3) in the first formula gives the correct answer while using n (in your example: 2) does not.

We can write this differently: the recursions for $M_n$ and $S_n$ are completely correct, but rather than computing the variance as $S_n/(n-1)$ you want to compute the variance as $S_n/n$.

This is a well known phenomenon:

$S_n/n$ is indeed the variance of the data points you have and hence the answer to your question.

$S_n/(n-1)$ is the best estimate of the (unknown) true variance of the larger underlying population you were drawing your random sample from.

This is quite counter-intuitive, but luckily there is a Wikipedia page dedicated to the issue: https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

0
On

The formula that you need is about half way down the Wikipedia page on the standard deviation in the section Identities & mathematical properties.

Personally in computer code I would calculate three quantities: \begin{equation*} n = \sum_{i=1}^{n} 1 \qquad\qquad S_1 = \sum_{i=1}^{n} x_i \qquad\qquad S_2 = \sum_{i=1}^{n} x_i^2 \end{equation*}

It is obvious how to iterate these. Then the mean & standard deviation are easily calculated as follows: \begin{equation*} \mu_n = \frac{S_1}{n} \qquad\qquad \sigma_n = \sqrt{\frac{S_2}{n}-\biggl(\frac{S_1}{n}\biggr)^2} \end{equation*}

It is this final formula that is in Wikipedia & I can never seem to remember! but is easy to derive from scratch.

3
On

The computationally cheapest way to do this, and also the optimal choice if you want to calculate standard deviations "on the fly" (=for each added point again), is to first calculate a running total of the values and a running total of the squared values. Finally you can calculate the standard deviation at each point (or only for the last point) using this identity

$$\sqrt{\frac{1}{N}\sum_{i=1}^N{(x_i-\overline{x})}^2} = \sqrt{\frac{1}{N}\sum_{i=1}^N{x_i^2}-{\Biggl(\frac{1}{N}\sum_{i=1}^N{x_i}\Biggr)^2}} $$

or in pseudo programming code:

$$ sdev = sqrt( v2sum /N - sqr(vsum/N) ) $$

where $vsum$ is the total of all values, and $v2sum$ is the total of all $N$ values squared (individually). As said, very useful if you want to know the standard deviations on the fly for each new point, as the the calculation time will stay linear (instead of growing with N², when using the standard formula).