Can we get negative variance when calculating it for a small dataset using a naive formula?

142 Views Asked by At

In Knuth's Volume 2 Seminumerical Algorithms, chapter 4.2.2 Accuracy of Floating Point Arithmetic, there's a statement:

Novice programmers who calculate the standard deviation of some observations by using the textbook formula $$\sigma=\sqrt{\left.\left(n\sum_{1\le k\le n}x^2_k-\left(\sum_{1\le k\le n}x_k\right)^2\right)\right/n(n-1)}\tag{14}$$ often find themselves taking the square root of a negative number!

Can this be demonstrated with some small dataset (some tens of points) and the common IEEE $\mathrm{binary64}$ arithmetic, without resorting to values differing by $\sim10$ orders of magnitude?

1

There are 1 best solutions below

3
On BEST ANSWER

If you just want to have an actual example, here it is: Set $n=3$, $x_1=x_2=x_3=1 + 2^{-52}.$ Then you have (the first values are the hexadecimal binary64 values)

x1 = x2 = x3                = 3FF0000000000001 =  1.00000000000000E+0000
x1^2 = x2^2 = x3^2          = 3FF0000000000002 =  1.00000000000000E+0000
a = 3*(x1^2 + x2^2 + x3^2)  = 4022000000000002 =  9.00000000000000E+0000
b = (x1 + x2 + x3)^2        = 4022000000000003 =  9.00000000000001E+0000
a - b                       = BCE0000000000000 = -1.77635683940025E-0015

Admittedly the $x_i$ are somewhat artificial. Another example with equal $x_i$ is $n=6, x_i=0.3$

xi              = 3FD3333333333333 =  3.00000000000000E-0001
xi^2            = 3FB70A3D70A3D70A =  9.00000000000000E-0002
a = 6*sum(xi^2) = 4009EB851EB851EA =  3.24000000000000E+0000
b = sum(xi)^2   = 4009EB851EB851EC =  3.24000000000000E+0000
a - b           = BCD0000000000000 = -8.88178419700125E-0016

And here the expanded second example with non-zero variance (the now $x_1$ is the smallest FP number greater than 0.3)

x1   = 3FD3333333333334 =  3.0000000000000004E-0001
x2   = 3FD3333333333333 =  2.9999999999999999E-0001
x3   = 3FD3333333333333 =  2.9999999999999999E-0001
x4   = 3FD3333333333333 =  2.9999999999999999E-0001
x5   = 3FD3333333333333 =  2.9999999999999999E-0001
x6   = 3FD3333333333333 =  2.9999999999999999E-0001
mean = 3FD3333333333333 =  2.9999999999999999E-0001
sdev = 3C7C9F25C5BFEDD9 =  2.4825341532472729E-0017
a    = 4009EB851EB851EA =  3.2399999999999993E+0000
b    = 4009EB851EB851EE =  3.2400000000000011E+0000
a-b  = BCE0000000000000 = -1.7763568394002505e-0015

Interestingly here the negative difference occurs already for $n=2$

x1   = 3FD3333333333334 =  3.0000000000000004E-0001
x2   = 3FD3333333333333 =  2.9999999999999999E-0001
mean = 3FD3333333333334 =  3.0000000000000004E-0001
sdev = 3C90000000000000 =  5.5511151231257827E-0017
a    = 3FD70A3D70A3D70B =  3.6000000000000004E-0001
b    = 3FD70A3D70A3D70C =  3.6000000000000010E-0001
a-b  = BC90000000000000 = -5.5511151231257827E-0017