How to compare dispersion of data?


From my statistics book, I learned that the standard deviation measures how much the data spread around the mean. Suppose I have two or more data sets, each normalized so that its values add up to the same total. Is calculating the standard deviation a good way to compare how widely each set is spread around its center? For example, I have the following two data sets:

A = [0 0 0 0 0 15 35 15 0 0 0 0 0]; 
B = [6 5 8 6 5 4 3 4 5 8 5 6];

Their plots look like this:

[Two figures: plots of A and B]

Now I estimate the standard deviation of each data set separately and get

STD(A) = 10.6066
STD(B) = 1.5050

But that goes against what I expected: A should have the smaller STD, because it is not as widely spread as B. So my first question is whether STD only works for normally distributed data or can be used for any data. Secondly, if it works for any sample, why does A have a higher STD than B?

P.S. For the second question, I understand from the definition of STD why it gives a higher value for A, but I wonder why it should behave that way when A is so localized.

Best answer

Bayes' theorem can help you out of this dilemma, but we do not even have to go that far to understand what is going on. You can calculate the standard deviation for any two data sets and compare them, but which comparison is meaningful depends on what you want to know and what you assume about the data. For example, if you want to know which you can measure more exactly, the size of a human or the size of a human cell, you would compare the relative standard deviation (the standard deviation divided by the mean). If you want to know which instrument measures lengths more exactly, a microscope or a measuring stick, you compare the absolute standard deviation.
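To make the absolute/relative distinction concrete, here is a minimal Python sketch applied to the two data sets from the question, treating each as a plain list of points. The helper name `relative_std` is made up for illustration, and the sample (N-1) formula is assumed because it reproduces the values quoted in the question.

```python
from statistics import mean, stdev

# Data sets from the question, read as plain lists of values.
A = [0, 0, 0, 0, 0, 15, 35, 15, 0, 0, 0, 0, 0]
B = [6, 5, 8, 6, 5, 4, 3, 4, 5, 8, 5, 6]


def relative_std(values):
    """Sample standard deviation divided by the mean (illustrative helper)."""
    return stdev(values) / mean(values)


for name, data in (("A", A), ("B", B)):
    # stdev() uses the N-1 (sample) formula.
    print(name, "absolute:", round(stdev(data), 4),
          "relative:", round(relative_std(data), 4))
```

The absolute values should agree with the 10.6066 and 1.5050 reported in the question. Under this list-of-points reading, A is the more dispersed set by both measures.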

Relatedly, it depends on how you interpret your data set. The way you ask your second question and the way you drew the data sets suggest that you think of them as histograms. In that case each value is a count at a position along the horizontal axis, so you have to estimate the mean and the standard deviation in a different way, weighting each position by its count, and you get the answer you were expecting. If you instead interpret the data as a list of points, then dataset A obviously has the higher STD, because 15 and 35 are far from each other and from 0, whereas 5, 6, 8, etc. are not that far from each other.
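The histogram reading can be sketched in Python as follows. This is only an illustration under stated assumptions: the bin positions are taken to be the 1-based indices of the entries (the original plots are not available, so the real axis values are unknown), the function name `histogram_std` is hypothetical, and the population formula (dividing by the total count) is used.

```python
from math import sqrt

# Data sets from the question, now read as counts per bin.
A = [0, 0, 0, 0, 0, 15, 35, 15, 0, 0, 0, 0, 0]
B = [6, 5, 8, 6, 5, 4, 3, 4, 5, 8, 5, 6]


def histogram_std(counts, positions=None):
    """Standard deviation of bin positions, weighted by the counts.

    Assumption: if no positions are given, bins sit at 1, 2, ..., len(counts).
    """
    if positions is None:
        positions = range(1, len(counts) + 1)
    total = sum(counts)
    # Count-weighted mean position.
    mean = sum(x * c for x, c in zip(positions, counts)) / total
    # Count-weighted variance around that mean (population form).
    variance = sum(c * (x - mean) ** 2 for x, c in zip(positions, counts)) / total
    return sqrt(variance)


print("A:", round(histogram_std(A), 2))
print("B:", round(histogram_std(B), 2))
```

With these assumptions A comes out at roughly 0.68 and B at roughly 3.62, so the localized data set A now has the smaller spread, which matches the intuition in the question.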