Find StdDev of a percentage list


I need to find the standard deviation of a list of percentages. For example, for [0.5, 0.8, 0.8] Python gives:

>>> import numpy
>>> a = numpy.array([0.5, 0.8, 0.8])
>>> numpy.mean(a)
0.70000000000000007
>>> numpy.std(a)
0.14142135623730953

My question is whether it makes a difference how the percentage list was originally created.

For example, [1000/2000, 80/100, 8/10] gives the same percentages, but should 1000/2000 = 0.5 carry more weight than 8/10? Or is there no difference?

Thank you.


There are 2 best solutions below

0
On BEST ANSWER

No, you don't add fractions by adding the numerators and denominators separately. And a scalar number has no "memory" of how it was computed. $1000/2000$, $1/2$, $0.00006/0.00012$ are all equivalent representations of the same number, $0.5$.

You are possibly thinking of a weighted average, where not all values are considered equally accurate or reliable.

In this case, you would indeed compute $$m=\frac{2000\cdot 0.5+100\cdot 0.8+10\cdot 0.8}{2000+100+10},$$ assuming weights $2000$, $100$ and $10$, just as if you had drawn $0.5$ $2000$ times, $0.8$ $100$ times and $0.8$ again $10$ times. You then no longer have just a list of values, but a list of values and a list of corresponding weights.

You will similarly compute the variance using the weights, $$s^2=\frac{2000\cdot(0.5-m)^2+100\cdot(0.8-m)^2+10\cdot(0.8-m)^2}{2000+100+10},$$ and the standard deviation $s$ is its square root.
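As a rough sketch of the two formulas above, the weighted mean and weighted standard deviation can be computed in plain Python. The helper name `weighted_mean_std` is made up for this illustration, and using the denominators $2000$, $100$, $10$ as weights is the assumption stated above, not something the percentages themselves carry. (`numpy.average` with its `weights` argument should give the same mean.)

```python
import math

def weighted_mean_std(values, weights):
    """Return the weighted mean and the weighted (population) standard deviation.

    Hypothetical helper for illustration; divides by the sum of the weights,
    matching the formulas for m and s^2 above.
    """
    total = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total
    variance = sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)

values = [0.5, 0.8, 0.8]
weights = [2000, 100, 10]  # assumption: the original denominators act as weights

m, s = weighted_mean_std(values, weights)
print(m, s)
```

With equal weights, e.g. `[1, 1, 1]`, the same function reduces to the ordinary mean and the population standard deviation that `numpy.std` returns by default, so the unweighted result from the question is the special case where every percentage counts the same.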

1
On

I think you need to be a little clearer about what your numbers represent. For example, say they represent the proportion of voters in different counties who voted "yes" on some issue.

$$X = \{\frac{2}{10}, \frac{10}{20}, \frac{40}{50}\}$$

and you want to know how the average county feels on this issue, then you might want to compute the average like this -

$$\frac{1}{3}\left( \frac{2}{10} + \frac{10}{20} + \frac{40}{50}\right) = 0.5$$

so the average county is evenly split. However, if you care about how the country feels as a whole, then you want to compute the average like this -

$$\frac{2 + 10 + 40}{10 + 20 + 50} = 0.65$$

so that the country as a whole has a preference for voting "yes".

In the second case, you accounted for the differing sample sizes, and in the first case you didn't. In the second case you cared about the average person (so it matters that there are a different number of people in each county) and in the first case you cared about the average county (so it doesn't matter that they contain a different number of people).

Note that the second example is equivalent to doing

$$\frac{1}{10 + 20 + 50}\left(\frac{2}{10}\times 10 + \frac{10}{20}\times 20 + \frac{40}{50}\times 50\right) = 0.65$$

i.e. it is a weighted average, where the weights are the denominators of the fractions, which represent the total sample size (number of people in each county, in this case).
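The three computations above can be checked with a short sketch using exact fractions. The variable names (`yes_votes`, `voters`, and so on) are chosen for this illustration only; the numbers are the county figures from the example.

```python
from fractions import Fraction

# County data from the example: "yes" votes and total voters per county.
yes_votes = [2, 10, 40]
voters = [10, 20, 50]

proportions = [Fraction(y, n) for y, n in zip(yes_votes, voters)]

# Average county: every county counts equally.
per_county = sum(proportions) / len(proportions)

# Average person: pool all votes before dividing.
pooled = Fraction(sum(yes_votes), sum(voters))

# Weighted average of the proportions, with county sizes as weights.
weighted = sum(p * n for p, n in zip(proportions, voters)) / sum(voters)

print(float(per_county), float(pooled), float(weighted))
```

The pooled figure and the weighted average come out identical, which is the equivalence noted above, while the per-county average differs because it ignores county size.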

How you compute your averages and standard deviations will depend on what your data represents, and what you want to do with it.