Finding the standard deviation of a probability distribution.

Here is the question:

The time, to the nearest whole minute, that a city bus takes to go from one end of its route to the other has the probability distribution shown. As sometimes happens with probabilities computed as empirical relative frequencies, probabilities in the table add up only to a value other than $1.00$ because of round-off error.

$$ \begin{array}{c|cccccc} x & 42 & 43 & 44 & 45 & 46 & 47 \\ \hline P(x) & 0.10 & 0.23 & 0.34 & 0.25 & 0.05 & 0.02 \\ \end{array} $$

a. Find the average time the bus takes to drive the length of its route.
b. Find the standard deviation of the length of time the bus takes to drive the length of its route.

I did the first part and got $E(X)=43.54$, which is the correct answer. However, for the second part, I used the formula $\sigma = \sqrt{(\sum x^{2}P(x))-E(X)^{2}}$ and got approximately $4.517$. The expected answer is $1.204$. Where did I go wrong?

There are 2 best solutions below

BEST ANSWER

You are dealing with a slight inaccuracy due to rounding. By definition, the mean is $\mu = \sum_{i=1}^6 p_ix_i$ and the variance is $\sigma^2 = \sum_{i=1}^6p_i(x_i - \mu)^2.$ By a shortcut formula derived from the definition, $$\sigma^2 = E(X^2) - \mu^2 = \sum_{i=1}^6p_ix_i^2 - \mu^2.$$ However, this formula is very sensitive to round-off error in the probabilities.

In R, the mean can be computed as follows:

p = c(.1,.23,.34,.25,.05,.02); x = 42:47;  mu = sum(p*x);  mu
[1] 43.54

This agrees with what you found.

According to the definition, the variance and standard deviation are

sum(p*(x - mu)^2)
[1] 1.451084
sg = sqrt(sum(p*(x - mu)^2));  sg
[1] 1.204609

But the formula (exaggerating the errors) gives the standard deviation as

sqrt(sum(p*x^2) - mu^2)
[1] 4.517566

I don't know what you are supposed to show as the solution to this problem. However, to make sense of it, I think the logical course of action is to adjust the probabilities so that they add to 1:

sum(p)
[1] 0.99
p1 = p/sum(p); p1; sum(p1)
[1] 0.10101010 0.23232323 0.34343434 0.25252525 0.05050505 0.02020202  # adj probs
[1] 1                                                                  # sum to 1

Then use adjusted probabilities from the start to get the true mean and standard deviation (where both the definition and formula agree):

mu1 = sum(p1*x); mu1; sqrt(sum(p1*(x - mu1)^2));  sqrt(sum(p1*x^2) - mu1^2)
[1] 43.9798
[1] 1.127971
[1] 1.127971

There is a nasty trick hidden in the remark "As sometimes happens...".

The variance is indeed given by

$$V(X)=\sum_i p_i(x_i-\mu)^2=\sum_i p_ix_i^2-\mu^2$$

with $\mu=\sum_i p_ix_i$, and the standard deviation is the square root of the variance. But the second equality only holds if $\sum_i p_i=1$.
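
To see where the condition $\sum_i p_i=1$ enters, it may help to expand the square. Here $S=\sum_i p_i$ is a symbol introduced only for this illustration, and $\mu=\sum_i p_ix_i$ as before:

$$\sum_i p_i(x_i-\mu)^2=\sum_i p_ix_i^2-2\mu\sum_i p_ix_i+\mu^2\sum_i p_i=\sum_i p_ix_i^2-(2-S)\,\mu^2$$

So the two expressions differ by $(1-S)\mu^2$, which vanishes only when $S=1$. With the values in the table, $S=0.99$ and $\mu=43.54$, the gap is $0.01\times43.54^2\approx18.957$: a missing mass of only $0.01$ is multiplied by $\mu^2\approx1896$.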

So what happened? Redo the computation with the last probability being $0.03$ instead of $0.02$, so that the probabilities sum to $1$. Both formulas then yield a variance equal to $1.3499$.
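
This can be checked in R with a minimal sketch. The variable names `p3`, `mu3`, `var_def` and `var_short` are chosen here for illustration only:

```r
# Same table, but with the last probability changed from 0.02 to 0.03
# so that the probabilities sum to 1.
x  <- 42:47
p3 <- c(0.10, 0.23, 0.34, 0.25, 0.05, 0.03)

mu3       <- sum(p3 * x)              # mean, 44.01
var_def   <- sum(p3 * (x - mu3)^2)    # variance from the definition
var_short <- sum(p3 * x^2) - mu3^2    # variance from the shortcut formula

sum(p3)     # 1, up to floating-point error
var_def     # approximately 1.3499
var_short   # approximately 1.3499
```

Once the weights sum to $1$, the two expressions agree up to floating-point error.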

Redo the computation with the last probability $0.02$: the first formula yields a variance of $1.451084$, while the other formula yields $20.4084$. The discrepancy arises because the weights do not sum to $1$.

Notice that the first formula yields a standard deviation $\sqrt{1.451084}\simeq1.20460948$.

What would be best? I suggest this: consider the $p$ as "general" weights (that is, not summing to $1$, since they don't anyway) and compute the mean and variance accordingly. Equivalently, reweight by dividing the $p_i$ by the sum. The standard deviation is then $1.127971255$.
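
A minimal R sketch of this suggestion follows. The helper `weighted_mean_sd` is hypothetical (it is not part of base R or of the original answers) and simply divides by the total weight:

```r
# Hypothetical helper: treats w as general weights that need not sum to 1.
weighted_mean_sd <- function(x, w) {
  total <- sum(w)                          # total weight (0.99 for the table)
  mu    <- sum(w * x) / total              # weighted mean
  v     <- sum(w * (x - mu)^2) / total     # weighted variance
  c(mean = mu, sd = sqrt(v))
}

x <- 42:47
p <- c(0.10, 0.23, 0.34, 0.25, 0.05, 0.02)

res <- weighted_mean_sd(x, p)
res   # mean about 43.9798, sd about 1.127971
```

Dividing by `sum(w)` inside the helper is the same as rescaling the probabilities first, and base R's `weighted.mean(x, p)` normalizes the mean in the same way.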

Note: even if your teacher expects you to apply the first formula blindly, the correct approximation is $1.205$, not $1.204$. But since there is a bias in the mean (too low by roughly $0.01\times44$, considering that the missing "mass" $0.01$ is somewhere between $42$ and $47$), and therefore also in the final result, I would not recommend this.
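
The size of that bias can be bounded with a short R sketch. The names `missing_mass`, `mu_raw` and `mean_bounds` are illustrative, and the only assumption is that the missing mass lies somewhere in the observed range of $x$:

```r
x <- 42:47
p <- c(0.10, 0.23, 0.34, 0.25, 0.05, 0.02)

missing_mass <- 1 - sum(p)     # about 0.01
mu_raw       <- sum(p * x)     # 43.54, the mean computed from the raw table

# Wherever the missing mass sits between 42 and 47, it would add
# between 0.01 * 42 and 0.01 * 47 to the raw mean.
mean_bounds <- mu_raw + missing_mass * range(x)
mean_bounds   # about 43.96 and 44.01
```

The rescaled mean $43.9798$ falls inside this interval, whereas the raw value $43.54$ does not.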

Another note: the exercise showed you that the first formula is more immune to numerical errors (the standard deviation returned is closer to any sensible value you might consider). You should always use this formula, and not the other one.