How does the standard deviation change if you change one value?

Question: There are 23 employees in a particular division of a company. Their salaries have a mean of \$70,000, a median of \$55,000, and a standard deviation of \$20,000. The largest number on the list is \$100,000. By accident, this number is changed to \$1,000,000.

I am asked to calculate the new standard deviation, which is $$ \sqrt{ \frac{\displaystyle \sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{n-1} } $$

where $\bar{x}$ is the new mean, which is $$ \frac{23\cdot 70,000-100,000+1,000,000}{23}=109,130.43 $$

Plugging in $\bar{x}$, I need to calculate $$ \sqrt{ \frac{\displaystyle \sum_{i=1}^{n}(x_{i}-109,130.43)^{2}}{22} } $$

Here's where I'm having trouble. How can I calculate the new standard deviation without knowing each $x_{i}$?

It doesn't seem to help that I know the old standard deviation is $$ \sqrt{ \frac{\displaystyle \sum_{i=1}^{n}(x_{i}-70,000)^{2}}{22} }=20,000 $$ since I don't think the new standard deviation can be rewritten to include this expression.

There are 3 best solutions below

1
On BEST ANSWER

I have seen the solution and understood it. The identity $\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}=\sum_{i=1}^{n}x_{i}^{2}-n\bar{x}^{2}$ (credit to Andrei for pointing it out) simplifies things significantly.

Note: all numbers will be in thousands.

For the original set of $n=23$ numbers containing 100:

For convenience, define $$T_{0}=\sum_{i=1}^{n}x_{i}^{2}$$
Let $\bar{x}_{0}=70$, $V_{0}$, and $S_{0}=20$ be the mean, variance, and standard deviation. We have $$ V_{0}=S_{0}^{2}=400=\frac{1}{n-1}(T_{0}-n\bar{x}_{0}^{2}) $$ Solving for $T_{0}$ gives $T_{0}=22\cdot 400+23\cdot 70^{2}=121,500$.

For the changed set of $n=23$ numbers containing 1000:

$$T_{1}=T_{0}-100^{2}+1000^{2}=1,111,500$$ Let $\bar{x}_{1}=109.13043$, $V_{1}$, and $S_{1}$ be the new mean, variance, and standard deviation. Then $$ V_{1}=S_{1}^{2}=\frac{1}{n-1}(T_{1}-n\bar{x}_{1}^{2})=\frac{1,111,500-23\cdot 109.13043^{2}}{22}\approx 38,071.94 $$ Taking the square root gives $S_{1}\approx 195.12$, and we are done.

The new standard deviation is \$195,120.
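As a quick numerical check of the steps above, here is a minimal Python sketch. The function name `sd_after_replacing_one_value` and its parameters are my own illustration, not part of the original solution; it assumes the sample standard deviation (divisor $n-1$) and works in thousands.

```python
import math

def sd_after_replacing_one_value(n, mean, sd, old, new):
    """Return (new mean, new sample SD) after one value `old` becomes `new`.

    Uses only summary statistics, via sum((x - mean)^2) = sum(x^2) - n*mean^2.
    """
    # Recover the sum of squares T0 from the old variance.
    sum_sq = (n - 1) * sd**2 + n * mean**2
    # Swap the old value for the new one in the sum of squares and in the mean.
    new_sum_sq = sum_sq - old**2 + new**2
    new_mean = (n * mean - old + new) / n
    # Apply the same identity in reverse to get the new variance.
    new_var = (new_sum_sq - n * new_mean**2) / (n - 1)
    return new_mean, math.sqrt(new_var)

# Salaries in thousands: n = 23, mean 70, SD 20, and 100 is changed to 1000.
new_mean, new_sd = sd_after_replacing_one_value(23, 70.0, 20.0, 100.0, 1000.0)
print(round(new_mean, 2), round(new_sd, 2))  # 109.13 195.12
```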

0
On

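Start from $$\begin{aligned}\sum_{i=1}^n(x_i-\bar x)^2&=\sum_{i=1}^n(x_i^2-2x_i\bar x+(\bar x)^2)\\&=\sum_{i=1}^nx_i^2-2\bar x\sum_{i=1}^nx_i+n(\bar x)^2\\&=\sum_{i=1}^nx_i^2-n(\bar x)^2,\end{aligned}$$ where the last step uses $\sum_{i=1}^n x_i=n\bar x$. Proceed the same way as you did for the mean. Can you take it from here?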
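If it helps to see the identity hold numerically, here is a small Python sketch on made-up data. The values in `data` are arbitrary and are not taken from the question.

```python
import math

# Arbitrary illustrative values (not from the question).
data = [55.0, 40.0, 100.0, 62.5, 71.0]
n = len(data)
mean = sum(data) / n

# Left side: sum of squared deviations from the mean.
lhs = sum((x - mean) ** 2 for x in data)
# Right side: sum of squares minus n times the squared mean.
rhs = sum(x * x for x in data) - n * mean**2

print(math.isclose(lhs, rhs))  # True
```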

2
On

Let $(x_1, \ldots, x_{n-1}, x_n)$ be the correct data, and let $(x_1, \ldots, x_{n-1}, x_n')$ be the incorrect data, so $x_n' \ne x_n$.

The sample mean of each data set is $$\bar x = \frac{1}{n} \sum_{i=1}^n x_i, \quad \bar x' = \frac{1}{n} \left(x_n' + \sum_{i=1}^{n-1} x_i \right). \tag{1}$$ So $$\bar x - \bar x' = \frac{1}{n} (x_n - x_n') \tag{2}$$ which allows us to calculate the correct sample mean from the incorrect sample mean.

The correct and incorrect sample variances are, respectively, $$s = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2, \quad s' = \frac{1}{n-1} \left( (x_n' - \bar x')^2 + \sum_{i=1}^{n-1} (x_i - \bar x')^2 \right). \tag{3}$$ We want to write the correct variance in terms of the incorrect one, so this suggests writing $$\begin{align} s &= \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x' + \bar x' - \bar x)^2 \\ &= \frac{1}{n-1} \sum_{i=1}^n \left[ (x_i - \bar x')^2 + 2(x_i - \bar x')(\bar x' - \bar x) + (\bar x' - \bar x)^2 \right] \\ &= \frac{1}{n-1} \left( \sum_{i=1}^n (x_i - \bar x')^2 + 2(\bar x' - \bar x)\sum_{i=1}^n (x_i - \bar x') + n(\bar x' - \bar x)^2 \right) \\ &= \frac{1}{n-1} \left( \sum_{i=1}^n (x_i - \bar x')^2 + 2(\bar x' - \bar x)(n \bar x - n \bar x') + n(\bar x' - \bar x)^2 \right) \\ &= \frac{1}{n-1} \left( \sum_{i=1}^n (x_i - \bar x')^2 - n(\bar x' - \bar x)^2 \right) \\ &= \frac{1}{n-1} \left( (x_n' - \bar x')^2 + \sum_{i=1}^{n-1} (x_i - \bar x')^2 + (x_n - \bar x')^2 - (x_n' - \bar x')^2 - n(\bar x' - \bar x)^2 \right) \\ &= s' + \frac{(x_n - \bar x')^2 - (x_n' - \bar x')^2 - n(\bar x' - \bar x)^2}{n-1} \\ &= s' + \frac{(x_n - \bar x' + x_n' - \bar x')(x_n - \bar x' - x_n' + \bar x') - n(\bar x' - \bar x)^2}{n-1} \\ &= s' + \frac{(x_n + x_n' - 2\bar x')(x_n - x_n') - (x_n - x_n')^2/n}{n-1} \\ &= s' + \frac{x_n - x_n'}{n-1} \left( x_n + x_n' - 2\bar x' - \frac{x_n - x_n'}{n} \right) \\ &= s' + \frac{x_n - x_n'}{n-1} \left( \frac{n-1}{n} x_n + \frac{n+1}{n} x_n' - \bar x' + \frac{x_n - x_n'}{n} - \bar x \right) \\ &= s' + \frac{x_n - x_n'}{n-1}(x_n + x_n' - (\bar x + \bar x')). \tag{4} \end{align}$$
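One way to gain confidence in $(4)$ is to compare it against a direct computation. The following Python sketch does that on random data; the function name `corrected_variance`, the seed, and the size of the error (`37.0`) are arbitrary choices of mine, not part of the derivation.

```python
import random
import statistics

def corrected_variance(n, wrong_mean, wrong_var, wrong_value, right_value):
    """Return (correct mean, correct sample variance) using equations (2) and (4)."""
    diff = right_value - wrong_value
    # Equation (2): the mean shifts by (x_n - x_n') / n.
    right_mean = wrong_mean + diff / n
    # Equation (4): update the variance from summary statistics only.
    right_var = wrong_var + diff / (n - 1) * (
        right_value + wrong_value - (right_mean + wrong_mean)
    )
    return right_mean, right_var

random.seed(0)  # arbitrary seed, for reproducibility only
correct = [random.uniform(0, 100) for _ in range(10)]
wrong = correct[:-1] + [correct[-1] + 37.0]  # corrupt the last value

mean, var = corrected_variance(
    len(wrong), statistics.mean(wrong), statistics.variance(wrong),
    wrong[-1], correct[-1],
)
# Both differences should be zero up to floating-point rounding.
print(abs(mean - statistics.mean(correct)), abs(var - statistics.variance(correct)))
```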


To apply this result, suppose $n = 7$, the incorrect value is $x_7' = 10$, the incorrect mean is $\bar x' = 15$, and the incorrect variance is $s' = 19$. The correct value is $x_7 = 5$. Then the correct mean is $$\bar x = 15 + \frac{1}{7}(5 - 10) = \frac{100}{7},$$ and the correct variance is $$s = 19 + \frac{5 - 10}{7 - 1}\left(5 + 10 - (\tfrac{100}{7} + 15)\right) = \frac{649}{21}.$$
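The arithmetic in this example can be reproduced exactly with Python's `fractions` module. The sketch below uses a helper named `swap_one_value`, which is my own label; because $(4)$ is symmetric in the two data sets, the same helper also handles the salary question (in thousands), where the change goes from the correct value to the incorrect one.

```python
from fractions import Fraction

def swap_one_value(n, mean, var, old, new):
    """Return (mean, sample variance) after one value `old` is replaced by `new`."""
    diff = Fraction(new) - Fraction(old)
    new_mean = Fraction(mean) + diff / n                     # equation (2)
    new_var = Fraction(var) + diff / (n - 1) * (
        Fraction(new) + Fraction(old) - (new_mean + Fraction(mean))
    )                                                        # equation (4)
    return new_mean, new_var

# The example above: n = 7, incorrect value 10, correct value 5.
mean7, var7 = swap_one_value(7, 15, 19, 10, 5)
print(mean7, var7)  # 100/7 649/21

# The salary question in thousands: variance 20^2 = 400, and 100 becomes 1000.
mean23, var23 = swap_one_value(23, 70, 400, 100, 1000)
print(round(float(var23) ** 0.5, 2))  # 195.12
```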