Explaining the standard deviation formula


I'm revisiting standard deviation for the first time in years, and I can't for the life of me recall the difference between the two formulas. I'm also looking for how we arrived at these formulas.

First, we have the sample standard deviation $$ \sqrt{\dfrac{ \sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}}$$

We also have the population standard deviation $$ \sqrt{\dfrac{ \sum_{i=1}^{n}(X_i-\mu)^2}{n}}$$

From what I understand, we square the differences to remove negative values. After that I'm lost. Is the square root there to get back to the scale of the differences, but without the negatives?

Also, why do we divide by $n-1$ for the sample, and by $n$ for the population? Why is there a difference, and can anyone give a real example?

There is 1 answer below

The units of the sample variance $S^2 = \frac{1}{n-1}\sum (X_i - \bar X)^2$ are squared. (If the $X_i$ are in cm, then $S^2$ has units cm$^2$.) Taking the square root to get the sample standard deviation $S$ gets back to the original units. Thus, if the $X_i$ are a random sample from a normal distribution, one can write a 95% confidence interval for the population mean $\mu$ as $\bar X \pm t^* S/\sqrt{n},$ where $t^*$ cuts probability 2.5% from the upper tail of Student's t distribution with $n-1$ degrees of freedom.
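To make the units point concrete, here is a minimal sketch that computes $S^2$ and $S$ step by step. The eight measurements are made-up values in cm, chosen only so the arithmetic is easy to follow, and the variable names are illustrative.

```python
import math
import statistics

# Hypothetical measurements in cm (made-up data for illustration).
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)
x_bar = sum(x) / n                        # sample mean, in cm

ss = sum((xi - x_bar) ** 2 for xi in x)   # sum of squared deviations, in cm^2
s2 = ss / (n - 1)                         # sample variance S^2, in cm^2
s = math.sqrt(s2)                         # sample standard deviation S, back in cm

# Python's statistics module uses the same n - 1 convention.
assert math.isclose(s2, statistics.variance(x))
assert math.isclose(s, statistics.stdev(x))

print(f"mean = {x_bar} cm, S^2 = {s2:.4f} cm^2, S = {s:.4f} cm")
```

Here the squared deviations sum to 32, so $S^2 = 32/7$ cm$^2$, and the square root puts $S$ on the same scale as the data.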

Division by $n-1$ gives $E(S^2) = \sigma^2,$ where $\sigma^2$ is the population variance. This means that $S^2$ is an unbiased estimator of $\sigma^2.$ However, note that $E(S) \ne \sigma$; the bias is negligible for large $n$.
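A short sketch of why $n-1$ is the divisor that makes this work, assuming the $X_i$ are independent with common mean $\mu$ and variance $\sigma^2$ (so that $Var(\bar X) = \sigma^2/n$). Writing $X_i - \bar X = (X_i - \mu) - (\bar X - \mu)$ and expanding the square gives

$$\sum_{i=1}^{n}(X_i-\bar X)^2 = \sum_{i=1}^{n}(X_i-\mu)^2 - n(\bar X-\mu)^2.$$

Taking expectations term by term,

$$E\left[\sum_{i=1}^{n}(X_i-\bar X)^2\right] = n\sigma^2 - n\cdot\frac{\sigma^2}{n} = (n-1)\sigma^2.$$

So dividing the sum by $n-1$ gives expectation $\sigma^2$, while dividing by $n$ gives $\frac{n-1}{n}\sigma^2$, which is too small on average. Intuitively, deviations measured from $\bar X$ are smaller in total than deviations measured from the unknown $\mu$, because $\bar X$ is computed from the same data.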

Also, some kinds of inference about $\sigma^2$ (for example, a confidence interval for $\sigma^2$) use the fact that, for a random sample from a normal distribution, $(n-1)S^2/\sigma^2 \sim \mathsf{Chisq}(\text{df}=n-1),$ the chi-squared distribution with $n - 1$ degrees of freedom.

The formula for the population variance is usually written with a capital $N$, denoting the population size: $\sigma^2 = \frac{1}{N}\sum (X_i - \mu)^2,$ where $X_i$ are the population elements. (There is no discussion of using $N - 1$ here because there is typically no need to estimate $\sigma^2$.)
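As a rough illustration of the two divisors side by side, the sketch below treats a small made-up list as the entire population, computes $\sigma^2$ with $N$ in the denominator, and then repeatedly draws samples of size $n = 3$ with replacement (so the draws are independent). The population values, sample size, and number of repetitions are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# A small, made-up finite population of size N = 8.
population = [2, 4, 4, 4, 5, 5, 7, 9]
sigma2 = statistics.pvariance(population)   # divides by N; equals 4 here

n = 3            # sample size
reps = 100_000   # number of simulated samples

sum_div_n_minus_1 = 0.0
sum_div_n = 0.0
for _ in range(reps):
    sample = rng.choices(population, k=n)   # independent draws, with replacement
    m = sum(sample) / n                     # sample mean
    ss = sum((v - m) ** 2 for v in sample)  # sum of squared deviations
    sum_div_n_minus_1 += ss / (n - 1)
    sum_div_n += ss / n

mean_s2 = sum_div_n_minus_1 / reps   # average of S^2; should be close to sigma2
mean_biased = sum_div_n / reps       # should be close to sigma2 * (n - 1) / n
print(sigma2, round(mean_s2, 2), round(mean_biased, 2))
```

Averaged over many samples, the $n-1$ version should land near $\sigma^2 = 4$, while the $n$ version should land near $4 \cdot \frac{2}{3} \approx 2.67$, which is the systematic underestimate that the $n-1$ divisor corrects.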

Note: There would have been nothing "wrong" with defining $S^2$ using $n$ in the denominator, and some statisticians have (belatedly) recommended that. But using $n-1$ is pervasive, and changing the definition of $S^2$ now would require many adjustments to the formulas and tables used in inference.