In calculating the standard deviation, why do we square the difference from the mean, as opposed to cubing?


This question has been bothering me for a while: In calculating the standard deviation, why do we square the difference from the mean, as opposed to cubing the differences (and then taking the cube root at the end)? Is this just a random choice, since eventually you have to choose some exponent, and "2" was sufficient? Or is there a deeper mathematical reason?

Thanks in advance!


3 Answers

BEST ANSWER

An intuitive way of thinking about it is that standard deviation is a measure of spread. So you need some way of saying for every value in the data set, on average how far away is that value from the mean.

So you take the differences $d_i = x_i - \bar{x}$. What do you do with the differences? If you simply add them up, you get zero, because the positive and negative differences cancel. You want to give equal weight to a positive difference $d_i = +k$ and a negative difference $d_i = -k$, so the first thing you think of is to take the absolute value: spread $= \frac{1}{n}\sum|d_i|$. This is called the mean absolute deviation, and it is, along with related measures, an accepted way to measure spread.
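To see this concretely, here is a small sketch (the data values are made up for illustration) showing that the raw differences cancel to zero while the absolute differences give a usable measure of spread:

```python
# Sketch: raw differences from the mean cancel out; absolute differences don't.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values
mean = sum(data) / len(data)                      # 5.0

diffs = [x - mean for x in data]
print(sum(diffs))   # 0.0 -- positive and negative differences cancel

mad = sum(abs(d) for d in diffs) / len(data)
print(mad)          # 1.5 -- the mean absolute deviation
```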

Unfortunately, the absolute value doesn't play well with calculus: it isn't differentiable at zero, and you often want to differentiate a measure of spread in order to minimize it, for example when fitting a line to a set of data points. So what function gives equal weight to $\pm d_i$ and is easy to differentiate? The simplest is the square of each difference. The average of the squared differences, the variance, is easy to differentiate, and we can scale back to the units of our original data by taking its square root, which gives the standard deviation.
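A minimal sketch of that recipe, using the same illustrative data and the population ($1/n$) convention:

```python
# Sketch: variance as the average of squared differences;
# the square root restores the original units (standard deviation).
import math

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values
mean = sum(data) / len(data)

variance = sum((x - mean) ** 2 for x in data) / len(data)  # population variance
std_dev = math.sqrt(variance)
print(variance)  # 4.0
print(std_dev)   # 2.0
```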

So, at last, why not take the cubed differences? Because cubing a negative difference gives a different result from cubing a positive difference of the same magnitude: if $d_i = -k$ and $d_j = k$, then $d_i^3 = -d_j^3 \neq d_j^3$. In fact, for a symmetric distribution, the average of the cubed differences is zero, which clearly isn't a measure of spread. What the cubed differences do tell you is how skewed the distribution is. (If you go further and take fourth powers, you get into something called kurtosis, which measures how fat the tails of the distribution are.)
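The cancellation for symmetric data can be checked directly. A sketch with made-up data sets, one symmetric and one with a long right tail:

```python
# Sketch: for symmetric data the average cubed difference vanishes,
# while for right-skewed data it comes out positive.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]   # illustrative, symmetric about 0
skewed = [1.0, 1.0, 1.0, 1.0, 6.0]        # illustrative, long right tail

def avg_cubed_diff(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 3 for x in xs) / len(xs)

print(avg_cubed_diff(symmetric))  # 0.0 -- no information about spread
print(avg_cubed_diff(skewed))     # 12.0 -- positive, reflecting the skew
```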


Technically, the smallest power that accomplishes smoothness at zero is $1+\epsilon$ for any $\epsilon > 0$: the derivative of $|x|^{1+\epsilon}$ contains a factor of $|x|^\epsilon$, which vanishes at zero, minimally blunting the sharp 'tip' of the absolute value function. I mention this not to be pedantic, but because it can be useful if you really do want something that largely behaves like the absolute value function but with less bad behavior near zero. Another approach is to use something like the Huber loss function.
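The Huber loss mentioned above is quadratic near zero and linear in the tails, so it is smooth where the absolute value has its tip but still robust for large residuals. A sketch (the crossover parameter `delta` and the sample inputs are illustrative choices, not canonical values):

```python
# Sketch of the Huber loss: quadratic for |r| <= delta (smooth at zero),
# linear beyond delta (behaves like the absolute value in the tails).
def huber(r, delta=1.0):
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * (abs(r) - 0.5 * delta)

print(huber(0.5))  # 0.125 -- quadratic regime
print(huber(3.0))  # 2.5   -- linear regime
```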


The sum of cubes of differences from the mean (or the integral, in the case of a continuous distribution) gives the third central moment, which is a measure of skewness, rather than the variance, which is the second central moment.

More technically, the skewness $\gamma_1$ is defined by:

$$ \gamma_1 = \frac{E[(X-\mu)^3]}{\sigma^3} $$

where $\mu$ is the mean of the distribution of the random variable $X$ and $\sigma$ is the standard deviation (the square root of the variance).
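A sample-based sketch of this formula, using the population ($1/n$) convention and illustrative data:

```python
# Sketch: skewness as the standardized third central moment,
# gamma_1 = E[(X - mu)^3] / sigma^3, estimated from a sample.
import math

def skewness(xs):
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return sum((x - mu) ** 3 for x in xs) / n / sigma ** 3

print(skewness([-2.0, -1.0, 0.0, 1.0, 2.0]))  # 0.0 -- symmetric data
print(skewness([1.0, 1.0, 1.0, 1.0, 6.0]))    # 1.5 -- right-skewed data
```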