Degrees of Freedom, Standard Deviation, and the Geometry of it all

This question is going to be a little broad because I'm still quite a novice in the subject I'm asking about. Regardless, here goes:

I've started learning statistics, and something that I've found really beautiful is the connection between statistics and geometry.

As I've come to understand the degrees of freedom, they're the number of dimensions of the sub-space that a random vector is constrained to. Here, let me start by giving a brief example, and then I'll get on with my question:


Brief Example:

If $\vec{x}$ is a vector of $n$ independent observations of some random variable $X$, then it has $n$ degrees of freedom, because each of its components can take any value, regardless of the values the other components take. In other words, $\vec{x}$ can lie anywhere in $n$-dimensional space.

As another example, let $\bar{x}$ be the mean of our $n$ datapoints.

Then, the error vector...

$$\vec{e} = \begin{bmatrix}x_1-\bar{x}\\x_2-\bar{x}\\ x_3-\bar{x}\\ \vdots \\x_n-\bar{x}\end{bmatrix}$$

...which contains the errors of our sample-set from its mean as its components, is constrained to an $(n-1)$ dimensional subspace, as $\vec{e}\cdot\vec{1}$, where $\vec{1}$ is a vector with $n$ rows of $1$, is equal to zero.

That is, the sum of the signed errors from the mean is zero.
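As a quick numerical check of the constraint $\vec{e}\cdot\vec{1} = 0$, here is a minimal sketch using NumPy. The four sample values are made up purely for illustration; any real numbers would behave the same way.

```python
import numpy as np

# Hypothetical sample of n = 4 observations (illustrative values only).
x = np.array([2.0, 5.0, 7.0, 10.0])

e = x - x.mean()         # error vector: each observation minus the sample mean
ones = np.ones_like(x)   # the all-ones vector

dot = float(e @ ones)              # sum of the signed errors
abs_sum = float(np.abs(e).sum())   # sum of the absolute errors, for contrast

print(dot, abs_sum)
```

The dot product vanishes because the signed errors cancel; the absolute errors do not. Since $\vec{e}$ is always orthogonal to $\vec{1}$, it lives in the $(n-1)$-dimensional subspace perpendicular to $\vec{1}$.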

Wikipedia explains it better than I do btw: https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)


My Question

I originally had a longer question...but, as I don't want to say too much because I may very well be extremely confused...

What role, geometrically, does $n-1$ play when calculating the standard deviation of a sample set?

Here are my thoughts so far...

If we know the mean of the population which we took the sample from, and the sample is to be representative of the population, then the mean of the sample must be equal to the mean of the population, and thus it can only exist in $(n-1)$ dimensional space (because once we know $n-1$ rows, we must also know the last component if the sample is to have a specific known mean).

Then, for some reason, we need to divide the length of this vector by the square root of the number of dimensions it can exist in...why?

....

I know my question isn't yet very clear, but that's because I'm still quite confused. I'll try to update it more as I learn more about degrees of freedom.

Thank you.

There is 1 best solution below


There are some issues in your question.

You say "If we know the mean of the population which we took the sample from, and the sample is to be representative of the population, then the mean of the sample must be equal to the mean of the population". That is not correct, and we would expect the sample mean $\bar x$ and the population mean $\mu$ to be slightly different because of the nature of the random sampling.

It is this difference that affects how you estimate the population variance from your sample data. If the population has variance $\sigma^2 \gt 0$ then $\mathbb E\left[\frac1n \sum (x_i-\mu)^2\right] = \sigma^2$ and so $\frac1n \sum (x_i-\mu)^2$ is a natural estimator of $\sigma^2$.

The sample mean $\bar x$ is closer to the sampled $x_i$ than the population mean $\mu$ is, in the mean-square sense, meaning $\frac1n \sum (x_i-\bar x)^2 \lt \frac1n \sum (x_i-\mu)^2$ unless $\bar x= \mu$. In fact $\mathbb E\left[\frac1n \sum (x_i-\bar x)^2\right] = \frac{n-1}{n} \sigma^2$, and this makes $\frac1{n-1} \sum (x_i-\bar x)^2$ an unbiased estimator of $\sigma^2$. If you know $\mu$ then $\frac1n \sum (x_i-\mu)^2$ is in a sense a better estimator, but if you do not know $\mu$ then $\frac1{n-1} \sum (x_i-\bar x)^2$ is not unreasonable. Taking the square root of the variance estimate to get the standard deviation loses the exact unbiasedness but leads to the $\frac1{\sqrt{n-1}}$ you mention in your question.
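These expectations can be checked by simulation. The sketch below is only an illustration under assumptions of my own choosing: normally distributed data, $n = 5$, $\mu = 10$, $\sigma = 2$, and a fixed random seed. The expectation formulas themselves do not depend on normality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative parameters (not from the question).
n, mu, sigma = 5, 10.0, 2.0
trials = 200_000

# Each row is one sample of n independent observations.
samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1, keepdims=True)

ss_xbar = ((samples - xbar) ** 2).sum(axis=1)  # squared deviations from the sample mean
ss_mu = ((samples - mu) ** 2).sum(axis=1)      # squared deviations from the population mean

known_mu = (ss_mu / n).mean()          # should be close to sigma^2
div_n = (ss_xbar / n).mean()           # should be close to (n-1)/n * sigma^2
div_n1 = (ss_xbar / (n - 1)).mean()    # should be close to sigma^2

print(known_mu, div_n, div_n1)
```

With $\sigma^2 = 4$ and $n = 5$, the averages should land near $4$, $3.2$ and $4$ respectively, up to simulation noise.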

Strictly speaking, the move from $n$ dimensions to an $(n-1)$-dimensional surface, with the so-called loss of a degree of freedom, does not in itself cause the particular change in the divisor, but it has this effect in these particular circumstances because of the nature of that surface.
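One way to see what is special about this surface is that it is a flat subspace orthogonal to $\vec{1}$, so Pythagoras applies: the deviation vector $\vec{x} - \mu\vec{1}$ splits into a part along $\vec{1}$ and the error vector $\vec{e}$, giving $\sum (x_i-\mu)^2 = \sum (x_i-\bar x)^2 + n(\bar x-\mu)^2$. For independent observations with common variance $\sigma^2$, the left side has expectation $n\sigma^2$ and the last term has expectation $\sigma^2$, leaving $(n-1)\sigma^2$ for the error vector. The sketch below checks the decomposition numerically; the parameters and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed illustrative parameters (not from the question).
n, mu, sigma = 6, 3.0, 1.5
x = rng.normal(mu, sigma, size=n)

ones = np.ones(n)
d = x - mu * ones                  # deviation from the population mean
along = (d @ ones) / n * ones      # projection of d onto the direction of the all-ones vector
perp = d - along                   # projection of d onto the (n-1)-dimensional subspace

e = x - x.mean()                   # error vector from the sample mean

print(np.allclose(perp, e))                            # the orthogonal part is exactly e
print(np.isclose(d @ d, along @ along + perp @ perp))  # Pythagoras
```

The component along $\vec{1}$ has squared length $n(\bar x-\mu)^2$, which is the part of the variation that is lost when $\mu$ is replaced by $\bar x$.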