What is the official proof behind the standard deviation? Why do we use $n-1$ rather than $n$ when computing it from a sample? And how much difference does using $n-1$ make if we take a sample of 1 million from a population of 90 million?
Trouble understanding Standard deviation
142 Views. Asked by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail). There are 3 best solutions below.
First, when people say "standard deviation", they could mean any of several different things: population standard deviation, sample standard deviation, corrected sample standard deviation, standard deviation of a random variable, etc.
It is well known that given a random variable $X$, its standard deviation is simply the square root of its variance. That is,
$$\sigma(X)=\sqrt{\text{Var}[X]}$$
The "standard deviation" that has the $n-1$ term is most likely the "corrected sample standard deviation". The reason we use $n-1$ instead of $n$ is that, with $n$ in the denominator, the estimate of the population variance is biased: it is systematically too small.
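A quick simulation makes the bias visible (a minimal sketch in Python, purely illustrative; the sample size and trial count are my own choices):

```python
import random

# Compare dividing by n vs. n-1 when estimating the variance of a
# standard normal population (true variance sigma^2 = 1).
random.seed(1)
n, trials = 5, 200_000

biased_sum = 0.0    # running total of Q/n
unbiased_sum = 0.0  # running total of Q/(n-1)
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    q = sum((x - xbar) ** 2 for x in xs)  # Q = sum of squared deviations
    biased_sum += q / n
    unbiased_sum += q / (n - 1)

biased_mean = biased_sum / trials      # near (n-1)/n = 0.8: too small
unbiased_mean = unbiased_sum / trials  # near 1.0: unbiased
print(biased_mean, unbiased_mean)
```

Dividing by $n$ converges to $\frac{n-1}{n}\sigma^2 = 0.8$ here, not to $\sigma^2 = 1$; dividing by $n-1$ removes the bias. Note that for a sample of 1 million from a population of 90 million, the factor $\frac{n-1}{n}$ is negligibly different from 1, so the choice barely matters at that scale.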
Extended comment: This is an additional posting to give simulated results for mean squared error (MSE) and mean absolute error (MAE), using various denominators $n-3, n-2, \dots, n+2$ in estimating the population variance. This simulation arose out of a discussion with OP that extends a bit beyond the original Question. Results here are for normal samples of size $n=5,$ specifically $\mathsf{Norm}(\mu = 100, \sigma = 15).$ But the general conclusions hold for other normal populations and other $n$ as well; distinctions between denominators are relatively large for $n = 5.$
The MSE is a criterion for 'closeness' of an estimator $T$ to a parameter $\tau$: $$MSE_T = E[(T - \tau)^2] = b_T^2 + \operatorname{Var}(T),$$ where the bias is $b_T = E(T - \tau) = E(T) - \tau.$ Thus the MSE of an unbiased estimator equals its variance. In simulations it is usual to report $RMSE = \sqrt{MSE}$ in order to keep all quantities in the same units as the original population and data.
According to MSE, it is possible for an estimator with small bias and small variance to be judged 'better' than an unbiased estimator with large variance.
Another criterion for closeness is $MAE_T = E[|T - \tau|],$ which puts less emphasis on extreme estimation errors.
Letting $Q = \sum(X_i - \bar X)^2,$ the simulation below with normal data shows results for estimators $Q/(n-3)$ through $Q/(n+2).$ In addition to the usual $V_{std} = Q/(n-1),$ which is unbiased, "winners" are $V_y = Q/(n+1)$ (smallest MSE) and $V_x = Q/n$ (smallest MAE). Based on a million normal samples of size $n = 5,$ values should be accurate to about three significant digits.
SUMMARY TABLE WITH APPROXIMATE VALUES

Denom   Expectation   RMSE   MAE
n-3         450        390   272
n-2         300        226   160
n-1         225*       160   122
n           180        135   112*
n+1         150        130*  113
n+2         129        132   118

(* marks the unbiased Expectation, the smallest RMSE, and the smallest MAE.)
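These 'winners' can be confirmed analytically for normal data (a standard computation, sketched here for completeness). Since $Q/\sigma^2 \sim \chi^2_{n-1},$ we have $E(Q) = (n-1)\sigma^2$ and $\operatorname{Var}(Q) = 2(n-1)\sigma^4,$ so for the estimator $T_c = Q/c,$

$$MSE_{T_c} = \left(\frac{(n-1)\sigma^2}{c} - \sigma^2\right)^2 + \frac{2(n-1)\sigma^4}{c^2} = \frac{\sigma^4}{c^2}\left[\big(c - (n-1)\big)^2 + 2(n-1)\right].$$

Setting the derivative with respect to $c$ to zero gives $c = n+1.$ For $n = 5,\ \sigma = 15$: at $c = n-1$ the estimator is unbiased with $RMSE = \sqrt{2\sigma^4/(n-1)} = 159.1,$ and at $c = n+1$ the formula gives $RMSE = 225/\sqrt 3 \approx 129.9,$ both matching the simulated values.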
The simulation in R statistical software is shown below:
m = 10^6; n = 5                                     # 10^6 normal samples of size 5
x = rnorm(m*n, 100, 15); DTA = matrix(x, nrow=m)    # one sample per row
a = rowMeans(DTA)                                   # sample means
ssq = rowSums((DTA - a)^2)                          # Q = sum of squared deviations, per row
v.u = ssq/(n-3); v.v = ssq/(n-2); v.std = ssq/(n-1)
v.x = ssq/n; v.y = ssq/(n+1); v.z = ssq/(n+2)
mean(v.u); mean(v.v); mean(v.std); mean(v.x); mean(v.y); mean(v.z)
[1] 450.0821
[1] 300.0547
[1] 225.041 # UNBIASED: pop var is 225
[1] 180.0328
[1] 150.0274
[1] 128.5949
# RMSEs
sqrt(mean((v.u-225)^2)); sqrt(mean((v.v-225)^2)); sqrt(mean((v.std-225)^2))
[1] 390.4485
[1] 225.549
[1] 159.5212 # exact: sqrt(2*15^4/(5-1)) = 159.099
sqrt(mean((v.x-225)^2)); sqrt(mean((v.y-225)^2)); sqrt(mean((v.z-225)^2))
[1] 135.3076
[1] 130.118 # smallest
[1] 132.677
# MAEs
mean(abs(v.u-225)); mean(abs(v.v-225)); mean(abs(v.std-225))
[1] 271.8954
[1] 159.5899
[1] 122.0998
mean(abs(v.x-225)); mean(abs(v.y-225)); mean(abs(v.z-225))
[1] 111.7595 # smallest
[1] 112.5548
[1] 117.9307
A plot of the simulated distributions of the six estimators is shown below. Reading across the first row, denominators $n-3$ and $n-2$ are objectionable because of large positive bias with long right tails of extreme overestimates. At lower right, $n + 2$ is objectionable because of large negative bias (not offset by its smaller right tail of extreme overestimates). Vertical lines at $\sigma^2 = 225$ show the parameter value being estimated.

@Misakov is correct that $n-1$ is part of the definition of the sample standard deviation. So there is no 'proof' that one should use $n-1$ instead of $n$. However, there are reasons that $n-1$ appears in the definition.
The reason can be explained at various levels. I'll give a few.
Computational.
The sample variance is defined as $S^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1}$ and the standard deviation (SD) is its square root. Suppose you want to compute the variance of the data 3, 1, 6, 2, with $n = 4.$ First, the sample mean is $\bar X = (3 + 1 + 6 + 2)/4 = 3.$ Then you might use the following table:

  i     X_i    X_i - Xbar    (X_i - Xbar)^2
  1      3         0                0
  2      1        -2                4
  3      6         3                9
  4      2        -1                1
 Tot    12         0               14
So $S^2 = 14/3$ and the SD $S = \sqrt{14/3} = 2.16.$
Now imagine that the second row is, for some reason, illegible (faulty FAX transmission or old photocopy machine). Then, given the rest of the table, including the totals, you could reconstruct the data and the variance. (E.g., 3 + smudge + 6 + 2 = 12, so the smudge must be 1. And similarly, $(X_2 - \bar X)^2$ must be 4.)
To describe this situation the terminology 'degrees of freedom' (df) is used. Given the structure of the computation, only three of the deviations are free to vary, not four (they must sum to 0). So we say df $ = n - 1 = 4 - 1 = 3.$ That is why we 'average' the squared deviations by dividing by df instead of $n.$
Linear Algebra.
Statisticians tend to think of data $X = (3, 1, 6, 2)$ as a vector in $n$-dimensional space. Roughly speaking, one linear relationship is used to estimate the population mean $\mu$ by $\bar X,$ leaving $n-1$ dimensions in which to estimate the population variance $\sigma^2$ by $S^2.$ In many applications 'degrees of freedom' refers to the dimension of a subspace of this $n$-dimensional space.
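In symbols (a standard sketch of this idea): with $\mathbf 1 = (1, \dots, 1),$ the data vector decomposes orthogonally as

$$X = \bar X\,\mathbf 1 + (X - \bar X\,\mathbf 1), \qquad (X - \bar X\,\mathbf 1) \perp \mathbf 1,$$

since $\sum_i (X_i - \bar X) = 0.$ The first term lies in the 1-dimensional subspace spanned by $\mathbf 1$; the residual vector lies in its $(n-1)$-dimensional orthogonal complement, and $\sum_i (X_i - \bar X)^2 = \|X - \bar X\,\mathbf 1\|^2$ is its squared length. Hence the sum of squares carries $n-1$ degrees of freedom.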
Estimation.
If $S^2$ is defined with $n-1$ in the denominator, then considering $S^2$ as a random variable, we have $E(S^2) = \sigma^2.$ Roughly, this means that when the sample variance is used to estimate the population variance it is an 'unbiased' estimator--not always giving the exact value of $\sigma^2$ but not systematically either overestimating or underestimating $\sigma^2.$ If we were to define the sample variance as $V = \frac{\sum (X_i - \bar X)^2}{n},$ then, as an estimate of $\sigma^2,$ the estimate $V$ would tend to be too small.
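The unbiasedness claim follows from a short computation. Using the identity $\sum (X_i - \bar X)^2 = \sum (X_i - \mu)^2 - n(\bar X - \mu)^2$ and $\operatorname{Var}(\bar X) = \sigma^2/n$:

$$E\left[\sum_{i=1}^n (X_i - \bar X)^2\right] = n\sigma^2 - n\cdot\frac{\sigma^2}{n} = (n-1)\sigma^2,$$

so $E(S^2) = \sigma^2,$ while $E(V) = \frac{n-1}{n}\sigma^2 < \sigma^2.$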
Notes: (1) If $X_1, X_2, \dots, X_N$ is considered as the entire population, with mean $\mu = \frac 1N \sum_{i=1}^N X_i,$ then we define $\sigma_X^2 = \frac{\sum_{i=1}^N (X_i - \mu)^2}{N}.$ If you have a statistical calculator, it may have two buttons, one labeled (for abbreviation) something like $\sigma_{n-1}$ and another labeled $\sigma_{N}$; the former is for the sample standard deviation and the latter for the population standard deviation.
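The same two conventions appear in software. For instance, Python's standard `statistics` module (used here only as an illustration alongside the R above) provides both, applied to the worked example 3, 1, 6, 2:

```python
import statistics

data = [3, 1, 6, 2]  # n = 4, sample mean 3, sum of squared deviations 14

# "sigma_{n-1}" button: sample SD, denominator n - 1
print(statistics.stdev(data))   # sqrt(14/3), approx 2.160

# "sigma_N" button: population SD, denominator N
print(statistics.pstdev(data))  # sqrt(14/4), approx 1.871
```

In R the analogues are `sd(data)` for the $n-1$ version and a manual `sqrt(mean((data - mean(data))^2))` for the $N$ version, since base R has no built-in population SD.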
(2) You are certainly not the first person to wonder about this. A few elementary statistics texts (e.g., Freedman, Pisani, and Purves) have defined $S^2$ with $n$ in the denominator, on the grounds that for large $n$ there is not much difference between using $n$ and $n-1.$ That has not become standard, possibly because it requires extra discussion later on when doing estimation from small samples and when using Student's t and chi-squared distributions for hypothesis testing or making confidence intervals. Also, I have seen a few recent statistics-education-oriented articles that advocate using $n.$
(3) [Added later] I see you asked about using $n-2$ or $n-3$. Those aren't good choices. But for normal data there is some theoretical justification for $n$ or $n+1.$ Dividing by these gives biased results, but unbiasedness isn't everything: criteria for the estimate being 'close' to $\sigma^2$ suggest $n$ (minimum mean absolute error) or $n+1$ (minimum mean squared error) may be desirable.