What is the relationship between the empirical variance from the sampled data and its true variance?


To simplify my question, assume that I independently sample $M$ data points $\{x_i\}_{i=1}^M$ from a Gaussian distribution $N(\mu,\sigma^2)$. From $\{x_i\}_{i=1}^M$ I can calculate the sample mean $\hat{\mu}$ and the sample standard deviation $\hat{\sigma}$. My question is: what is the relation between $\sigma$, $\hat{\sigma}$, and $M$? As $M$ increases, does $\hat{\sigma}$ converge to $\sigma$? Can the result be extended to more general cases, such as other distributions?
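For the Gaussian case, a quick simulation shows $\hat{\sigma}$ settling toward $\sigma$ as $M$ grows (a minimal sketch; the values of $\mu$, $\sigma$, and the seed are arbitrary choices, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0  # hypothetical "true" parameters

# Sample standard deviation (ddof=1 gives the unbiased-variance estimator)
# for increasing sample sizes M:
estimates = {M: rng.normal(mu, sigma, size=M).std(ddof=1)
             for M in (10, 100, 10_000, 1_000_000)}
for M, s in estimates.items():
    print(f"M={M:>9}: sigma_hat={s:.4f}")
```

The printed estimates scatter widely for small $M$ and hug $\sigma = 2$ tightly for large $M$.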

I ask this question because in machine learning/data mining we may have several different algorithms/methods (say, 3 algorithms) for one task. In experiments, we can choose one algorithm, run it once, and obtain a precision of 90%; running it a second time may give a precision of 92%. Each algorithm can be run independently $\beta$ times, the $\beta$ precision results averaged, and the variance of those $\beta$ results calculated. We then compare the average precision and variance of each algorithm; the algorithm with higher average precision and smaller variance is the best. So how should $\beta$ be chosen? If $\beta=5$, which is very small, the variance may be poorly estimated. If $\beta=5000$, which is very large, running the algorithm $5000$ times is computationally expensive.
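To get a feel for how $\beta$ affects the variance estimate itself, one can simulate the repeated experiment (a sketch under the assumption that per-run precision is roughly Gaussian; the mean 0.90 and SD 0.02 are made-up values for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean, true_sd = 0.90, 0.02  # made-up precision distribution of one algorithm

# For each beta, repeat the "run beta times, compute the sample SD" experiment
# 2000 times and measure how much the SD estimate itself fluctuates.
spreads = {}
for beta in (5, 20, 100):
    sds = rng.normal(true_mean, true_sd, size=(2000, beta)).std(ddof=1, axis=1)
    spreads[beta] = sds.std()
    print(f"beta={beta:>3}: SD estimate fluctuates by about {spreads[beta]:.4f}")
```

The fluctuation of the SD estimate shrinks as $\beta$ grows, which is exactly the tradeoff in the question.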


There are 2 answers below.


When you say you compute $\hat{\mu}$ and $\hat{\sigma}$, I am assuming you are using the standard sample formulas. These formulas give you estimates of the true mean and standard deviation, and they are consistent; that is, one can show that they converge to the true mean and standard deviation as $M\rightarrow\infty$.

As you pointed out in the second paragraph, having a small $\beta$ is like having a small $M$, so we know very little about how good the estimates $\hat{\mu}$ and $\hat{\sigma}$ will be. The larger $\beta$, the better. Of course there is a tradeoff: as in any empirical work, sometimes you just have to deal with the fact that you don't have a big enough sample. Sometimes, as in your case, it means you will need more computing power.


First, to establish notation, consider process A, giving observations $X_1, X_2, \dots, X_n$ on $n$ independent runs, with $X_i \text{ iid } \operatorname{Norm}(\mu_X, \sigma_X).$ Then $S_X^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2,$ where $\bar X = \frac{1}{n}\sum_{i=1}^n X_i,$ satisfies $(n-1)S_X^2/\sigma_X^2 \sim \operatorname{Chisq}(\mathrm{df}=n-1).$

Confidence Interval for a Population Variance or SD. Thus a 95% confidence interval (CI) for $\sigma_X^2$ is of the form $\left((n-1)S_X^2/U, (n-1)S_X^2/L\right),$ where $U$ and $L$ cut probability $2.5\%$ from the upper and lower tails, respectively, of $\operatorname{Chisq}(n-1).$ And a CI for $\sigma_X$ is found by taking square roots of the endpoints of the CI for $\sigma_X^2.$
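Such a CI can be computed directly from chi-square quantiles, e.g. via `scipy.stats.chi2` (a sketch; the sample, $n$, and true parameters are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, sigma_true = 20, 10.0                  # hypothetical run count and true SD
x = rng.normal(50.0, sigma_true, size=n)  # hypothetical observations

s2 = x.var(ddof=1)
L = stats.chi2.ppf(0.025, df=n - 1)  # cuts 2.5% from the lower tail
U = stats.chi2.ppf(0.975, df=n - 1)  # cuts 2.5% from the upper tail

ci_var = ((n - 1) * s2 / U, (n - 1) * s2 / L)   # 95% CI for sigma^2
ci_sd = (ci_var[0] ** 0.5, ci_var[1] ** 0.5)    # 95% CI for sigma
print(f"95% CI for sigma: ({ci_sd[0]:.2f}, {ci_sd[1]:.2f})")
```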

As $n$ increases, such confidence intervals become narrower, indicating progressively better precision in estimating $\sigma_X$ with increasing $n$. However, this improvement occurs relatively slowly. In particular, if $\sigma_X = 10,$ then average lengths of such CIs for $\sigma_X$ with $n = 5, 10,$ and $20$ are about $21, 11,$ and $6,$ respectively. As I tell my students: "Sample variances are very variable."
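The slow narrowing can be checked by simulation (a rough sketch; $\sigma_X = 10$ and the sample sizes are taken from the paragraph above, the number of repetitions is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
sigma, reps = 10.0, 5000   # sigma_X = 10 as in the text; 5000 simulated samples

avg_len = {}
for n in (5, 10, 20):
    L = stats.chi2.ppf(0.025, df=n - 1)
    U = stats.chi2.ppf(0.975, df=n - 1)
    s = rng.normal(0.0, sigma, size=(reps, n)).std(ddof=1, axis=1)
    # length of the CI (sqrt((n-1)/U)*s, sqrt((n-1)/L)*s) for sigma
    avg_len[n] = ((np.sqrt((n - 1) / L) - np.sqrt((n - 1) / U)) * s).mean()
    print(f"n={n:>2}: average CI length ~ {avg_len[n]:.1f}")
```

The average lengths come out near the quoted values, and the decrease from $n=5$ to $n=20$ is indeed modest.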

Variance-Ratio tests for comparing population variances. One of your main purposes seems to be to compare the variances of two processes A and B, giving observations $X_i$ and $Y_i$ respectively, by looking at ratios $S_X^2/S_Y^2.$ Such ratios have the well-known variance-ratio or F distribution (with $\nu_A = n_A -1$ numerator degrees of freedom and $\nu_B = n_B -1$ denominator degrees of freedom).
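A variance-ratio test along these lines can be sketched as follows (the two simulated processes and their parameters are made up for illustration; `scipy.stats.f` supplies the F distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
nA = nB = 5
x = rng.normal(0.90, 0.02, size=nA)  # made-up precision runs for process A
y = rng.normal(0.90, 0.04, size=nB)  # process B, with twice the SD

F = x.var(ddof=1) / y.var(ddof=1)
# two-sided p-value for H0: sigma_X^2 == sigma_Y^2
p = 2 * min(stats.f.cdf(F, nA - 1, nB - 1), stats.f.sf(F, nA - 1, nB - 1))
print(f"F = {F:.3f}, p = {p:.3f}")
```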

However, tests comparing sample variances from small samples in this way have notoriously poor 'power'; that is, poor ability to distinguish between the corresponding population variances. For example, if $n_A = n_B = 5$ (five runs with each process), then the ratio of the larger sample variance to the smaller must exceed $9.6$ to $1$ in order to be judged significantly different at the $5\%$ level. If $\sigma_Y = 2\sigma_X$ (so that the ratio of population standard deviations is $2:1$ and the ratio of population variances is $4:1$), then you have about $1$ chance in $5$ of confirming the inequality with an F-test.
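The "1 chance in 5" figure can be checked by simulation (a sketch assuming normal data with $\sigma_Y = 2\sigma_X$ and $n_A = n_B = 5$, as in the example above; the seed and repetition count are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 5, 10_000
crit = stats.f.ppf(0.975, n - 1, n - 1)  # larger/smaller ratio must exceed this

vx = rng.normal(0.0, 1.0, size=(reps, n)).var(ddof=1, axis=1)
vy = rng.normal(0.0, 2.0, size=(reps, n)).var(ddof=1, axis=1)  # sigma_Y = 2*sigma_X
ratio = np.maximum(vx, vy) / np.minimum(vx, vy)
power = (ratio > crit).mean()
print(f"critical ratio ~ {crit:.2f}, estimated power ~ {power:.3f}")
```

The critical ratio is the quoted $9.6$, and the estimated power lands near $0.2$.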

You can read more about 'F-tests' in the Wikipedia article, and there are several papers on the Internet about the 'power of variance-ratio tests'.