Sample Variance and Population Variance for Ungrouped data


A study of the effect of smoking on sleep patterns is conducted. The measure observed is the time, in minutes, that it takes to fall asleep. These data are obtained:

Smokers: $69.3, 56.0, 22.1, 47.6, 53.2, 48.1, 52.7, 34.4, 60.2, 43.8, 23.2, 13.8$

Non-Smokers: $28.6, 25.1, 26.4, 34.9, 28.8, 28.4, 38.5, 30.2, 30.6, 31.8, 41.6, 21.1, 36.0, 37.9, 13.9$

I have to find the variance for each of the given groups.

Question

As we know, there are two types of variance: sample variance and population variance. Which type should I use to find the variance of the given ungrouped data, and why?


There are 2 best solutions below


This is a common source of confusion: which formula you use depends on the motivation behind the question. Specifically, you need to clarify what you mean by "the given groups". Can you identify whether:

(a) you are interested in this particular sample of smokers and non-smokers, so you are trying to calculate summary statistics to describe these individuals;

or

(b) you are interested in the populations of smokers and non-smokers from which these samples were taken, so are trying to estimate the variability in sleep times in those populations?

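In case (a) you are describing the observed values themselves, so you divide by $n$ (this answer calls the result the sample variance, although many texts reserve that name for the $n-1$ version): $$ s^2 = \frac{1}{n} \sum_{i =1}^{n} (x_i - \bar x)^2.$$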

In case (b) your aim is to estimate the population variance $\sigma^2$ using this sample. The divisor-$n$ formula from case (a) is a biased estimator of the population variance: for independent observations its expected value is $\frac{n-1}{n}\sigma^2$, so on average it underestimates $\sigma^2$ for every finite $n$ (the bias shrinks as $n$ becomes large, but it is never exactly zero). We can correct for this bias by using the estimate:

$$\hat\sigma^2 = \frac{1}{n-1} \sum_{i =1}^{n} (x_i - \bar x)^2.$$
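Both formulas can be evaluated directly on the data from the question. The following is a minimal sketch in R, not a definitive recipe: the helper name `var_div_n` is hypothetical (introduced here only for illustration), while `var()` is base R's built-in function, which uses the $n-1$ divisor.

```r
# Data from the question (minutes to fall asleep)
smok <- c(69.3, 56.0, 22.1, 47.6, 53.2, 48.1,
          52.7, 34.4, 60.2, 43.8, 23.2, 13.8)
nons <- c(28.6, 25.1, 26.4, 34.9, 28.8,
          28.4, 38.5, 30.2, 30.6, 31.8,
          41.6, 21.1, 36.0, 37.9, 13.9)

# Hypothetical helper: variance with divisor n, as in case (a)
var_div_n <- function(x) mean((x - mean(x))^2)

# Case (a): describe these particular individuals
var_div_n(smok)
var_div_n(nons)

# Case (b): estimate the population variance (divisor n - 1)
var(smok)
var(nons)

# The two versions differ only by the factor (n - 1) / n
n_s <- length(smok)
all.equal(var_div_n(smok), var(smok) * (n_s - 1) / n_s)
```

With only 12 and 15 observations the two versions differ noticeably, which is why it matters to decide between (a) and (b) before computing anything.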

0
On

The sample variance has $n-1$ in the denominator, which makes it an unbiased estimator of the population variance.

Data:

smok = c(69.3, 56.0, 22.1, 47.6, 53.2, 48.1,
         52.7, 34.4, 60.2, 43.8, 23.2, 13.8)

Using the sample variance function in R:

var(smok)
[1] 286.5491

Formula: $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar X)^2,$ where $\bar X = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $n = 12$ for the smokers.

Using R as a calculator:

a = mean(smok);  a        # sample mean
[1] 43.7
sum(smok)/length(smok)    # sample mean again
[1] 43.7
n = length(smok);  n      # sample size
[1] 12
ss = sum((smok-a)^2);  ss # numerator of samp var
[1] 3152.04
ss/(n-1)                  # sample var again
[1] 286.5491

Now suppose I take a million samples of size $n=10$ from a normal population with $\mu = 50, \sigma^2 = 9, \sigma = 3.$ Make a vector v of the million sample variances. Finally, take the mean of v, which should be very nearly $\sigma^2 = 9.$ This illustrates unbiasedness, sometimes written as $E(V) = E(S^2) = \sigma^2.$

set.seed(2021) # for reproducibility
v = replicate(10^6, var(rnorm(10, 50, 3)))
mean(v)
[1] 9.005739  # nearly 9 as predicted.
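For contrast, the same kind of simulation can be run for the divisor-$n$ version, whose average should settle near $\frac{n-1}{n}\sigma^2 = 0.9 \times 9 = 8.1$ rather than $9.$ This is only a sketch: the smaller number of replications ($10^5$, to keep it quick) and the names `v_n1` and `v_n` are choices made here for illustration.

```r
set.seed(2021)  # for reproducibility
n <- 10         # sample size, as in the simulation above

# Sample variances (divisor n - 1) from 100,000 normal samples
v_n1 <- replicate(10^5, var(rnorm(n, 50, 3)))

# Rescale each one to the divisor-n version
v_n <- v_n1 * (n - 1) / n

mean(v_n1)  # should be close to sigma^2 = 9
mean(v_n)   # should be close to 0.9 * 9 = 8.1
```

The shortfall of about $\sigma^2/n$ in the second mean is the bias that the $n-1$ divisor removes.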

By contrast, let's look at a population variance. If the population consists of the $N = 6$ objects $\{1,2,3,4,5,6\},$ then the population mean is $$\mu = (1+2+3+4+5+6)/6 = 3.5.$$ And the population variance is $$\sigma^2=[(1-3.5)^2+(2-3.5)^2+\cdots+(6-3.5)^2]/6 =2.916667.$$

If the $N$ objects in a population are $X_1, X_2, \dots, X_N,$ then $\mu = \frac 1N\sum_{i=1}^N X_i$ and $\sigma^2 = \frac 1N \sum_{i=1}^N (X_i - \mu)^2.$
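Base R has no built-in population-variance function, and `var()` always divides by $n-1,$ so the population formula has to be written out. A minimal sketch for the six-object population above (the names `pop`, `mu`, and `sigma2` are illustrative):

```r
pop <- 1:6                       # the whole population, N = 6
N <- length(pop)

mu <- mean(pop)                  # population mean, 3.5
sigma2 <- mean((pop - mu)^2)     # population variance, divisor N
sigma2                           # 2.916667, matching the hand computation

var(pop)                         # 3.5: divisor N - 1, not the population variance
var(pop) * (N - 1) / N           # rescaling var() recovers sigma2
```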

Finally, you are interested in comparing variances of smokers and nonsmokers. I already have data for smok in R from the discussion above.

nons = c(28.6, 25.1, 26.4, 34.9, 28.8, 
         28.4, 38.5, 30.2, 30.6, 31.8, 
         41.6, 21.1, 36.0, 37.9, 13.9)

The sample variances from R are shown below. Also, one popular test of $H_0: \sigma_s^2/\sigma_n^2 = 1$ against $H_a: \sigma_s^2/\sigma_n^2 \ne 1$ uses the ratio of the sample variances as a test statistic.

var(smok);  var(nons)
[1] 286.5491
[1] 50.94695
var(smok)/var(nons)
[1] 5.62446

In R this test of equal variances is done as follows:

var.test(smok, nons)

        F test to compare two variances

data:  smok and nons
F = 5.6245, num df = 11, denom df = 14, p-value = 0.003438
alternative hypothesis: 
  true ratio of variances is not equal to 1
95 percent confidence interval:
  1.817514 18.891493
sample estimates:
ratio of variances 
           5.62446 
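To see where these numbers come from, the test can be reproduced from the F distribution with $n_s - 1 = 11$ and $n_n - 1 = 14$ degrees of freedom. This is a sketch under the usual assumption of two independent normal samples; the variable names are illustrative.

```r
smok <- c(69.3, 56.0, 22.1, 47.6, 53.2, 48.1,
          52.7, 34.4, 60.2, 43.8, 23.2, 13.8)
nons <- c(28.6, 25.1, 26.4, 34.9, 28.8,
          28.4, 38.5, 30.2, 30.6, 31.8,
          41.6, 21.1, 36.0, 37.9, 13.9)

f_stat <- var(smok) / var(nons)   # ratio of sample variances
df1 <- length(smok) - 1           # numerator df, 11
df2 <- length(nons) - 1           # denominator df, 14

# Two-sided p-value: twice the smaller tail probability
p_lower <- pf(f_stat, df1, df2)
p_value <- 2 * min(p_lower, 1 - p_lower)

# 95% confidence interval for the ratio of population variances
ci <- c(f_stat / qf(0.975, df1, df2),
        f_stat / qf(0.025, df1, df2))

f_stat; p_value; ci
```

These values should agree with the `var.test` output above.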

Because the variances are quite different, if you are testing whether the means of smokers and nonsmokers are significantly different, it is especially important to use the Welch 2-sample t test (instead of the pooled test).

t.test(smok, nons)

        Welch Two Sample t-test

data:  smok and nons
t = 2.5747, df = 14.127, p-value = 0.02191
alternative hypothesis: 
  true difference in means is not equal to 0
95 percent confidence interval:
   2.254755 24.638578
sample estimates:
mean of x mean of y 
 43.70000  30.25333 

This test uses sample variances, but does not display them.
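To make the role of the sample variances visible, the Welch statistic and its approximate degrees of freedom (the Welch-Satterthwaite formula) can be computed by hand. A minimal sketch, with illustrative variable names:

```r
smok <- c(69.3, 56.0, 22.1, 47.6, 53.2, 48.1,
          52.7, 34.4, 60.2, 43.8, 23.2, 13.8)
nons <- c(28.6, 25.1, 26.4, 34.9, 28.8,
          28.4, 38.5, 30.2, 30.6, 31.8,
          41.6, 21.1, 36.0, 37.9, 13.9)

# Squared standard errors of the two sample means
se2_s <- var(smok) / length(smok)
se2_n <- var(nons) / length(nons)

# Welch t statistic: difference in means over its estimated standard error
t_stat <- (mean(smok) - mean(nons)) / sqrt(se2_s + se2_n)

# Welch-Satterthwaite approximate degrees of freedom
df_welch <- (se2_s + se2_n)^2 /
  (se2_s^2 / (length(smok) - 1) + se2_n^2 / (length(nons) - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df_welch)

t_stat; df_welch; p_value
```

These values should agree with the `t.test` output above. Each group keeps its own variance here, rather than the two being pooled into a single estimate.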