Random samples and the distribution w.r.t the population


I have spent a couple of hours reading about random sampling and distributions, and I think I have figured it out, but I am not 100% sure. So maybe someone could cross-check my claims :-)

Assume we have a Population $P$ where we are interested in e.g. the body weight per individual.

  • We assume that the distribution of the body weight of the population is $W_0$.
  • Let $W$ denote the body weight of an individual within this population (so, $W$ and $W_0$ should be identically distributed, right?)
  • Now, we assume that the observed body weight for the $i$th individual is $W_i$. Hence, $W_i$ is also a random variable and $W_i$ has the same distribution as $W$ since $W$ is the distribution of an individual of the population, correct?

The conclusions I draw, and about which I am not sure, are in bold. Thank you very much in advance!

On BEST ANSWER

You might believe that weights in a certain population are normally distributed (weights in pounds), with mean $\mu = 150$ and some unknown standard deviation $\sigma.$ To check this belief you might test $H_0: \mu = 150$ against $H_a:\mu \ne 150.$

You randomly sample $n = 100$ subjects from some normal population, obtaining weights $W_1, W_2, \dots, W_{100}$ with sample mean $\bar W = \frac 1n \sum_i W_i = 149.65$ and sample standard deviation $S = \sqrt{\frac{1}{n-1}\sum_i (W_i - \bar W)^2} = 20.05.$

stripchart(w, pch="|")

[Figure: strip chart of the 100 sampled weights]

So you didn't get exactly $\bar W = \mu = 150.$ The question is whether, considering the variability of weights in this population, the difference between $\bar W$ and $\mu = 150$ is due to chance or whether the difference is 'statistically significant' at the 5% level.

Then you could use a one-sample t test to decide the issue of statistical significance. In this case, the test statistic is $T = \frac{\bar W - 150}{S/\sqrt{n}} = -0.17681.$
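As a minimal sketch, the statistic can be computed by hand in R. This assumes the fictitious data are regenerated with the seed given in Note (1) further down; the names `n`, `se`, and `t_stat` are mine, chosen for illustration.

```r
# Recreate the fictitious sample (same seed and parameters as in the Notes)
set.seed(411)
w <- rnorm(100, 150, 20)

n  <- length(w)                  # sample size
se <- sd(w) / sqrt(n)            # estimated standard error of the sample mean
t_stat <- (mean(w) - 150) / se   # one-sample t statistic for H0: mu = 150
t_stat
```

The value should agree with the `t` reported by `t.test(w, mu = 150)`.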

You would reject $H_0$ at the 5% level of significance if $|T| \ge c = 1.984,$ where $T \sim \mathsf{T}(\nu = n-1 = 99),$ Student's t distribution with 99 degrees of freedom. The critical value $c=1.984$ cuts probability $0.025$ from the upper tail of the t distribution. [In R, qt is the quantile function (inverse CDF) of a t distribution.] We do not reject $H_0$ because $|T| = 0.177$ is far smaller than the critical value $c.$

Roughly speaking, a t statistic near $0$ is a way of saying that the sample mean $\bar W$ and the population mean $\mu$ are not remarkably different.

qt(.975, 99)
[1] 1.984217
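
The decision rule can be sketched directly from the observed statistic reported above. The variable names (`t_obs`, `dof`, `alpha`, `crit`, `reject`) are illustrative only.

```r
t_obs <- -0.17681   # observed t statistic, as reported above
dof   <- 99         # degrees of freedom, n - 1
alpha <- 0.05       # significance level of the two-sided test

crit   <- qt(1 - alpha / 2, dof)   # two-sided critical value
reject <- abs(t_obs) >= crit       # TRUE would mean: reject H0
reject
```

Changing `alpha` shows how the critical value, and hence the decision, depends on the chosen significance level.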

Also, the P-value of this test is the probability, assuming $H_0$ is true, that $T$ would be at least as extreme (in either the positive or the negative direction) as the observed value $-0.177.$ [In R, pt is the CDF of a t distribution.] We do not reject $H_0$ at the 5% level because the P-value exceeds $0.05 = 5\%.$

Roughly speaking, a P-value above 5% is another way of saying that the sample mean $\bar W$ and the population mean $\mu$ are not remarkably different.

pt(-0.17681, 99) + 1 - pt(0.17681, 99)
[1] 0.8600188
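
Because the t distribution is symmetric about 0, the same two-sided P-value can be written more compactly. A small sketch; `two_sided_p` is a hypothetical helper name, not a built-in R function.

```r
# Two-sided P-value for an observed t statistic with `dof` degrees of freedom.
# By symmetry, the two tail areas are equal, so double the lower tail.
two_sided_p <- function(t_obs, dof) {
  2 * pt(-abs(t_obs), dof)
}

p_val <- two_sided_p(-0.17681, 99)
p_val
```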

All of this (except for the critical value $c$ of a test at the 5% level) is shown in the output of the procedure t.test in R, as follows.

t.test(w, mu = 150)

        One Sample t-test

data:  w
t = -0.17681, df = 99, p-value = 0.86
alternative hypothesis: true mean is not equal to 150
95 percent confidence interval:
 145.6676 153.6235
sample estimates:
mean of x 
 149.6455 

In addition, the output of t.test shows a 95% confidence interval $(145.67,\, 153.62).$ This indicates that $\mu = 150$ (inside the interval) is a believable value of the population mean based on what we see in the sample.
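
The interval is $\bar W \pm c\,S/\sqrt{n}$ with the same critical value $c$ as in the test. A minimal sketch of that computation, assuming the data are regenerated with the seed from Note (1); `half_width` and `ci` are illustrative names.

```r
# Recreate the fictitious sample (same seed and parameters as in the Notes)
set.seed(411)
w <- rnorm(100, 150, 20)

n <- length(w)
half_width <- qt(0.975, n - 1) * sd(w) / sqrt(n)     # c * S / sqrt(n)
ci <- c(mean(w) - half_width, mean(w) + half_width)  # 95% t interval
ci
```

The endpoints should match the interval printed by `t.test`.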

Below is a plot of the density function of the distribution $\mathsf{T}(99).$ The observed value of $T = -0.1768$ is shown as a solid vertical line. The P-value is twice the area under the curve to the left of this line. (The dotted vertical line is as far from 0 as the solid one.)

The critical values, $\pm c = \pm 1.984$ are shown as vertical red dashed lines.

[Figure: density of T(99), with the observed t statistic and the critical values marked]

R code for figure:

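hdr="Density of T(99)"
# Density curve of Student's t distribution with 99 degrees of freedom
curve(dt(x, 99), -4, 4, ylab="PDF", xlab="t",
      col="blue", lwd=2, main=hdr)
abline(h=0, col="green2")            # horizontal axis
abline(v=0, col="green2")            # vertical reference line at t = 0
abline(v = -0.1768, lwd=2)           # observed t statistic (solid)
abline(v = 0.1768, lty = "dotted")   # mirror image of the observed value
abline(v = c(-1.9842,1.9842), col="red", lty="dashed")  # critical values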

Notes: (1) In case it is of interest, here is R code used to sample the fictitious data used above:

set.seed(411)
w = rnorm(100, 150, 20)
mean(w);  sd(w)
[1] 149.6455
[1] 20.04775

(2) If you had advance information that the standard deviation of weights in this population is $\sigma = 20,$ then you could do a z test instead of a t test.

Some people seem to think it is OK to do a z test anytime $n > 30,$ using the sample SD $S$ as if it were the same as $\sigma$ (rarely exactly true). That's a somewhat risky approximate procedure.
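
One way to see the risk: the t critical value is always larger than the standard normal one, so plugging in $S$ while using normal critical values makes the test reject somewhat more often than the nominal 5%. A rough sketch comparing the two; the sample sizes in `n` are arbitrary choices of mine.

```r
n <- c(10, 31, 100, 1000)     # illustrative sample sizes (arbitrary choices)

z_crit <- qnorm(0.975)        # normal critical value, the same for every n
t_crit <- qt(0.975, n - 1)    # t critical values, one per sample size

# Side-by-side comparison; the gap shrinks as n grows but never reaches 0
data.frame(n = n, t_crit = t_crit, z_crit = z_crit)
```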

For my particular fictitious data, it happens that an approximate z test would not have led to rejecting $H_0.$ Base R does not have a named procedure for z tests. In case it is of interest, I'll show results from a z test from Minitab statistical software below.

One-Sample Z 

Test of μ = 150 vs ≠ 150
The assumed standard deviation = 20

  N    Mean  SE Mean       95% CI           Z      P
100  149.65     2.00  (145.73, 153.57)  -0.17  0.861
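
For comparison, the same z test can be sketched by hand in base R. This assumes the known $\sigma = 20$ from Note (2) and the data regenerated with the seed from Note (1); `z_stat`, `p_z`, and `ci_z` are illustrative names. Small differences in the last displayed digit from the output above are possible if the software worked from rounded inputs.

```r
# Recreate the fictitious sample (same seed and parameters as in the Notes)
set.seed(411)
w <- rnorm(100, 150, 20)

sigma <- 20                        # population SD, assumed known
se_z  <- sigma / sqrt(length(w))   # exact standard error of the mean

z_stat <- (mean(w) - 150) / se_z                    # z statistic for H0: mu = 150
p_z    <- 2 * pnorm(-abs(z_stat))                   # two-sided P-value
ci_z   <- mean(w) + c(-1, 1) * qnorm(0.975) * se_z  # 95% z interval

z_stat; p_z; ci_z
```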