Basic questions about 2-sample t-test.


I am learning statistics so please bear with me in answering my questions. There is a common example in the statistics tutorials about the calorie content of the different brands of beef and poultry hotdogs.

Beef hotdogs: 186, 181, 176, 149, 184, 190, 158, 139, 175, 148, 152, 111, 141, 153, 190, 157, 131, 149, 135, 132

Poultry hotdogs: 129, 132, 102, 106, 94, 102, 87, 99, 170, 113, 135, 142, 86, 143, 152, 146, 144

The question is: can it be inferred, at significance level 0.05, that the calorie content of poultry hotdogs is lower than that of beef hotdogs?

To answer this, I first calculated the mean and standard deviation of each sample and then used a two-sample t-test. I have some questions:

  1. To calculate $t_\alpha$, is it necessary to use the pooled standard deviation? In other words, which of the following formulas should be used?

    $t_\alpha = \frac{\mu_1 - \mu_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$

    or

    $t_\alpha = \frac{\mu_1 - \mu_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

    I used both and found that the results differed by less than 0.1, which does not matter for this question. But which one is standard? I have seen both used in the same situation in different tutorials.

  2. To set $H_0$ and $H_1$, given what the question asks, which of the following is correct?

    • $H_0$ : "The calorie content of poultry hot dogs is not lower than the calorie content of beef hot dogs." and $H_1$ : "The calorie content of poultry hot dogs is lower than the calorie content of beef hot dogs."

    • $H_0$ : "There is no difference in the means of two samples." and $H_1$ : "There is a difference in the means of two samples."

    I ask because the question uses the word "lower", and the two bullet points differ. In the second bullet point, $H_1$ says only that there is a difference, which could mean either "the calorie content of poultry hot dogs is lower than the calorie content of beef hot dogs" or "the calorie content of poultry hot dogs is greater than the calorie content of beef hot dogs," whereas in the first bullet point $H_1$ clearly states "The calorie content of poultry hot dogs is lower than the calorie content of beef hot dogs." I think the choice between these ways of setting $H_0$ determines whether to use a one-tailed or a two-tailed test.

  3. The question continues: "After mixing a certain food for calves, calories observed on the same sample of 17 brands of beef hotdogs as used above are: 181, 191, 186, 129, 178, 194, 139, 122, 195, 158, 158, 104, 132, 185, 168, 148, 123 Can it be inferred that mix of food has an effect in calorie of the beef hotdog?"

    I think that here $H_0$ must obviously be "the food mix has no effect" and $H_1$ "the food mix has an effect."

Best answer

Welch and pooled two-sample t tests. There are two kinds of two-sample t tests.

(a) The pooled test, in which it is assumed that the two population variances are equal and the pooled variance is $S_P^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2},$ a degrees-of-freedom weighted average of the two sample variances. This gives rise to your second formula for the (pooled) t statistic. Under the null hypothesis the t statistic has Student's t distribution with $\nu = n_1+n_2-2$ degrees of freedom.

(b) By contrast, the Welch 2-sample t test does not assume that the two populations have the same variance. This gives rise to your first formula for the (Welch) t statistic, in which the two variance estimates $S_1^2$ and $S_2^2$ are used separately. If $n_1 = n_2,$ one can show that the two t statistics are equal. However, in the Welch test, under $H_0,$ the t statistic has approximately Student's t distribution with $\nu^\prime$ degrees of freedom, where $\nu^\prime$ is determined by a formula containing $n_1, n_2, S_1^2, S_2^2,$ giving $\min(n_1-1,n_2 -1) \le \nu^\prime \le n_1+n_2-2.$ (If $S_1^2\approx S_2^2,$ then $\nu^\prime$ is nearer the larger possible value.)
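
For reference, the formula usually meant here is the Welch-Satterthwaite approximation. Written in the notation above (it is not spelled out in the original answer, so take it as the standard textbook form rather than a quotation):

$$\nu^\prime = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{(S_1^2/n_1)^2}{n_1-1} + \frac{(S_2^2/n_2)^2}{n_2-1}}.$$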

The Welch test is preferred unless we have substantial prior knowledge that $\sigma_1^2 \approx \sigma_2^2.$
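
To make the difference between (a) and (b) concrete, here is a minimal sketch that computes both t statistics and their degrees of freedom by hand for the hot dog data. The variable names (`sp2`, `se2`, `t_pooled`, `t_welch`, `df_welch`) are my own, and the Welch degrees of freedom use the standard Welch-Satterthwaite approximation.

```r
# Hot dog calorie data from the question
b <- c(186, 181, 176, 149, 184, 190, 158, 139, 175, 148,
       152, 111, 141, 153, 190, 157, 131, 149, 135, 132)
p <- c(129, 132, 102, 106, 94, 102, 87, 99, 170, 113,
       135, 142, 86, 143, 152, 146, 144)

n1 <- length(b); n2 <- length(p)
v1 <- var(b);    v2 <- var(p)        # sample variances S_1^2 and S_2^2
diff_means <- mean(b) - mean(p)

# (a) Pooled test: df-weighted average of the two sample variances
sp2       <- ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled  <- diff_means / sqrt(sp2 * (1 / n1 + 1 / n2))
df_pooled <- n1 + n2 - 2

# (b) Welch test: variances used separately, approximate df
se2      <- v1 / n1 + v2 / n2
t_welch  <- diff_means / sqrt(se2)
df_welch <- se2^2 / ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

round(c(t_pooled = t_pooled, df_pooled = df_pooled,
        t_welch = t_welch, df_welch = df_welch), 4)
```

These values should agree with the `t` and `df` lines that `t.test` reports for the same data (t = 4.3455 with df = 35 for the pooled test, t = 4.3031 with df = 32.394 for the Welch test).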

One and two-sided alternatives. In R, for your initial hot dog data, the one-sided Welch t test for $H_0: \mu_b = \mu_p$ against $H_a: \mu_b > \mu_p$ looks as follows:

Data and description:

b = c(186, 181, 176, 149, 184, 190, 158, 139, 175, 148, 
      152, 111, 141, 153, 190, 157, 131, 149, 135, 132)

p = c(129, 132, 102, 106, 94, 102, 87, 99, 170, 113, 
      135, 142, 86, 143, 152, 146, 144)

summary(b); length(b);  sd(b)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  111.0   140.5   152.5   156.8   177.2   190.0 
[1] 20         # sample size beef
[1] 22.64201   # sample SD beef
summary(p); length(p);  sd(p)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   86.0   102.0   129.0   122.5   143.0   170.0 
[1] 17
[1] 25.48313

boxplot(b, p, col="skyblue2", names=c("b","p"))

[Figure: side-by-side boxplots of the beef (b) and poultry (p) calorie data]

One-sided Welch test: The P-value (about 0.00007) shows that we can reject $H_0$ at the 5% level asked for in the question, and even at the 1% level.

t.test(b, p, alt="g")  # parameter 'g'=greater (1-sided)

        Welch Two Sample t-test

data:  b and p
t = 4.3031, df = 32.394, p-value = 7.276e-05
alternative hypothesis: 
 true difference in means is greater than 0
95 percent confidence interval:
 20.85096      Inf
sample estimates:
mean of x mean of y 
 156.8500  122.4706 

Two-sided Welch test: If you test $H_0: \mu_b = \mu_p$ against the two-sided alternative ("no difference" against "difference"), then the P-value is twice the P-value for the one-sided test above, which is still very highly significant. Also, this test (run without the parameter alt = "g") gives a 95% CI with both a lower and an upper bound for the difference between the two population means.
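
As a quick check of the doubling claim, the following sketch runs both versions of the Welch test and compares the P-values. The object names `one_sided` and `two_sided` are just illustrative.

```r
# Hot dog calorie data from the question
b <- c(186, 181, 176, 149, 184, 190, 158, 139, 175, 148,
       152, 111, 141, 153, 190, 157, 131, 149, 135, 132)
p <- c(129, 132, 102, 106, 94, 102, 87, 99, 170, 113,
       135, 142, 86, 143, 152, 146, 144)

one_sided <- t.test(b, p, alternative = "greater")  # H_a: mu_b > mu_p
two_sided <- t.test(b, p)                           # default: two-sided Welch test

# The two-sided P-value should be twice the one-sided one
c(one_sided = one_sided$p.value, two_sided = two_sided$p.value)
all.equal(two_sided$p.value, 2 * one_sided$p.value)

# Two-sided 95% CI for mu_b - mu_p (finite lower and upper bounds)
two_sided$conf.int
```

The doubling works here because the observed t statistic lies in the direction of the one-sided alternative; if it pointed the other way, the one-sided P-value would exceed 0.5 and the relationship would not hold.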

One-sided pooled test: Here is the pooled t test for the same data. Notice that in R the Welch test is the default. If you feel sure that the two population variances must be nearly equal, then you can do the pooled t test by using the parameter var.eq=T. Notice that DF = 20 + 17 - 2 =35.

t.test(b, p, alt="g", var.eq=T)

        Two Sample t-test

data:  b and p
t = 4.3455, df = 35, p-value = 5.683e-05
alternative hypothesis: 
 true difference in means is greater than 0
95 percent confidence interval:
 21.01239      Inf
sample estimates:
mean of x mean of y 
 156.8500  122.4706 

Second data set. The analysis is similar and I will leave it to you. I think you are correct that you should do a two-sided test.
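
If it helps as a starting point, here is one possible setup, offered as a sketch under an assumption rather than as the intended solution: the 20 original beef values cannot be matched one-to-one with the 17 new values, so the two sets are treated as independent samples and compared with a two-sided Welch test. If the brand-by-brand pairing were known, a paired t test (`paired = TRUE`) would be the natural choice instead. The name `b2` is mine.

```r
# Original beef data and the data observed after the feed change
b  <- c(186, 181, 176, 149, 184, 190, 158, 139, 175, 148,
        152, 111, 141, 153, 190, 157, 131, 149, 135, 132)
b2 <- c(181, 191, 186, 129, 178, 194, 139, 122, 195, 158,
        158, 104, 132, 185, 168, 148, 123)

# H_0: no effect (equal means) against H_a: some effect (means differ).
# Assumption: the samples are treated as independent, not paired.
res <- t.test(b, b2)   # two-sided Welch test is the default
res
```

Compare the reported P-value with 0.05 to decide whether to reject $H_0.$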