Questions about two-sample t-tests

I'm a bit confused about two-sample t-tests, so I want to write out what I think I know and check whether it is correct. If anyone could tell me whether what I've written is true, that would be great.

What I'm mostly asking about is this: you have 2 samples from populations $X$ and $Y$ respectively, and you want to measure at some confidence interval some hypothesis about the difference of their means.

The part that's confusing me is that I know two methods of doing this and I'm not sure which should be applied where. The first one is:


1) Let me start by saying you're always given $\sum x, \sum y, \sum x^2, \sum y^2$. We calculate $\overline x, \overline y$, and we calculate the estimates for $Var(X)$ and $Var(Y)$ using the formula $$\frac{1}{n}(\sum x^2 -\frac{(\sum x)^2}{n}).$$ Once we have both variances, we define $Z=X-Y$, for example, and calculate $Var(Z)$ by $$Var(Z)=\frac{Var(X)}{n_x}+\frac{Var(Y)}{n_y}.$$ Then we get our t-value by $$t=\frac{\overline x - \overline y}{Var(Z)}$$ and simply check the t-table.


2) Method 2 goes by calculating the pooled estimate of population variance through:$$S^2=\frac{\sum (x-\overline x)^2+\sum (y - \overline y)^2}{n_x + n_y -2}. $$ Then once we have that we calculate the t value using: $$t=\frac{\overline x - \overline y}{S^2(\frac{1}{n_x}+\frac{1}{n_y})}$$


Now my understanding is that method 2 requires the assumption that the variances of $X$ and $Y$ are equal. Is that the only difference between the two? Is the second method usually more accurate if the variances are actually equal?

On BEST ANSWER

The language "you want to measure at some confidence interval some hypothesis about the difference of their means" confuses confidence intervals and hypothesis testing. You probably want to say 'significance level' instead of 'confidence interval'.

In the first method, your notation seems to conflate population and sample variances. If $Var(X)$ and $Var(Y)$ are the population variances of the two populations, then your displayed equation for $Var(Z)$ is correct, provided $Z$ is understood as the difference of the sample means $\bar X - \bar Y$ rather than $X - Y$. However, typically in practice, you cannot 'compute' any of these three variances because they would not be known. If they are known, then your last equation in (1) should be $z = (\bar x - \bar y)/SD(\bar X - \bar Y)$, where $SD(\bar X - \bar Y) = \sqrt{Var(Z)}$, because, under the null hypothesis of equal population means (and assuming normal data), this $z$ would have a standard normal distribution, not a t distribution.
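To make the known-variance case concrete, here is a minimal Python sketch of that $z$ statistic. The function name `two_sample_z` and the numbers in the example call are made up for illustration; the only assumption is that the population variances really are known.

```python
import math
from statistics import NormalDist

def two_sample_z(xbar, ybar, var_x, var_y, n_x, n_y):
    """z statistic for H0: equal population means, KNOWN population variances."""
    # Standard deviation of the difference of sample means.
    se = math.sqrt(var_x / n_x + var_y / n_y)
    z = (xbar - ybar) / se
    # Two-sided P-value from the standard normal distribution.
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Hypothetical inputs, chosen only so the arithmetic is easy to follow.
z, p = two_sample_z(xbar=10.0, ybar=9.0, var_x=4.0, var_y=9.0, n_x=16, n_y=36)
print(z, p)
```

Note that the denominator is a standard deviation (a square root), so the statistic is compared with the normal table, not the t-table.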

In (2), your $S^2$ (commonly denoted $S_p^2$) is the 'pooled' variance estimate, assuming that the two populations being compared have equal variances. However, the formula for the t statistic needs to have denominator $S_p\sqrt{1/n_x + 1/n_y}.$ As @MichaelHardy notes, such a t statistic can be used to test whether two population means are equal, assuming normality, independent random samples, and equal variances.
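Since you say you are given $\sum x, \sum y, \sum x^2, \sum y^2$, here is a rough Python sketch of the pooled statistic computed from those sums. It uses the identity $\sum (x-\bar x)^2 = \sum x^2 - (\sum x)^2/n_x$. The function name `pooled_t_from_sums` and the example data are made up for illustration.

```python
import math

def pooled_t_from_sums(sum_x, sum_x2, n_x, sum_y, sum_y2, n_y):
    """Pooled two-sample t statistic from summary sums (assumes equal variances)."""
    xbar = sum_x / n_x
    ybar = sum_y / n_y
    # Sums of squared deviations about each sample mean.
    ss_x = sum_x2 - sum_x**2 / n_x
    ss_y = sum_y2 - sum_y**2 / n_y
    df = n_x + n_y - 2
    sp2 = (ss_x + ss_y) / df                       # pooled variance S_p^2
    se = math.sqrt(sp2) * math.sqrt(1 / n_x + 1 / n_y)
    t = (xbar - ybar) / se
    return t, df, sp2

# Hypothetical data: x = 1,2,3,4,5 and y = 2,4,6,8,10.
t, df, sp2 = pooled_t_from_sums(15, 55, 5, 30, 220, 5)
print(t, df, sp2)
```

The resulting `t` is compared with a t-table at `df` = $n_x + n_y - 2$ degrees of freedom.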

Notice that t statistics must be unit-less quantities. If the x's and y's are in cm, then $\bar x - \bar y$ is in cm, but $S_p^2$ is in square cm. If the denominator has $S_p$ (in cm), then the units 'cancel', so that $t$ is a pure number without units.
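One quick way to see the units argument numerically is to change the measurement scale (say, cm to mm) and check which ratio stays the same. The sketch below does that on made-up data; `pooled_stats` is a hypothetical helper, not a library function.

```python
import math
from statistics import mean, variance

def pooled_stats(x, y):
    """Return (t with S_p in the denominator, ratio with S_p^2 in the denominator)."""
    n_x, n_y = len(x), len(y)
    sp2 = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    diff = mean(x) - mean(y)
    k = 1 / n_x + 1 / n_y
    t_correct = diff / (math.sqrt(sp2) * math.sqrt(k))  # unit-less
    ratio_wrong = diff / (sp2 * k)                      # has units of 1/length
    return t_correct, ratio_wrong

# Hypothetical measurements in cm, then the same data in mm.
x_cm = [1.0, 2.0, 3.0, 4.0, 5.0]
y_cm = [2.0, 4.0, 6.0, 8.0, 10.0]
x_mm = [10 * v for v in x_cm]
y_mm = [10 * v for v in y_cm]

t_cm, wrong_cm = pooled_stats(x_cm, y_cm)
t_mm, wrong_mm = pooled_stats(x_mm, y_mm)
print(t_cm, t_mm)          # same value in either unit
print(wrong_cm, wrong_mm)  # differ by a factor of 10
```

A test statistic should not depend on whether you measured in cm or mm, which is another way of seeing that the version with $S_p^2$ in the denominator cannot be right.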

Assuming equal variances without good reason is risky business. If you want a related t procedure that does not assume equal variances, then you should consider the Welch 'separate variances' t test. (Sometimes people say 'unequal variances' t test, but the test is used when one is not sure whether variances are equal or not.)
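For reference, here is a minimal Python sketch of the Welch statistic together with the usual Welch-Satterthwaite approximation for its degrees of freedom. The inputs are the sample means and the sample variances (with divisor $n-1$). The function name `welch_t` and the example numbers are made up for illustration.

```python
import math

def welch_t(xbar, s2_x, n_x, ybar, s2_y, n_y):
    """Welch t statistic and approximate degrees of freedom (no equal-variance assumption)."""
    vx = s2_x / n_x   # estimated variance of the sample mean of x
    vy = s2_y / n_y   # estimated variance of the sample mean of y
    t = (xbar - ybar) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation; generally not an integer.
    df = (vx + vy) ** 2 / (vx**2 / (n_x - 1) + vy**2 / (n_y - 1))
    return t, df

# Hypothetical summaries: x = 1,2,3,4,5 and y = 2,4,6,8,10.
t_w, df_w = welch_t(xbar=3.0, s2_x=2.5, n_x=5, ybar=6.0, s2_y=10.0, n_y=5)
print(t_w, df_w)
```

The approximate degrees of freedom always fall between $\min(n_x, n_y) - 1$ and $n_x + n_y - 2$, so when the sample variances differ a lot the Welch test uses fewer degrees of freedom than the pooled test. If SciPy is available, `scipy.stats.ttest_ind(x, y, equal_var=False)` performs this test on raw data.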