Proof and precise formulation of Welch-Satterthwaite equation


In my statistics course notes the Welch-Satterthwaite equation, as used in the derivation of the Welch test, is formulated as follows:

Suppose $S_1^2, \ldots, S_n^2$ are sample variances of $n$ samples, where the $k$-th sample is a sample of size $m_k$ from a population with distribution $N(\mu_k, \sigma_k^2)$. For any real numbers $a_1, \ldots, a_n$, the statistic $$ L = \frac{\nu}{\sum_{k=1}^{n} a_k\sigma_k^2}\sum_{k=1}^n a_kS_k^2 $$ with $$ \nu = \frac{\left(\sum_{k=1}^n a_kS_k^2\right)^2}{\sum_{k=1}^n \frac{(a_kS_k^2)^2}{m_k - 1}} $$ approximately (sic) follows a chi-squared distribution with $\nu$ degrees of freedom.

Doing a quick Google search, most sources seem to follow a similar formulation. I have a few questions, though:

  1. How can the number of degrees of freedom of a chi-squared distribution depend on the statistics $S_i$? Shouldn't the number of degrees of freedom be a constant?
  2. What does 'approximately follow a distribution' mean? This is closely related to my third question:
  3. How can one justify this approximation? All 'proofs' I have seen on the internet seem to assume that $L$ follows a chi-squared distribution and then show that $\nu$ must have the stated value, using basic properties of expected value and variance. Even those arguments are confusing to me, since they seem to ignore the fact that the $S_i$ are themselves random variables and not known constants.

I found a link to the original papers by Welch and Satterthwaite, which I can't access freely. I wonder if there is a reasonably short answer to at least the first two of my questions. We students are not expected to be able to prove the equation, so it's not that bad if there is no readily available proof, but I'd like to at least understand the statement itself.

There are 3 answers below.

Answer 1 (score 0)

It has been noted in another answer that the first and second moments match between the linear combination of chi-squared variables and the Welch-Satterthwaite approximation. I think I can provide some additional perspective.

Namely:

  • Conditions under which the Welch-Satterthwaite approximation is exactly correct
  • A bound for the $\nu$ (the degrees of freedom of the statistic) mentioned in the question

To that end, I will define $\nu$ slightly differently (with population variances in place of the sample variances): $$ \nu = \frac{\left(\sum_{k=1}^m a_k \sigma_k^2\right)^2}{\sum_{k=1}^m \frac{(a_k \sigma_k^2)^2}{n_k - 1}} $$

Setup

Unfortunately, I will have to state some notation first:

Define $$ M := \sum_{k=1}^{m} a_k S^2_k $$ where $S^2_k$ is the unbiased sample variance for the $k$-th group of data. A well-known result is $\frac{(n_{k}-1)S_{k}^2}{\sigma_k^2} \sim \chi^2_{n_k-1}$. Thus $M$ is a linear combination of $\chi^2$ random variables, and we can re-express it (with $X_k \sim \chi^2_{n_k-1}$) as

$$ M = \sum_{k=1}^m \frac{a_k\sigma_k^2}{n_k-1} X_k $$ (this will be useful later).
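The scaling fact $\frac{(n_k-1)S_k^2}{\sigma_k^2} \sim \chi^2_{n_k-1}$ can be spot-checked numerically. A minimal Python sketch (group size and variance are arbitrary, hypothetical choices): the scaled sample variance should have the mean $n-1$ and variance $2(n-1)$ of a $\chi^2_{n-1}$ variable.

```python
# Empirical check that (n-1) * S^2 / sigma^2 behaves like chi^2_{n-1}:
# its mean should be close to n - 1 and its variance close to 2(n - 1).
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 8, 4.0                       # hypothetical group size and variance
samples = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))
s2 = samples.var(axis=1, ddof=1)         # unbiased sample variances, one per row
scaled = (n - 1) * s2 / sigma2

print(scaled.mean())   # ≈ n - 1 = 7
print(scaled.var())    # ≈ 2(n - 1) = 14
```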

The idea of the Welch-Satterthwaite approximation is to approximate $M$ with a single scaled chi-squared random variable. Define a random variable $L$ (contrast this with the statistic $L$ in the original post) such that

$$ \frac{\nu L}{ \alpha } \sim \chi_\nu^2, \text{ where } \alpha := \sum^m_{k=1} a_k \sigma^2_k $$ Note, $\nu $ was already defined above, but now we can also rewrite it as

$$ \nu = \frac{\alpha^2}{\sum_{k=1}^m \frac{(a_k \sigma^2_k)^2}{n_k-1} } $$

When the Welch-Satterthwaite approximation is exactly correct (i.e. $M$ and $L$ have the same distribution)

Let's impose the condition that $\frac{a_k \sigma_k^2}{n_k-1} = c$ for all $k$, where $c$ is a positive constant. Note that the quantity being held constant is the scaling factor of each chi-squared rv in the expression for $M$.

Scaled chi-squared rvs are gamma distributed; more precisely, $ c \cdot \chi_\nu^2 \sim \Gamma(\nu/2, 2c)$ (shape-scale parameterization). Since the scale parameters are all equal under our condition,

$$ M \sim \Gamma ( \sum_{k=1}^m \nu_k / 2 , 2c) $$

(where $\nu_k = n_k -1$ are the degrees of freedom for the individual chi-squared rvs $X_k$)

Using the relationship between chi-squared and gamma distributed rvs once more,

$$ \frac{M}{c} \sim \chi^2_{\sum_{k=1}^m \nu_k} $$

Since, under our condition, $a_k \sigma_k^2 = c\,\nu_k$, we have $$ \alpha = c \sum_{k=1}^m (n_k - 1) $$ and the denominator of $\nu$ equals $\sum_{k=1}^m c^2 \nu_k = c\,\alpha$, so

$$ \nu = \frac{\alpha^2}{c\,\alpha} = \frac{\alpha}{c} = \sum_{k=1}^m (n_k - 1), $$ which is the sum of the degrees of freedom of the individual chi-squared rvs. So the degrees of freedom match those of $\frac{\nu L}{\alpha}$.

Also note $c = \frac{\alpha}{\nu}$, so $$ \frac{M}{c} = \frac{\nu M }{\alpha} \sim \chi^2_\nu $$

which is the same as our defined $L$.
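This exactness can be spot-checked numerically. A minimal sketch, with hypothetical group sizes and weights chosen to satisfy the condition $\frac{a_k \sigma_k^2}{n_k - 1} = c$:

```python
# Numerical spot-check of the exactness claim: when a_k * sigma_k^2 / (n_k - 1)
# equals a common constant c, nu * M / alpha is an exact chi^2 sample with
# nu = sum(n_k - 1) degrees of freedom. All parameter choices are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
c = 0.5
nu_k = np.array([4, 8, 12])          # n_k - 1 for three groups
w = c * nu_k                         # w_k = a_k * sigma_k^2, chosen to satisfy the condition

alpha = w.sum()
nu = alpha**2 / np.sum(w**2 / nu_k)
print(nu)                            # 24.0 = sum(nu_k)

# M = sum_k (w_k / nu_k) X_k with independent X_k ~ chi^2_{nu_k}
X = np.column_stack([rng.chisquare(df, size=50_000) for df in nu_k])
M = X @ (w / nu_k)

Z = nu * M / alpha                   # should behave like an exact chi^2_24 sample
print(Z.mean(), Z.var())             # ≈ 24 and ≈ 48, the chi^2_24 moments
```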

Bounds for $\nu$

There is a bound for $\nu$:

$$ \min_k{\nu_k} \lt \nu \le \sum_{k=1}^m \nu_k $$

We have already seen in what scenario the inequality on the right is an equality. It is the scenario in which the Welch-Satterthwaite approximation is exactly correct.

First the left inequality; here we assume $m \ge 2$ and $a_k \sigma_k^2 \gt 0$ for all $k$. Recall,

$$ \nu = \frac{\alpha^2}{\sum_{k=1}^m \frac{(a_k \sigma^2_k)^2}{\nu_k} } $$

Let $$ \nu^* := \min_k{\nu_k} $$ Then $$ \nu \ge \frac{\alpha^2}{\frac{1}{\nu^*} \sum_{k=1}^m (a_k \sigma^2_k)^2 } \gt \frac{\alpha^2}{\frac{1}{\nu^*} \big(\sum_{k=1}^m a_k \sigma^2_k\big)^2 } = \nu^* $$ where the strict inequality holds because $\sum_k x_k^2 \lt \big(\sum_k x_k\big)^2$ for $m \ge 2$ positive numbers $x_k = a_k \sigma_k^2$.

The right inequality can be proven with Cauchy-Schwarz. Define vectors:

$$ u := ( \sqrt{\nu_1} \dots \sqrt{\nu_m}) $$

$$ v := \Big(\frac{a_1 \sigma_1^2}{\sqrt{\nu_1}} \dots \frac{a_m \sigma_m^2}{\sqrt{\nu_m}} \Big) $$

with Cauchy-Schwarz we have $$ (u \cdot v)^2 \le (u\cdot u) (v \cdot v) $$ $$ \implies \alpha^2 = \Big( \sum_{k=1}^m a_k \sigma_k^2 \Big)^2 \le \Big(\sum_{k=1}^m \nu_k \Big) \Big( \sum_{k=1}^m \frac{(a_k \sigma_k^2)^2}{\nu_k} \Big) $$ $$ \implies \nu = \frac{\alpha^2}{\sum_{k=1}^m \frac{(a_k \sigma_k^2)^2}{\nu_k}} \le \sum_{k=1}^m \nu_k $$
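Both inequalities can be spot-checked over random configurations. A minimal sketch; the ranges of the randomized $m$, $\nu_k$, and weights $a_k \sigma_k^2$ are arbitrary choices:

```python
# Randomized spot-check of the bound min_k nu_k < nu <= sum_k nu_k for the
# population-variance version of nu, with positive weights w_k = a_k * sigma_k^2.
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    m = int(rng.integers(2, 6))               # number of groups (m >= 2)
    nu_k = rng.integers(2, 30, size=m)        # nu_k = n_k - 1
    w = rng.uniform(0.1, 5.0, size=m)         # w_k > 0
    nu = w.sum()**2 / np.sum(w**2 / nu_k)
    assert nu_k.min() < nu <= nu_k.sum() + 1e-9
print("bound held in all 1000 trials")
```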

Conclusions

There is a scenario in which the Welch-Satterthwaite approximation is exactly correct, and there is a (not very tight) bound on the degrees of freedom. It was mentioned elsewhere that the 1st and 2nd moments match (so I do not prove this here), and that the $\sigma_k^2$ terms are typically unknown and thus replaced with the $S_k^2$, which is why the $\nu$ stated in my post differs from the one in the question.

It should be noted that this still does not tell us how "good" the Welch-Satterthwaite approximation is when our condition ($\frac{a_k \sigma_k^2}{n_k-1} = c$ for all $k$) does not hold.

Answer 2 (score 2)

The following comes from my mulling over the approximation Welch points out in the paper you cite, with some more explicit mathematical manipulation and points of explanation.

The Setup

Let us consider a case where we are estimating the difference in population means between two independent normally distributed random variables $A_1 \sim N(0, \sigma_1^2)$ and $A_2 \sim N(0, \sigma_2^2)$, and we use the unbiased estimators $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$ for the unknown variances. Note that, assuming we have $n_1$ and $n_2$ independent and identically distributed observations from $A_1$ and $A_2$ respectively, and writing $f_i = n_i - 1$, $$\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2} \sim \frac{\sigma_1^2}{n_1f_1}\chi^2_{f_1} + \frac{\sigma_2^2}{n_2f_2}\chi^2_{f_2}$$ -- this will be relevant for our "big reveal" at the end.

The Derivation

We start a bit more generally. Let us define a random variable $X = a\chi^2_{f_1} + b\chi^2_{f_2}$ (with the two chi-squared variables independent). Note that $E[X] = af_1 + bf_2$ and $Var(X) = 2(a^2f_1 + b^2f_2)$.

Rather than calculate $X$'s probability distribution exactly, say via convolution, we'd like to approximate $X$'s probability distribution using a chi-square distribution with appropriate parameters and possibly with some rescaling of our input; i.e., we'd like to find $f,g$ such that

$$f_X(x)\,dx \approx f_Y\left(\frac{x}{g}\right)d\left(\frac{x}{g}\right)$$

and $Y \sim \chi^2_f$. In particular (remembering to take into account the scaling factor $g$), due to random variables being completely characterized by their CDFs (and in this case for absolutely continuous r.v's, by their PDFs), this would mean $\frac{X}{g} \approx Y \sim \chi^2_f$.

How do we make the two random variables "approximately equal" and determine $f$ and $g$? By making the first few moments of $X$ and $gY$ equal. Using the substitution $z = y/g$, $dz = dy/g$, we can exploit our knowledge of the ordinary chi-square distribution and find that $E[gY] = gf$ and $Var(gY) = 2g^2f$. Setting $E[X] = E[gY]$ and $Var(X) = Var(gY)$ and solving for $f$ and $g$, we find that

$$f = \frac{(af_1 + bf_2)^2}{a^2f_1 + b^2f_2}, g = \frac{a^2f_1 + b^2f_2}{af_1+bf_2}$$
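One can check that these $f$ and $g$ make the first two moments of $g\,\chi^2_f$ coincide exactly with those of $X$. A small sketch with hypothetical constants $a$, $b$, $f_1$, $f_2$:

```python
# Moment-matching check: for X = a*chi2_{f1} + b*chi2_{f2}, the f and g above
# make E[g*Y] and Var(g*Y), with Y ~ chi^2_f, equal E[X] and Var(X) exactly.
a, b, f1, f2 = 0.7, 1.3, 4, 9          # hypothetical constants

f = (a*f1 + b*f2)**2 / (a**2*f1 + b**2*f2)
g = (a**2*f1 + b**2*f2) / (a*f1 + b*f2)

mean_X, var_X = a*f1 + b*f2, 2*(a**2*f1 + b**2*f2)
mean_gY, var_gY = g*f, 2*g**2*f        # moments of g * chi^2_f

print(abs(mean_X - mean_gY) < 1e-9, abs(var_X - var_gY) < 1e-9)  # True True
```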

Now, the payoff! We saw earlier that for the estimated sample variance, we have $a = \frac{\sigma_1^2}{n_1f_1}$ and $b = \frac{\sigma_2^2}{n_2f_2}$. Using the fact that $f_i = n_i - 1$, we can plug in these values to our equations above and get that

$$ f = \frac{(\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2})^2}{\frac{1}{n_1-1}(\frac{\sigma_1^2}{n_1})^2 + \frac{1}{n_2-1}(\frac{\sigma_2^2}{n_2})^2}$$

We're still in a pickle, since we don't know $\sigma_1^2$ or $\sigma_2^2$. But we have unbiased, consistent estimators for both in the form of $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$. The formula you get is obtained by "plugging in" these estimators for the parameters. By Slutsky's theorem, the plug-in estimate converges to the "correct" degrees of freedom as the sample sizes grow. More importantly for the small-sample cases where one would actually use this formula, the estimators are unbiased. (I'll leave calculating the variance of this estimator as "an exercise for the reader".)
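For concreteness, here is a sketch of the plug-in computation for two samples; the sample sizes and variances are arbitrary choices, and `welch_df` is a hypothetical helper, not a standard API:

```python
# Plug-in Welch-Satterthwaite degrees of freedom: sample variances stand in
# for the unknown sigma_1^2 and sigma_2^2.
import numpy as np

def welch_df(x, y):
    """Estimated degrees of freedom f with sample variances plugged in."""
    v1 = x.var(ddof=1) / len(x)        # estimate of sigma_1^2 / n_1
    v2 = y.var(ddof=1) / len(y)        # estimate of sigma_2^2 / n_2
    return (v1 + v2)**2 / (v1**2 / (len(x) - 1) + v2**2 / (len(y) - 1))

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=10)      # n_1 = 10, sigma_1 = 1
y = rng.normal(0.0, 3.0, size=15)      # n_2 = 15, sigma_2 = 3
df = welch_df(x, y)
print(df)  # lies in (min(n1-1, n2-1), (n1-1) + (n2-1)] = (9, 23]
```

The bound from the other answer applies to the plug-in version as well, since the same algebra goes through with $S_k^2$ in place of $\sigma_k^2$.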

The above argument was made for approximating the distribution of $\sum_{i=1}^{k} a_i\chi^2_{f_i}$ with $k=2$, but it generalizes quite simply for any $k \in \mathbb{N}$. This generalization leads to the formulas you provide.

Bonus point: Relation to t-statistic.

How does this relate back to the t-test (which is a common situation in which the Welch-Satterthwaite formula is used)? We have to remember a few things.

Recall that in our setup, we had $n_1$ and $n_2$ observations from Gaussian random variables $A_1$ and $A_2$. Denote their samples means as $\bar{A_1}$ and $\bar{A_2}$ respectively; then $\bar{A_1} - \bar{A_2} \sim N(0, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2})$. For brevity, let's define $\bar{\sigma}^2 :=\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$ and $\hat{\bar{\sigma}}^2 := \frac{\hat{\sigma_1}^2}{n_1} + \frac{\hat{\sigma_2}^2}{n_2}$.

Recall also that a random variable $\frac{Z}{\sqrt{\chi_k^2/k}}$ (where $Z \sim N(0,1)$) follows a t-distribution with $k$ degrees of freedom.

Also, given our earlier formulas for $f$ and $g$, note that in this particular instance $fg = \bar{\sigma}^2$,

and recall that our derivation shows $\hat{\bar{\sigma}}^2/g$ is approximately distributed as $\chi_f^2$ for the $f$ shown above.

Finally, we can consider the t-statistic:

$$\frac{\bar{A_1} - \bar{A_2}}{\sqrt{\hat{\bar{\sigma}}^2}} \times \frac{\frac{1}{\bar{\sigma}}}{\frac{1}{\sqrt{fg}}} = \frac{\frac{\bar{A_1} - \bar{A_2}}{\bar{\sigma}}} {\sqrt{\frac{\hat{\bar{\sigma}}^2}{fg}}} \approx \frac{Z}{\sqrt{\chi_f^2/f}} \sim t_f$$
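As a sanity check, the plug-in version of this t-statistic and its degrees of freedom reproduce what `scipy.stats.ttest_ind` computes with `equal_var=False` (Welch's t-test); the sample sizes and variances below are arbitrary choices:

```python
# Welch t-statistic and Welch-Satterthwaite df computed by hand, compared
# against scipy's implementation of Welch's t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=12)
y = rng.normal(0.0, 2.0, size=20)

v1, v2 = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
t = (x.mean() - y.mean()) / np.sqrt(v1 + v2)                   # t-statistic
f = (v1 + v2)**2 / (v1**2 / (len(x) - 1) + v2**2 / (len(y) - 1))  # plug-in df
p = 2 * stats.t.sf(abs(t), df=f)                               # two-sided p-value

res = stats.ttest_ind(x, y, equal_var=False)
print(np.isclose(t, res.statistic), np.isclose(p, res.pvalue))  # True True
```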

Directly answering the questions.

  1. The dependence on the statistics $\hat{\sigma_i^2}$ comes from not knowing the population variances, upon which our "approximate" chi-square distribution depends.
  2. In this case, "approximately distributed as..." refers to the fact that $f$ and $g$ were chosen such that a random variable distributed as $g\chi^2_f$ has first and second moments that match those of the random variable with the exact distribution $\frac{\sigma_1^2}{n_1f_1}\chi^2_{f_1} + \frac{\sigma_2^2}{n_2f_2}\chi^2_{f_2}$.
  3. Hopefully the above derivation explains how justifiable the approximation is or isn't, as the case may be.
Answer 3 (score 1)

See Loveland, Theorem 23, page 47. She seems to provide a really nice treatment!