Equivalent sample size for the variance of a sum of two variables


In a two-sample t-test, the variance of $X-Y$ (where $X$ is the sample mean of $x$ and $Y$ is the sample mean of $y$) is $\sigma^2/n+\sigma^2/m$ (where $n$ is the sample size of $x$ and $m$ is the sample size of $y$), assuming $x$ and $y$ have equal variance $\sigma^2$. What is the equivalent sample size of $X-Y$? I am asking because I need a confidence interval for $(X-Y)-(J-K)$ (where $J$ is the sample mean of $j$ and $K$ is the sample mean of $k$).

Can somebody give a formula for the equivalent sample size (if it exists), or suggest how to obtain the confidence interval in this case?


The only way I can make sense of your question is to suppose you have the model $Y_{ij} = \mu_i + e_{ij}$, for $i = 1, 2, 3, 4$ and $j = 1, 2, \dots, n_i,$ where the $e_{ij}$ are a random sample of size $N = n_1 + n_2 + n_3 + n_4$ from a normal distribution with mean $0$ and standard deviation $\sigma.$ This is the model for a 'one-way analysis of variance' (also called a 'completely randomized design').

It has four treatment groups corresponding to your variables $x, y, j, k$ for which subjects have been chosen independently.

In my notation you want to obtain a confidence interval for the 'linear contrast' $$\Lambda = (\mu_1 - \mu_2) - (\mu_3 - \mu_4) = \mu_1 - \mu_2 - \mu_3 + \mu_4,$$ which is estimated by

$$\hat L = (\bar Y_1 - \bar Y_2) - (\bar Y_3 - \bar Y_4) = \bar Y_1 - \bar Y_2 - \bar Y_3 + \bar Y_4.$$

Then form the $T$ statistic: $$T = \frac{\hat L - \Lambda }{ S_w\sqrt{\sum_{i=1}^4 (1/n_i)}},$$ where $$S_w = \sqrt{\frac{\sum_{i=1}^4 (n_i -1)S_i^2}{N - 4}},$$ in which the $S_i^2$ are the four group sample variances. This $T$ statistic has Student's t distribution with $df = N - 4$ degrees of freedom. In the balanced case where all $n_i = n$, we have the following simplifications: $T = \sqrt{n}(\hat L - \Lambda)/(2S_w),$ $S_w^2 = (S_1^2 + S_2^2 + S_3^2 + S_4^2)/4,$ and $df = 4(n-1).$
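
The denominator of $T$ comes from the variance of $\hat L$. Under the model above the four group means are independent with $\operatorname{Var}(\bar Y_i) = \sigma^2/n_i$, so

$$\operatorname{Var}(\hat L) = \sum_{i=1}^4 \operatorname{Var}(\bar Y_i) = \sigma^2 \sum_{i=1}^4 \frac{1}{n_i}.$$

One way to read 'equivalent sample size' (an interpretation, not a standard term in this setting) is the number $n_e$ for which $\operatorname{Var}(\hat L) = \sigma^2/n_e$, that is,

$$n_e = \frac{1}{\sum_{i=1}^4 (1/n_i)}, \qquad \text{and for two groups} \qquad n_e = \frac{nm}{n+m}.$$

Note that $n_e$ only describes the variance. The degrees of freedom of $T$ are still $N - 4$, because they come from the pooled estimate $S_w$ of $\sigma$, not from $n_e$.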

Thus a 95% confidence interval (CI) for $\Lambda$ is $\hat L \pm t^* S_w\sqrt{\sum_{i=1}^4 (1/n_i)}$, where $t^*$ cuts 2.5% from the upper tail of Student's t distribution with $df = N - 4$.
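
The interval formula above can be sketched as a small Python function that works from summary statistics. This is only an illustration: the function name `contrast_ci`, its arguments, and the numbers in the usage example are hypothetical, the critical value `t_star` is assumed to be looked up separately (from a table or software), and the `coeffs` argument generalizes the $\pm 1$ signs used here to arbitrary contrast coefficients $c_i$, for which the variance factor becomes $\sum c_i^2/n_i$.

```python
import math

def contrast_ci(means, sds, ns, coeffs, t_star):
    """CI for a linear contrast of group means, equal-variance model.

    means, sds, ns: per-group sample means, sample SDs, sample sizes.
    coeffs: contrast coefficients (here +1, -1, -1, +1).
    t_star: critical value of Student's t with N - k degrees of freedom,
            supplied by the caller (assumed looked up elsewhere).
    Returns (estimate, lower, upper).
    """
    k = len(ns)
    N = sum(ns)
    # Pooled variance S_w^2 = sum (n_i - 1) S_i^2 / (N - k)
    pooled_var = sum((n - 1) * s ** 2 for n, s in zip(ns, sds)) / (N - k)
    # Contrast estimate L-hat = sum c_i * Ybar_i
    estimate = sum(c * m for c, m in zip(coeffs, means))
    # Standard error S_w * sqrt(sum c_i^2 / n_i); equals S_w * sqrt(sum 1/n_i) for +-1
    se = math.sqrt(pooled_var * sum(c ** 2 / n for c, n in zip(coeffs, ns)))
    half_width = t_star * se
    return estimate, estimate - half_width, estimate + half_width

# Hypothetical balanced example: 4 groups of 5, so df = 16 and t* is about 2.120.
est, lo, hi = contrast_ci(
    means=[10.0, 8.0, 7.0, 6.0],
    sds=[2.0, 2.0, 2.0, 2.0],
    ns=[5, 5, 5, 5],
    coeffs=[1, -1, -1, 1],
    t_star=2.120,
)
print(f"estimate = {est:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Passing `t_star` in explicitly keeps the sketch free of third-party dependencies; with a statistics library available one would normally compute the quantile there instead.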

Below is an example with simulated data generated in Minitab with $\mu_1 = 100,\, \mu_2 = 110,\, \mu_3 = 150,\, \mu_4 = 110\,$ and $\sigma = 4,$ with sample sizes $n_1 = 10,\, n_2 = 12,\, n_3 = 8,\, n_4 = 15.$ Thus $\Lambda = (100 - 110) - (150 - 110) = -50.$

 Row    A    B    C    D
   1  101   98  153  110
   2  103  118  155  112
   3  101  108  145  107
   4  100  111  158  114
   5   95  112  153  108
   6  101  112  150  111
   7   96  108  148  115
   8   97  106  145  107
   9  107  114       115
  10   95  110       107
  11       108       100
  12       112       107
  13                 109
  14                 111
  15                 111

 One-way ANOVA: A, B, C, D 

 Source  DF       SS      MS       F      P
 Factor   3  13433.2  4477.7  239.94  0.000
 Error   41    765.1    18.7
 Total   44  14198.3

 S = 4.320   R-Sq = 94.61%   R-Sq(adj) = 94.22%

                           Individual 95% CIs For Mean Based on
                           Pooled StDev
 Level   N    Mean  StDev  -----+---------+---------+---------+----
 A      10   99.60   3.86  (*-)
 B      12  109.75   4.90        (-*-)
 C       8  150.88   4.70                                    (-*-)
 D      15  109.60   3.89         (*-)
                           -----+---------+---------+---------+----
                              105       120       135       150
 Pooled StDev = 4.32

From the ANOVA table $S_w^2 = \text{MS(Error)} = 18.7,\,$ $S_w = 4.32$ and $df = 41.$ Also, $\hat L = 99.60 - 109.75 - 150.88 + 109.60 = - 51.43.$ From tables of the t distribution with $df = 41$, $t^* = 2.020.$ And $\sqrt{\sum (1/n_i)} = \sqrt{0.375} = 0.6124.$ So the 95% CI for $\Lambda$ is $-51.43 \pm 2.020(4.32)(0.6124)$ or $-51.43 \pm 5.34$, that is, about $(-56.77,\, -46.09)$. The interval covers the true value $\Lambda = -50$ used to generate the data, and it lies well below 0, so it seems clear that $\Lambda$ is significantly less than 0.
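
The quantities in this computation can be rechecked from the raw data table with the Python standard library alone. The sketch below is not part of the Minitab session: the helper names (`t_pdf`, `t_cdf`, `t_quantile`) are hypothetical, and the t critical value is obtained by numerically integrating the density with Simpson's rule and bisecting, which is an assumption made to avoid third-party packages rather than how tables or Minitab compute it.

```python
import math

# Raw data copied from the table above (columns A, B, C, D).
groups = {
    "A": [101, 103, 101, 100, 95, 101, 96, 97, 107, 95],
    "B": [98, 118, 108, 111, 112, 112, 108, 106, 114, 110, 108, 112],
    "C": [153, 155, 145, 158, 153, 150, 148, 145],
    "D": [110, 112, 107, 114, 108, 111, 115, 107, 115, 107, 100, 107, 109, 111, 111],
}
signs = {"A": 1, "B": -1, "C": -1, "D": 1}  # contrast mu_1 - mu_2 - mu_3 + mu_4

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF for x >= 0: 0.5 plus Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(p, df, upper=50.0):
    """Quantile for 0.5 < p < 1 by bisection on [0, upper]."""
    lo, hi = 0.0, upper
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

means = {g: sum(v) / len(v) for g, v in groups.items()}
# Error sum of squares and degrees of freedom, as in the ANOVA table.
sse = sum((y - means[g]) ** 2 for g, v in groups.items() for y in v)
df_error = sum(len(v) for v in groups.values()) - len(groups)
s_w = math.sqrt(sse / df_error)                      # pooled standard deviation
l_hat = sum(signs[g] * means[g] for g in groups)     # contrast estimate
root_sum = math.sqrt(sum(1 / len(v) for v in groups.values()))
t_star = t_quantile(0.975, df_error)                 # upper 2.5% point
half_width = t_star * s_w * root_sum

print(f"SSE = {sse:.1f}, df = {df_error}, S_w = {s_w:.3f}")
print(f"L-hat = {l_hat:.2f}, sqrt(sum 1/n_i) = {root_sum:.4f}, t* = {t_star:.3f}")
print(f"95% CI: {l_hat - half_width:.2f} to {l_hat + half_width:.2f}")
```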

Notes: (a) This method works for one contrast chosen before seeing the data. Multiple or 'ad hoc' contrasts require modifications to control the overall error rate. (b) Many texts will show the statistic $F = T^2$ with $df_1 = 1$ and $df_2 = N-4$, which is precisely equivalent to what we have shown here. (c) Whenever possible, it is best to use a 'balanced' design with all $n_i$ equal. Not only are the formulas simpler, but this is a more efficient use of resources (for instance, to get CIs of the shortest average length for a given total number of subjects). (d) This method assumes four independent samples from four normal populations, all with the same standard deviation $\sigma.$ (e) Similar material can be found in intermediate level applied statistics texts and texts on the design of experiments, often under the heading 'multiple comparisons' or 'linear contrasts'.
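
Note (c) can be made precise under the equal-variance model used here. For a fixed total $N = n_1 + n_2 + n_3 + n_4$, the Cauchy-Schwarz inequality gives

$$\left(\sum_{i=1}^4 n_i\right)\left(\sum_{i=1}^4 \frac{1}{n_i}\right) \ge 4^2 = 16, \qquad \text{so} \qquad \sum_{i=1}^4 \frac{1}{n_i} \ge \frac{16}{N},$$

with equality exactly when all $n_i = N/4$. The half-width of the CI is $t^* S_w\sqrt{\sum (1/n_i)}$, and neither $t^*$ nor the distribution of $S_w$ depends on how the $N$ subjects are split among the groups (both depend only on $df = N - 4$), so the balanced design gives the shortest interval on average. In the example, $N = 45$ and $\sum (1/n_i) = 0.375$, compared with the lower bound $16/45 \approx 0.356$.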