Minimum of the variance of a data set given the variances of subsets


Suppose we have a population data set $X$ which is partitioned into two subsets $A$ and $B$, with population variances $3$ and $4$, respectively. Is it true that the population variance of $X$ is at least $3$ (i.e. $\min\{3, 4\}$)?

Intuitively I think this is true, because the "more concentrated" subset $A$ will only be made more dispersed by the inclusion of the "more dispersed" subset $B$. However, I am not able to prove it rigorously using inequalities and the definition of variance. Any help is much appreciated.

Best answer

After a few attempts at deriving inequalities using different versions of the definition of population variance, I finally can prove this fact.

Let $n_1 = |A|$, $n_2 = |B|$, and $n = |X| = n_1+n_2$. Throughout, I use $a$, $b$, and $x$ to denote a generic element of the sets $A$, $B$, and $X$, respectively. Sums are then understood to run over all elements of the corresponding set. Also, $\bar{a}$ denotes the arithmetic mean of all elements of $A$, and similarly for $\bar{b}$ and $\bar{x}$.

Proposition: $\frac{1}{n} \sum(x-\bar{x})^2 \geq \min \big\{ \frac{1}{n_1} \sum(a-\bar{a})^2, \frac{1}{n_2} \sum(b-\bar{b})^2 \big\} $.

Proof: WLOG, assume $\frac{1}{n_2} \sum(b-\bar{b})^2 \geq \frac{1}{n_1} \sum(a-\bar{a})^2$.

Then we need to show that $\frac{1}{n} \sum(x-\bar{x})^2 \geq \frac{1}{n_1} \sum(a-\bar{a})^2$. First, we prove the following lemma:

Lemma: $\sum (x-\bar{x})^2 \geq \sum (a-\bar{a})^2 + \sum (b-\bar{b})^2$.

Indeed, applying the Cauchy-Schwarz inequality to the $n$-vectors $(\bar{a}, \bar{a}, ..., \bar{a}, \bar{b}, \bar{b}, ..., \bar{b})$, where the first $n_1$ components are $\bar{a}$ and the last $n_2$ components are $\bar{b}$, and $(1, 1, ..., 1)$, we get $$(n_1\bar{a}+n_2\bar{b})^2 \leq (n_1\bar{a}^2 + n_2\bar{b}^2)n,$$

or

$$n_1\bar{a}^2 + n_2\bar{b}^2 \geq n\bigg(\frac{n_1\bar{a}+n_2\bar{b}}{n}\bigg)^2 = n\bar{x}^2,$$

or

$$n\bar{x}^2 - n_1 \bar{a}^2 - n_2 \bar{b}^2 \leq 0,$$

so, because any $y \leq 0$ satisfies $y \geq 2y$,

$$n \bar{x}^2 - n_1 \bar{a}^2 - n_2 \bar{b}^2 \geq 2(n \bar{x}^2 - n_1\bar{a}^2 - n_2 \bar{b}^2).$$

Using $n \bar{x}^2 = \bar{x} \sum x$ (and similarly for $n_1\bar{a}^2$ and $n_2 \bar{b}^2$) and rearranging the terms, we get

$$- 2\bar{x}\sum x + n \bar{x}^2 \geq -2\bar{a}\sum a + n_1 \bar{a}^2-2\bar{b}\sum b + n_2 \bar{b}^2.$$

Since $\sum x^2 = \sum a^2 + \sum b^2$, we obtain

$$\sum x^2 - 2\bar{x}\sum x + n \bar{x}^2 \geq \sum a^2-2\bar{a}\sum a+ n_1 \bar{a}^2 + \sum b^2-2\bar{b}\sum b + n_2 \bar{b}^2.$$

Hence, $\sum (x-\bar{x})^2 \geq \sum (a-\bar{a})^2 + \sum(b-\bar{b})^2$.
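As a numerical sanity check of the Lemma, here is a minimal Python sketch using exact rational arithmetic. The data in `A` and `B` and the helper names `mean` and `sum_sq_dev` are illustrative assumptions, not part of the argument above. The sketch also compares the gap between the two sides with $\frac{n_1 n_2}{n}(\bar{a}-\bar{b})^2$, which is the standard sum-of-squares decomposition and is not derived in this answer.

```python
from fractions import Fraction

def mean(values):
    """Arithmetic mean as an exact fraction (values are assumed to be integers)."""
    return Fraction(sum(values), len(values))

def sum_sq_dev(values):
    """Sum of squared deviations from the mean: sum of (v - mean)^2."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

# Hypothetical example data, chosen only for illustration.
A = [1, 2, 3]
B = [10, 14]
X = A + B

lhs = sum_sq_dev(X)                    # left-hand side of the Lemma
rhs = sum_sq_dev(A) + sum_sq_dev(B)    # right-hand side of the Lemma
gap = lhs - rhs

n1, n2, n = len(A), len(B), len(X)
expected_gap = Fraction(n1 * n2, n) * (mean(A) - mean(B)) ** 2

print(lhs, rhs, gap, expected_gap)
```

For this data the gap is strictly positive because the two subset means differ.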

Now, with the Lemma and the assumption $\frac{1}{n_2} \sum(b-\bar{b})^2 \geq \frac{1}{n_1} \sum(a-\bar{a})^2$ in mind, we get

$$\begin{aligned} n_1 \sum(x-\bar{x})^2 &\geq n_1 \sum(a-\bar{a})^2 + n_1 \sum(b-\bar{b})^2 \\ &\geq n_1 \sum(a-\bar{a})^2 + n_2 \sum (a-\bar{a})^2 = n\sum(a-\bar{a})^2, \end{aligned}$$

and therefore, dividing both sides by $n n_1$,

$$\frac{1}{n} \sum(x-\bar{x})^2 \geq \frac{1}{n_1} \sum(a-\bar{a})^2.$$
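The Proposition can also be spot-checked numerically. The sketch below is only an illustration under stated assumptions: the helper names `pvar` and `proposition_holds`, the concrete sets (picked so that the subset variances are $3$ and $4$, as in the question), and the random seed and ranges are all my own choices.

```python
from fractions import Fraction
import random

def pvar(values):
    """Population variance (divide by n), computed exactly for integer data."""
    m = Fraction(sum(values), len(values))
    return sum((v - m) ** 2 for v in values) / len(values)

def proposition_holds(A, B):
    """True if the variance of the combined data is >= min(var(A), var(B))."""
    return pvar(A + B) >= min(pvar(A), pvar(B))

# Subsets with population variances 3 and 4, as in the question.
A = [0, 0, 0, 4]   # mean 1, variance 3
B = [0, 4]         # mean 2, variance 4
print(pvar(A), pvar(B), pvar(A + B))

# Random spot check; the seed, value range, and sizes are arbitrary choices.
rng = random.Random(0)
all_ok = all(
    proposition_holds(
        [rng.randint(-20, 20) for _ in range(rng.randint(1, 8))],
        [rng.randint(-20, 20) for _ in range(rng.randint(1, 8))],
    )
    for _ in range(1000)
)
print(all_ok)
```

A passing spot check is of course not a proof; it only guards against algebra slips in the derivation.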

Remark: Under the WLOG assumption, equality holds in the Proposition iff equality holds in the Cauchy-Schwarz inequality AND $\frac{1}{n_2} \sum(b-\bar{b})^2 = \frac{1}{n_1} \sum(a-\bar{a})^2$.

Equality in the Cauchy-Schwarz inequality holds iff $(\bar{a}, \bar{a}, ..., \bar{a}, \bar{b}, \bar{b}, ..., \bar{b}) = \lambda (1, 1, ..., 1)$ for some $\lambda \in \mathbb{R}$, iff $\bar{a} = \bar{b}$.

Hence, the equality holds in the Proposition if and only if the data subsets $A$ and $B$ have the same mean and population variance.
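To illustrate the equality condition, here is a small Python sketch with made-up data (the sets `A`, `B_same`, `B_shifted` and the helper names are hypothetical). In the first pair the subsets share both mean and population variance; in the second pair only the variances agree.

```python
from fractions import Fraction

def mean(values):
    """Arithmetic mean as an exact fraction (values are assumed to be integers)."""
    return Fraction(sum(values), len(values))

def pvar(values):
    """Population variance (divide by n), computed exactly."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

A = [1, 3]                  # mean 2, variance 1
B_same = [1, 1, 3, 3]       # mean 2, variance 1 (same mean as A)
B_shifted = [2, 2, 4, 4]    # mean 3, variance 1 (different mean)

equal_case = pvar(A + B_same)       # combined variance when the means agree
strict_case = pvar(A + B_shifted)   # combined variance when the means differ
print(equal_case, strict_case)
```

With equal means the combined variance matches the common subset variance, while shifting one subset makes the inequality strict.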