k-means clustering identity

34 Views Asked by At

Suppose $x_1, x_2, \dots, x_n$ are distinct points in $\mathbb{R}^d$, $\mathcal{S} = \{ S_1, S_2, \dots, S_k \}$ is a partition of $\{ x_i : i \in [n] \}$ (with $k \leq n$), and $\mu_i = \frac{\sum_{x \in S_i} x}{|S_i|}$ is the centroid of each cluster $S_i$. Wikipedia claims that $$\mathrm{argmin}_\mathcal{S} \sum_{i = 1}^k \sum_{x \in S_i} \| x - \mu_i \|^2 = \mathrm{argmin}_\mathcal{S} \sum_{i = 1}^k \frac 1 {2 |S_i|} \sum_{x, y \in S_i} \| x - y \|^2$$ because of the identity $$\sum_{x \in S} \| x - \mu \|^2 = \sum_{x \neq y \in S} (x - \mu) \cdot (\mu - y)$$ for each $S \in \mathcal{S}$ with centroid $\mu \in \{ \mu_1, \dots, \mu_n \}$.

How does one prove this identity? Does the claim immediately follow?


Assume that there are $\ell$ points in $S$, labeled $x_1, x_2, \dots, x_\ell$ (this indexing has nothing to do with the original names for all $n$ points), so that $\mu = \frac{\sum_{i = 1}^\ell x_i} {\ell}$. My attempt was to rewrite the left side as $$\begin{align*} \sum_{x \in S} \| x - \mu \|^2 &= \sum_{i = 1}^\ell \| x_i \|^2 + \| \mu \|^2 - 2 x_i \cdot \mu \\ &= \ell \| \mu \|^2 + \sum_{i = 1}^\ell \| x_i \|^2 - 2 \sum_{i = 1}^\ell x_i \cdot \mu \\ &= \ell \| \mu \|^2 + \sum_{i = 1}^\ell \| x_i \|^2 - 2 \left( \sum_{i = 1}^\ell x_i \right) \cdot \mu \\ &= \ell \| \mu \|^2 + \sum_{i = 1}^\ell \| x_i \|^2 - 2 \left( \ell \mu \right) \cdot \mu \\ &= - \ell \| \mu \|^2 + \sum_{i = 1}^\ell \| x_i \|^2 \end{align*}$$ The right side can be rewritten as $$\begin{align*} \sum_{x \neq y \in S} (x - \mu) \cdot (\mu - y) &= \sum_{i = 1}^\ell \left(- (x_i - \mu) \cdot (\mu - x_i) + \sum_{j = 1}^\ell (x_i - \mu) \cdot (\mu - x_j) \right) \\ &= \sum_{i = 1}^\ell \left( \| x_i - \mu \|^2 + \sum_{j = 1}^\ell (x_i - \mu) \cdot (\mu - x_j) \right) \\ &= \sum_{i = 1}^\ell \left( \| x_i - \mu \|^2 + \sum_{j = 1}^\ell x_i \cdot \mu - \mu \cdot x_j - \| \mu \|^2 - x_i \cdot x_j \right) \\ &= \sum_{x \in S} \| x - \mu \|^2 + \sum_{i = 1}^\ell \sum_{j = 1}^\ell x_i \cdot \mu - \mu \cdot x_j - \| \mu \|^2 - x_i \cdot x_j \end{align*}$$ So if the two sides are equal, then $\sum_{i = 1}^\ell \sum_{j = 1}^\ell x_i \cdot \mu - \mu \cdot x_j - \| \mu \|^2 - x_i \cdot x_j = \sum_{i, j} (x_i - \mu) \cdot (\mu - x_j) = 0$? That doesn't sound like it's always true.