Derivation of within point scatter $W(C)$


I'm reading the book "The Elements of Statistical Learning". In the section on K-means clustering, the authors derive an equation for the "within point scatter", a quantity that measures how "scattered" the points are within a cluster:

$$\begin{aligned} W(C) &=\frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C\left(i^{\prime}\right)=k}\left\|x_{i}-x_{i^{\prime}}\right\|^{2} \\ &=\sum_{k=1}^{K} N_{k} \sum_{C(i)=k}\left\|x_{i}-\bar{x}_{k}\right\|^{2} \end{aligned}$$

where

$N_{k}=\sum_{i=1}^{N} I(C(i)=k)$,

$\bar{x}_{k}=\left(\bar{x}_{1 k}, \ldots, \bar{x}_{p k}\right)$

and $C(i)$ is an encoder that assigns each observation to one of $K$ clusters. Each observation $i$ has $p$ features, so $\left\|x_{i}-x_{i^{\prime}}\right\|^{2}=\sum_{j=1}^{p}\left(x_{i j}-x_{i^{\prime} j}\right)^{2}$.

In the equation above I don't understand how they arrive at the result containing $\bar{x}_{k}$. I tried to compute it by "brute force", but the indicator function $I(C(i)=k)$ and the disappearance of the $1/2$ in front of the first sum confuse me. What is a simple way to derive the result?

Best answer:

A common trick is to add and subtract the same term. Here I add and subtract $\bar{x}_k$ and expand the squared norm as an inner product.

$$\begin{align} &\frac12 \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i')=k} \|x_i-x_{i'}\|^2 \\&= \frac12 \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i')=k} \|(x_i - \bar{x}_k) - (x_{i'}-\bar{x}_k)\|^2 \\ &= \frac12 \sum_{k=1}^K \sum_{C(i)=k} \sum_{C(i')=k} \left(\|x_i - \bar{x}_k\|^2 + \|x_{i'}-\bar{x}_k\|^2 - 2\langle x_i - \bar{x}_k,\, x_{i'}-\bar{x}_k\rangle\right) \\ &= \frac12 \sum_{k=1}^K \left(N_k\sum_{C(i)=k} \|x_i - \bar{x}_k\|^2 + N_k\sum_{C(i')=k}\|x_{i'}-\bar{x}_k\|^2 - 2\Bigl\langle \sum_{C(i)=k}\left(x_{i}-\bar{x}_k\right),\,\sum_{C(i')=k}\left(x_{i'}-\bar{x}_k\right)\Bigr\rangle\right) \\ &= \sum_{k=1}^K N_k\sum_{C(i)=k}\|x_i - \bar{x}_k\|^2. \end{align}$$

In the third equality, each squared-norm term is constant in one of the two inner summation indices, so summing over that index contributes the factor $N_k$; by bilinearity the inner product factors through the sums. The cross term then vanishes because
$$\sum_{C(i)=k}\left(x_i - \bar{x}_k\right) = N_k \bar{x}_k - N_k \bar{x}_k = 0,$$
since $\bar{x}_k = \frac{1}{N_k}\sum_{C(i)=k} x_i$ is the mean of the points in cluster $k$. The two remaining terms are equal, which cancels the factor $\frac12$.
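If you want to convince yourself of the identity without algebra, it is easy to check numerically. Below is a minimal sketch (variable names `X`, `C`, `K` etc. are my own, not from the book) that evaluates both sides on random data and a random cluster assignment:

```python
# Numerical check of the identity
#   (1/2) sum_k sum_{C(i)=k} sum_{C(i')=k} ||x_i - x_i'||^2
#     == sum_k N_k sum_{C(i)=k} ||x_i - xbar_k||^2
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 20, 3, 4
X = rng.normal(size=(N, p))       # N observations with p features each
C = rng.integers(0, K, size=N)    # a random encoder C(i) in {0, ..., K-1}

lhs = 0.0
for k in range(K):
    Xk = X[C == k]                               # points assigned to cluster k
    diffs = Xk[:, None, :] - Xk[None, :, :]      # all pairwise x_i - x_i'
    lhs += 0.5 * np.sum(diffs ** 2)

rhs = 0.0
for k in range(K):
    Xk = X[C == k]
    if len(Xk) == 0:                             # skip empty clusters
        continue
    xbar_k = Xk.mean(axis=0)                     # cluster mean
    rhs += len(Xk) * np.sum((Xk - xbar_k) ** 2)  # N_k * within-cluster scatter

print(np.isclose(lhs, rhs))  # prints True
```

The comparison uses `np.isclose` rather than exact equality because the two sides accumulate floating-point error differently.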