Please explain what this notation means in K-means clustering.

231 Views Asked by At

I am reading up on K-means clustering. I understand the procedure to an extent (I think!), but I am looking to understand the mathematical formulation better.

Given $m $ data points {$ x^1, x^2, ....., x^m $} $\in \mathbb{R^n} $

Find $k $ cluster centers {$ c^1, c^2, ....., c^k $} $\in \mathbb{R^n} $

And assign each data point i to one cluster

$\pi(i) \in $ {$ 1,2, ......,k$}

Such that, averaged square distance is minimized from each data point to respective cluster center.

$ min_{c, \pi} \frac{1}{m} \sum_{j=1}^{m} \Vert x^i - c^{\pi(i)} \Vert^2$

Now, decide cluster membership of each data point $x^i$ by assigning it to nearest cluster.

$ \pi(i) = argmin_{j=1,.....,k} \Vert x^i - c^j \Vert^2$

Now what I don't understand is the following:

To update the cluster center we use:

$c^j = \frac{1}{ \vert \{i: \pi(i)=j \} \vert } \sum_{i: \pi(i)=j}{x^i}$

I understand that we are averaging all the data points and selecting the new average as the cluster center, but how does this notation explain that? I would greatly appreciate if someone explained that to me. Thanks!

1

There are 1 best solutions below

2
On BEST ANSWER

The $\pi$ function tells you which cluster a data-point belongs to, so $\{i\colon \pi(i)=j\}$ denotes the (index) set of all points which belong to the $j$-th cluster. In particular

$$ c^j = \Big( \underbrace{\vert \{i: \pi(i)=j \} \vert}_{\substack{\text{number of data-points}\\\text{belonging to cluster j}} }\Big)^{-1} \underbrace{\sum_{i: \pi(i)=j}}_{{\substack{\text{sum over all data-points}\\\text{belonging to cluster j}}}}{x^i} $$

is the average of all data points beloning to the $j$-th cluster.