Math formulas on Clustering


I am currently studying clustering in machine learning, and I have found a document about guessing the right number of clusters. While reading the first part of it, I am having difficulty understanding the first four formulas, circled in red in the image below. I believe I understand a little of it, but I am not 100% sure.

Can somebody explain what these four formulas are saying? Thank you, and I am sorry if I have posted my question in the wrong place.

[image: the four formulas from the document, circled in red]



Best answer:

Cluster $r$ is a set of points $\{x_1,x_2,\ldots, x_{n_r}\}$. To compute $D_r$, calculate the squared Euclidean distance between every ordered pair of points of cluster $r$ (so each unordered pair is counted twice, and the $i=i'$ terms contribute $0$), and add up all these numbers. This may look more familiar to you: $$D_r = \sum_{i=1}^{n_r} \sum_{i'=1}^{n_r}\|x_i-x_{i'}\|^2.$$
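For instance, the double sum can be computed directly (a minimal NumPy sketch; the function name is mine, not from the document):

```python
import numpy as np

def D_r(X):
    """Sum of squared Euclidean distances over all ordered pairs of
    points in one cluster, given as an (n_r, d) array."""
    diffs = X[:, None, :] - X[None, :, :]   # shape (n_r, n_r, d): x_i - x_i'
    return float(np.sum(diffs ** 2))
```

For the two one-dimensional points $0$ and $2$, the ordered pairs contribute $0+4+4+0=8$.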


Then, we claim that the following identity holds ($\overline{x}:= \frac{1}{n_r}\sum x_i $ denotes the cluster mean): $$\boxed{\frac{1}{2n_r}D_r = \frac{1}{2n_r} \sum_{i=1}^{n_r} \sum_{i'=1}^{n_r}\|x_i-x_{i'}\|^2 = \sum_{i=1}^{n_r}\|x_i-\overline{x}\|^2}.$$

[The following is a very clumsy proof... if anyone has an easier way, please let me know.]

To verify this, assume for the moment that the $x_i$ are one-dimensional (real numbers). Then
\begin{align*}
(\overline{x}-x_i)^2 &= \left(\frac{x_1+x_2+\cdots+x_{n_r}-n_r x_i}{n_r}\right)^2\\
&= \frac{1}{n_r^2}\bigl((x_1-x_i)+(x_2-x_i)+\cdots+(x_{n_r}-x_i)\bigr)^2\\
&= \frac{1}{n_r^2}\left[\sum_{j\ne i}(x_j-x_i)^2+\sum_{j \ne i}\sum_{k \ne i,j}(x_j-x_i)(x_k-x_i)\right].
\end{align*}
Now sum over $i$, split the cross term into its $j<i$ and $j>i$ halves, and relabel $i \leftrightarrow j$ in the $j<i$ half:
\begin{align*}
\sum_{i=1}^{n_r}(\overline{x}-x_i)^2 &= \frac{1}{n_r^2}\left[\sum_{i=1}^{n_r}\sum_{j\ne i}(x_j-x_i)^2+\sum_{i=1}^{n_r}\sum_{j\ne i}\sum_{k \ne i,j}(x_j-x_i)(x_k-x_i)\right]\\
&= \frac{1}{n_r^2}\left[\sum_{i=1}^{n_r}\sum_{j \ne i}(x_j-x_i)^2+\sum_{i=1}^{n_r}\sum_{j<i}\sum_{k \ne i,j}(x_j-x_i)(x_k-x_i)+\sum_{i=1}^{n_r}\sum_{j>i}\sum_{k \ne i,j}(x_j-x_i)(x_k-x_i)\right]\\
&=\frac{1}{n_r^2}\left[\sum_{i=1}^{n_r}\sum_{j\ne i}(x_j-x_i)^2+\sum_{i=1}^{n_r}\sum_{j > i}\sum_{k \ne i,j}\bigl[(x_j-x_i)(x_k-x_i)+(x_i-x_j)(x_k-x_j)\bigr]\right]\\
&=\frac{1}{n_r^2}\left[\sum_{i=1}^{n_r}\sum_{j\ne i}(x_j-x_i)^2+\sum_{i=1}^{n_r}\sum_{j> i}\sum_{k \ne i,j}(x_i-x_j)^2\right]\\
&=\frac{1}{n_r^2}\left[\sum_{i=1}^{n_r}\sum_{j\ne i}(x_j-x_i)^2+(n_r-2)\sum_{i=1}^{n_r}\sum_{j > i}(x_i-x_j)^2\right]\\
&=\frac{1}{n_r^2}\left[2\sum_{i=1}^{n_r}\sum_{j> i}(x_j-x_i)^2+(n_r-2)\sum_{i=1}^{n_r}\sum_{j > i}(x_i-x_j)^2\right]\\
&= \frac{1}{n_r}\sum_{i=1}^{n_r}\sum_{j> i}(x_j-x_i)^2\\
&= \frac{1}{2n_r}\sum_{i=1}^{n_r}\sum_{j =1}^{n_r}(x_j-x_i)^2.
\end{align*}
(The bracketed sum in the fourth line collapses because $(x_j-x_i)(x_k-x_i)+(x_i-x_j)(x_k-x_j)=(x_j-x_i)\bigl[(x_k-x_i)-(x_k-x_j)\bigr]=(x_j-x_i)^2$; in the fifth line, the inner sum over $k$ has $n_r-2$ identical terms.)

So the result holds if the points are one-dimensional. To prove the result in higher dimensions, note that the squared Euclidean distance decomposes coordinate-wise, $\|x-y\|^2=\sum_d (x^{(d)}-y^{(d)})^2$, so we can apply the one-dimensional identity to each coordinate separately and add the results to arrive at the above identity.
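As a sanity check, the boxed identity can be verified numerically on random data (a sketch of my own, not from the document):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 3))        # a toy "cluster": n_r = 7 points in R^3
n_r = len(X)

# Left-hand side: (1 / 2n_r) * D_r, the double sum over ordered pairs
diffs = X[:, None, :] - X[None, :, :]
lhs = np.sum(diffs ** 2) / (2 * n_r)

# Right-hand side: sum of squared distances to the cluster mean
rhs = np.sum((X - X.mean(axis=0)) ** 2)

assert np.isclose(lhs, rhs)        # the two sides agree
```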


$W_k$ is just the sum of $\frac{1}{2n_r} D_r$ over all $k$ clusters. The boxed identity above shows that $$W_k=\sum_{r=1}^k \frac{1}{2n_r} D_r=\sum_{r=1}^k \sum_{x_i \in C_r} \|x_i-\overline{x}_r\|^2,$$ (here, $\overline{x}_r$ is the mean of cluster $r$) which is what they mean by "pooled within-cluster sum of squares around the cluster means." If the number of points is fixed, then a lower $W_k$ indicates a better clustering, since the points would be closer to their respective cluster means.
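In code, $W_k$ follows directly from the right-hand form (a sketch with hypothetical names; `labels[i]` is the cluster index assigned to point $i$):

```python
import numpy as np

def W_k(X, labels):
    """Pooled within-cluster sum of squares around the cluster means:
    W_k = sum over clusters r of sum_{x_i in C_r} ||x_i - mean(C_r)||^2."""
    total = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        total += np.sum((pts - pts.mean(axis=0)) ** 2)
    return float(total)
```

For example, two one-dimensional clusters $\{0,2\}$ and $\{10,12\}$ have means $1$ and $11$, so $W_2 = 1+1+1+1 = 4$.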


So you can calculate $W_k$. But how can you determine whether your $W_k$ is low or not? We need a "reference point"; this is the role of the "null reference distribution." If, for example, our null hypothesis is that the points are scattered uniformly, then you can compute the expected value of $W_k$ under that hypothesis, which represents a "bad" value of $W_k$, since there should be no good clustering. So, if the $W_k$ you compute for your data is very low relative to this expectation, then you have a strong reason to believe that your clustering is good. Doing this for multiple $k$ helps you choose the best $k$. (The use of $\log$ is mostly a computational convenience; you can ignore it if you just want the big picture.)
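The procedure could be sketched as follows, under assumptions of my own that the text does not specify: a uniform-over-the-bounding-box null, a minimal Lloyd's k-means for the clustering step, and Monte Carlo averaging for the expectation:

```python
import numpy as np

def W_k(X, labels):
    """Pooled within-cluster sum of squares around the cluster means."""
    return sum(float(np.sum((X[labels == r] - X[labels == r].mean(axis=0)) ** 2))
               for r in np.unique(labels))

def kmeans_labels(X, k, rng, n_iter=50):
    """Minimal Lloyd's algorithm (illustration only, no restarts)."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center, then recompute centers.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for r in range(k):
            if np.any(labels == r):
                centers[r] = X[labels == r].mean(axis=0)
    return labels

def gap(X, k, B=20, seed=0):
    """Estimate E*[log W_k] - log W_k, taking the null reference to be
    uniform over the bounding box of X; larger values favor this k."""
    rng = np.random.default_rng(seed)
    log_wk = np.log(W_k(X, kmeans_labels(X, k, rng)))
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = []
    for _ in range(B):
        Xb = rng.uniform(lo, hi, size=X.shape)     # one null dataset
        ref.append(np.log(W_k(Xb, kmeans_labels(Xb, k, rng))))
    return float(np.mean(ref) - log_wk)
```

One would compute `gap(X, k)` for a range of $k$ and prefer values of $k$ where the gap is large, i.e. where the data's $\log W_k$ falls far below the null expectation.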


I am not sure how the computation of the expectation of $\log W_k$ works. My guess is that they use Jensen's inequality to get $E[\log W_k] \le \log E[W_k]$ and then use the identity $W_k = \sum_{r=1}^k \sum_{x_i \in C_r}\|x_i-\overline{x}_r\|^2$ somehow, but I may be wrong. (If the document is describing the gap statistic of Tibshirani, Walther and Hastie, the expectation is estimated by Monte Carlo: draw several datasets from the reference distribution, cluster each, and average the resulting values of $\log W_k$.) Perhaps someone else can answer?