clusterization with minimization of the SSE

96 Views Asked by At

I have found the following document in which it explains the minimization of sum of squared errors applied to clusterization. An extract of the book is the following:

enter image description here

Actually I am having some problems interpreting the demonstration procedure shown in the last figure, so far what I get is the following:

  • In the first line I am obtaining the squared error of each ck, which is the mean of each cluster, related to every point xi; and this is extended to all the clusters for k=1 to K.
  • Next, I believe that is applied a partial derivative related to cj, is that related because I want to check when the function reaches an optimum value given cj?
    • In the third line I suppose that the sum of k disappears, because I can simplify it by considering only one k cluster.
    • The last line is a little bit obscure for me, from where do I get that the summation of de x that belongs to Cj is equal to |Cj|cj, any help?

Thank you for your clarifications in this procedure.

1

There are 1 best solutions below

2
On BEST ANSWER
  • Given $C_1, C_2, \ldots, C_k$, their goal is to find $c_1, \ldots, c_k$ such that $SSE(C)$ is minimized.

  • Sum of $k$ disappears because $c_j$ only appears in one cluster, the other $c_k, k \neq j$ is viewewd as a constant when we differentiate with respect to $c_j$.

$$\sum_{x \in C_j}2(c_j-x_i)=0$$ $$\sum_{x \in C_j}(c_j-x_i)=0$$ $$\sum_{x \in C_j}c_j-\sum_{x \in C_j}x_i=0$$ $$c_j\sum_{x \in C_j}1-\sum_{x \in C_j}x_i=0$$ $$|C_j|c_j-\sum_{x \in C_j}x_i=0$$

$$|C_j|c_j=\sum_{x \in C_j}x_i$$