There are quite a few resources about the K-means algorithm, but what I need is a proof of how its objective function is derived via the so-called data recovery approach. Ref. for this approach: B. Mirkin, Clustering: A Data Recovery Approach, CRC Press (1st Edition, 2005; 2nd Edition, 2012). The book can be downloaded freely: http://www.dcs.bbk.ac.uk/~mirkin/papers/n.pdf
My question:
Given an entity-to-feature matrix $Y = (y_{iv}) \in \mathbb R^{N \times V}$ and a number of clusters $K$, the goal is to partition the entities into $K$ disjoint sets $S = \{S_1, S_2, \ldots, S_K \}$. Each cluster $S_{k}$ is represented by an $N \times 1$ binary column vector $s_{k} = (s_{ik})$, where $s_{ik}=1$ if $i \in S_{k}$ and $s_{ik}=0$ otherwise ($i\in I$).
To capture the notion that members of a cluster have similar feature values, we consider the standard point (centroid) of $S_{k}$, the $V$-dimensional vector $c_{k}=(c_{kv})$ ($v=1, \ldots, V$, $k=1, \ldots, K$).
Then, as described on page 288:
\begin{equation} \label{e1} y_{iv} = c_{kv} + e_{iv}; \ i \in S_{k} \ \ \ (7.18) \end{equation}
As the author writes: by multiplying equations (7.18) by themselves and summing up the results, it is not difficult to derive the following equation:
\begin{equation} \label{eqn:y_iv_sqr_sum} \sum_{i,v} y_{iv}^{2} = \sum_{k=1}^{K} \sum_{v} c_{kv}^{2} |S_{k}| + \sum_{i,v} e_{iv}^{2} \ \ \ \ (7.19) \end{equation}
Now my question is whether the textbook's proof is correct, because I obtain something slightly different: in my version the coefficient $K$ multiplies the LHS and the last term of the RHS.
Here is my derivation, defining the model as:
\begin{equation} \label{e:2} y_{iv} = c_{kv} s_{ik} + e_{iv} \ \ (i=1, \ldots, N, \ v=1, \ldots, V, \ k=1, \ldots, K) \end{equation}
Rearranging, squaring, and summing over all clusters, all entities, and all features gives:
\begin{equation} \sum_{k=1}^{K} \sum_{i,v} e_{i v}^{2} = \sum_{k=1}^{K} \sum_{i, v}(y_{i v} - c_{k v} s_{i k})^{2} \end{equation}
Furthermore, we have:
\begin{equation} \begin{split} K \sum_{i,v} e_{i v}^{2} = \sum_{k=1}^{K} \sum_{i, v} (y_{i v}^{2} - 2 y_{iv} c_{k v} s_{i k} + c_{k v}^{2} s_{i k}^{2}) \end{split} \end{equation}
Recalling the definition of the mean:
\begin{equation} c_{k v} = \frac{\sum_{i} y_{iv} s_{ik}}{|S_{k}|}=\frac{\sum_{i\in S_{k}}y_{iv}}{|S_{k}|} \label{eqn:def_ck_same_mdl} \end{equation}
Substituting it into the equation above, after a little algebraic manipulation we obtain:
\begin{equation} K \sum_{i,v} y_{iv}^{2} = \sum_{k=1}^{K} \sum_{ v} c_{kv}^{2} |S_{k}| + K \sum_{i,v} e_{iv}^{2} \ \ \ \ (*) \end{equation}
The above Eqn. holds because:
- $\sum_{i} s_{ik}^2 = \sum_{i} s_{ik} = |S_{k}|$, since $s_k$ is an $N$-dimensional binary vector;
- for a given $k$ recalling the definition of the mean, $ c_{k v} = \frac {1}{|S_{k}|} \sum_{i \in S_{k}} y_{i v} $, we find $\sum_{v} c_{k v} \sum_{i \in S_{k}} y_{i v}= \sum_{v} c_{k v} c_{k v} |S_{k}| = |S_{k}| \sum_{v} c_{k v}^2$.
Am I right? (In the book there is no factor $K$ such as the one in my Eqn. (*).)
Thanks in advance!
After spending several hours I finally managed to prove it properly, so that it is compatible with the proof in the textbook. (I will leave the question open for a while in case someone wants to provide another answer.)
Here is my proof:
Since we have $K$ non-overlapping clusters, we can define our model as:
\begin{equation} \label{model_f} y_{iv} = \sum_{k=1}^{K} c_{kv}s_{ik} + e_{iv}, \end{equation}
Rearranging and squaring implies:
\begin{equation} \label{model_f_sqr} e_{iv}^2 = (y_{iv} - \sum_{k=1}^{K} c_{kv}s_{ik})^2, \end{equation}
opening the parentheses yields (note that the clusters are disjoint, so $s_{ik} s_{ik'} = 0$ for $k \neq k'$ and the square of the sum has no cross terms):
\begin{equation} \label{model_f_sqr_opn} e_{iv}^2 = y_{iv}^2 - 2 y_{iv} \sum_{k=1}^{K} c_{kv}s_{ik} + \sum_{k=1}^{K} c_{kv}^2 s_{ik}^2, \end{equation}
By summing over all entities and all features, and rearranging a bit:
\begin{equation} \label{model_f_sqr_opn_sums} \sum_{i,v} e_{iv}^2 = \sum_{i,v} y_{iv}^2 - 2 \sum_{i,v} \sum_{k=1}^{K} y_{iv} s_{ik} c_{kv} + \sum_{i,v} \sum_{k=1}^{K} c_{kv}^2 s_{ik}^2. \end{equation}
Now we should recall two facts:
1) the definition of the average of cluster $k$ over feature $v$: $c_{kv} = \frac{\sum_{i}y_{iv}s_{ik}}{N_{k}}$, where $N_{k} = |S_{k}|$ is the cluster cardinality; therefore $\sum_{i}y_{iv}s_{ik} = c_{kv} N_{k}$;
2) since the cluster membership vectors $s_{k}$ are binary, $\sum_{i} s_{ik}^2 = \sum_{i} s_{ik} = N_{k}$.
Therefore,
w.r.t. 1): $ -2 \sum_{i,v} \sum_{k=1}^{K} y_{iv} s_{ik} c_{kv} = -2 \sum_{k=1}^{K} \sum_{v} c_{kv}^{2} N_{k}$;
And
w.r.t. 2): $ \sum_{k=1}^{K} \sum_{v} \sum_{i} c_{kv}^2 s_{ik}^2 = \sum_{k=1}^{K} \sum_{v} c_{kv}^2 N_{k}$.
Substituting these, the last two terms combine and we have:
\begin{equation} \label{model_proof} \sum_{i,v} e_{iv}^2 = \sum_{i,v} y_{iv}^2 - \sum_{k=1}^{K} \sum_{v} c_{kv}^2 N_{k} \end{equation}
This is exactly Eqn. (7.19) rearranged, and the proof is complete.
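As a sanity check, the decomposition can also be verified numerically. The sketch below (plain Python; the toy data, cluster count, and partition are arbitrary choices of mine, not from the book) computes the three sums of (7.19) on random data with the within-cluster means as centroids, and confirms the identity holds with no factor of $K$:

```python
# Numerical sanity check of the decomposition (7.19):
#   sum_{i,v} y_iv^2 = sum_k sum_v c_kv^2 |S_k| + sum_{i,v} e_iv^2
# Toy sizes and the partition below are arbitrary illustration choices.
import random

random.seed(0)
N, V, K = 12, 3, 4

# Random data matrix Y (N entities x V features).
Y = [[random.gauss(0.0, 1.0) for _ in range(V)] for _ in range(N)]

# An arbitrary disjoint partition of the N entities into K clusters.
labels = [i % K for i in range(N)]
clusters = [[i for i in range(N) if labels[i] == k] for k in range(K)]

# Within-cluster means c_kv (the standard points of the clusters).
c = [[sum(Y[i][v] for i in S) / len(S) for v in range(V)] for S in clusters]

# Residuals e_iv = y_iv - c_kv for i in S_k, as in (7.18).
e = [[Y[i][v] - c[labels[i]][v] for v in range(V)] for i in range(N)]

lhs = sum(Y[i][v] ** 2 for i in range(N) for v in range(V))
explained = sum(c[k][v] ** 2 * len(clusters[k])
                for k in range(K) for v in range(V))
residual = sum(e[i][v] ** 2 for i in range(N) for v in range(V))

# The identity holds up to floating-point error -- no factor K appears.
print(abs(lhs - (explained + residual)) < 1e-9)
```

The check works precisely because $c_{kv}$ is the within-cluster mean, so the residuals sum to zero inside each cluster and the cross term vanishes.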