A Data Recovery proof on K-means algorithm objective function


There is plenty of material on the K-means algorithm, but what I need is a proof of how its objective function is derived under the so-called data recovery approach. Reference for this approach: B. Mirkin, Clustering: A Data Recovery Approach, CRC Press (1st Edition, 2005; 2nd Edition, 2012). The book can be downloaded freely: http://www.dcs.bbk.ac.uk/~mirkin/papers/n.pdf

My question:

Given an entity-to-feature matrix $Y = (y_{iv}) \in \mathbb R^{N \times V}$ and a given number of clusters $K$, the goal is to partition the entity set $I$ into $K$ disjoint clusters, that is, $S = \{S_1, S_2, \ldots, S_K \}$. Cluster $S_{k}$ is represented by an $N \times 1$ binary column vector $s_{k} = (s_{ik})$, where $s_{ik}=1$ if $i \in S_{k}$ and $s_{ik}=0$ otherwise ($i\in I$).

To relate the clusters to the notion that their members have similar feature values, we consider the standard $V$-dimensional point (centroid) of $S_{k}$, $c_{k}=(c_{kv})$ ($v=1, \ldots, V$, $k=1, \ldots, K$).

Then, as described on page 288:

\begin{equation} \label{e1} y_{iv} = c_{kv} + e_{iv}; \ i \in S_{k} \ \ \ (7.18) \end{equation}

As the author wrote: multiplying equations (7.18) by themselves and summing up the results, it is not difficult to derive the following equation:

\begin{equation} \label{eqn:y_iv_sqr_sum} \sum_{i,v} y_{iv}^{2} = \sum_{k=1}^{K} \sum_{v} c_{kv}^{2} |S_{k}| + \sum_{i,v} e_{iv}^{2} \ \ \ \ (7.19) \end{equation}
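Equation (7.19) can also be sanity-checked numerically. Below is a minimal sketch, assuming NumPy and an arbitrary partition (the data and partition are made up for illustration; the identity holds for any partition, as long as each $c_{kv}$ is the within-cluster mean):

```python
# Numerical check of (7.19): the total data scatter splits into an
# "explained" part (cluster contributions) plus the residual scatter.
import numpy as np

rng = np.random.default_rng(0)
N, V, K = 12, 3, 4
Y = rng.normal(size=(N, V))                    # entity-to-feature matrix
labels = rng.permutation(np.arange(N) % K)     # arbitrary partition, no empty cluster

total = np.sum(Y**2)                           # LHS: sum_{i,v} y_iv^2
explained = 0.0
residual = 0.0
for k in range(K):
    Sk = Y[labels == k]                        # entities of cluster S_k
    c_k = Sk.mean(axis=0)                      # within-cluster mean c_kv
    explained += len(Sk) * np.sum(c_k**2)      # |S_k| * sum_v c_kv^2
    residual += np.sum((Sk - c_k)**2)          # sum of e_iv^2 over S_k

print(np.isclose(total, explained + residual))  # True
```

Note that nothing here requires the partition to be K-means-optimal; (7.19) is a Pythagorean decomposition that holds as soon as the centroids are within-cluster means.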

Now my question is whether the textbook's proof is correct, because I obtain something slightly different: the coefficient $K$ multiplies the LHS and the last term of the RHS.

Here is my derivation.

Defining the model as:

\begin{equation} \label{e:2} y_{iv} = c_{kv} s_{ik} + e_{iv} \ \ (i=1, \ldots, N,\ v=1, \ldots, V,\ k=1, \ldots, K) \end{equation}

Rearranging, squaring, and summing over all clusters, all entities, and all features gives:

\begin{equation} \sum_{k=1}^{K} \sum_{i,v} e_{i v}^{2} = \sum_{k=1}^{K} \sum_{i, v}(y_{i v} - c_{k v} s_{i k})^{2} \end{equation}

Furthermore, since $e_{iv}$ does not depend on $k$, the LHS equals $K \sum_{i,v} e_{iv}^{2}$, so we have:

\begin{equation} \begin{split} K \sum_{i,v} e_{i v}^{2} = \sum_{k=1}^{K} \sum_{i, v} (y_{i v}^{2} - 2 y_{iv} c_{k v} s_{i k} + c_{k v}^{2} s_{i k}^{2}) \end{split} \end{equation}

Recalling the definition of the mean:

\begin{equation} c_{k v} = \frac{\sum_{i} y_{iv} s_{ik}}{|S_{k}|}=\frac{\sum_{i\in S_{k}}y_{iv}}{|S_{k}|} \label{eqn:def_ck_same_mdl} \end{equation}

Substituting this into the equation above and performing a little algebraic manipulation, we obtain:

\begin{equation} K \sum_{i,v} y_{iv}^{2} = \sum_{k=1}^{K} \sum_{ v} c_{kv}^{2} |S_{k}| + K \sum_{i,v} e_{iv}^{2} \ \ \ \ (*) \end{equation}

The above Eqn. holds because:

  1. $\sum_{i} s_{ik}^2 = \sum_{i} s_{ik} = |S_{k}|$, since $s_k$ is an $N$-dimensional binary vector;
  2. for a given $k$ recalling the definition of the mean, $ c_{k v} = \frac {1}{|S_{k}|} \sum_{i \in S_{k}} y_{i v} $, we find $\sum_{v} c_{k v} \sum_{i \in S_{k}} y_{i v}= \sum_{v} c_{k v} c_{k v} |S_{k}| = |S_{k}| \sum_{v} c_{k v}^2$.
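Both facts are easy to verify numerically for a single cluster; here is a small sketch, assuming NumPy (the feature matrix and membership vector are made-up illustrative data):

```python
# Check of the two facts for one binary membership vector s_k:
#   (1) sum_i s_ik^2 = sum_i s_ik = |S_k|
#   (2) sum_v c_kv * (sum_{i in S_k} y_iv) = |S_k| * sum_v c_kv^2
import numpy as np

rng = np.random.default_rng(1)
N, V = 8, 3
Y = rng.normal(size=(N, V))                            # entity-to-feature matrix
s_k = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)  # binary membership vector
Nk = s_k.sum()                                         # |S_k|
c_k = (s_k @ Y) / Nk                                   # c_kv = mean over S_k

print(np.isclose(np.sum(s_k**2), Nk))                            # fact 1: True
print(np.isclose(np.sum(c_k * (s_k @ Y)), Nk * np.sum(c_k**2)))  # fact 2: True
```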

Am I right? (In the book there is no factor of $K$ such as I obtain in Eqn. (*).)

Thanks in advance!

BEST ANSWER

After spending several hours I finally managed to prove it properly, so that it is compatible with the proof in the textbook. (I will leave the question open for a while in case someone wants to provide another answer.)

Here is my proof:

Since we have $K$ non-overlapping clusters, we can define our model as:

\begin{equation} \label{model_f} y_{iv} = \sum_{k=1}^{K} c_{kv}s_{ik} + e_{iv}, \end{equation}

Rearranging and squaring implies:

\begin{equation} \label{model_f_sqr} e_{iv}^2 = (y_{iv} - \sum_{k=1}^{K} c_{kv}s_{ik})^2, \end{equation}

Expanding the square yields:

\begin{equation} \label{model_f_sqr_opn} e_{iv}^2 = y_{iv}^2 - 2 y_{iv} \sum_{k=1}^{K} c_{kv}s_{ik} + \sum_{k=1}^{K} c_{kv}^2 s_{ik}^2, \end{equation}

Summing over all entities and all features and rearranging a bit:

\begin{equation} \label{model_f_sqr_opn_sums} \sum_{i,v} e_{iv}^2 = \sum_{i,v} y_{iv}^2 - 2 \sum_{i,v} \sum_{k=1}^{K} y_{iv} s_{ik} c_{kv} + \sum_{i,v} \sum_{k=1}^{K} c_{kv}^2 s_{ik}^2. \end{equation}

Now we should recall two facts:

  1. the definition of the mean of cluster $k$ over feature $v$: $c_{kv} = \frac{\sum_{i}y_{iv}s_{ik}}{N_{k}}$, where $N_{k}$ is the cluster cardinality. Therefore $\sum_{i}y_{iv}s_{ik} = c_{kv} N_{k}$;

  2. since the cluster membership vectors $s_{k}$ are binary, $\sum_{i} s_{ik}^2 = \sum_{i} s_{ik} = N_{k}$.

Therefore,

w.r.t. 1), $ -2 \sum_{i,v} \sum_{k=1}^{K} y_{iv} s_{ik} c_{kv} = -2 \sum_{k=1}^{K} \sum_{v} c_{kv}^{2} N_{k}$;

And

w.r.t. 2), $ \sum_{k=1}^{K} \sum_{v} \sum_{i} c_{kv}^2 s_{ik}^2 = \sum_{k=1}^{K} \sum_{v} c_{kv}^2 N_{k}$.

Therefore, we have:

\begin{equation} \label{model_proof} \sum_{i,v} e_{iv}^2 = \sum_{i,v} y_{iv}^2 - \sum_{k=1}^{K} \sum_{v} c_{kv}^2 N_{k} \end{equation}

Rearranging this gives exactly Eq. (7.19), and the proof is complete.
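For completeness, the whole argument can be replayed in matrix form: with the binary membership matrix $S$ ($N \times K$) and centroid matrix $C$ ($K \times V$), the model reads $Y = SC + E$, and the final identity can be checked numerically. A minimal sketch, assuming NumPy and made-up data:

```python
# Matrix-form check of the final identity:
#   sum_{i,v} e_iv^2 = sum_{i,v} y_iv^2 - sum_k N_k sum_v c_kv^2
import numpy as np

rng = np.random.default_rng(2)
N, V, K = 10, 4, 3
Y = rng.normal(size=(N, V))
labels = rng.permutation(np.arange(N) % K)   # each cluster non-empty

S = np.zeros((N, K))
S[np.arange(N), labels] = 1.0                # s_ik = 1 iff i in S_k
Nk = S.sum(axis=0)                           # cluster cardinalities N_k
C = (S.T @ Y) / Nk[:, None]                  # c_kv = (sum_i y_iv s_ik) / N_k
E = Y - S @ C                                # residuals e_iv

lhs = np.sum(E**2)
rhs = np.sum(Y**2) - np.sum(Nk * np.sum(C**2, axis=1))
print(np.isclose(lhs, rhs))                  # True
```

Since the clusters are disjoint, each row of $SC$ contains exactly one centroid, which is why the matrix model coincides with the summation model $y_{iv} = \sum_{k} c_{kv}s_{ik} + e_{iv}$ above.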