I am working on a presentation on Spectral relaxation of k-means clustering (http://papers.nips.cc/paper/1992-spectral-relaxation-for-k-means-clustering.pdf) and I am a bit stuck.
I understand Section 2 - they show that for k-means we need to minimize the sum of squares function and that, in fact, problem can be transformed into maximizing $trace(X^TA^TAX)$ where $X$ has special orthonormal form. We relax the problem and say that $X$ is just a orthonormal matrix without special form.
Then Ky-Fan theorem states that the maximum of $trace(X^THX)$ where $H$ is symmetric matrix is equal to sum of largest $k$ eigenvalues of $H$ and that optimal $X$ is equal to $[u_{1},...,u_{k}]Q$ where $Q$ is arbitrary orthogonal matrix.
First thing that I am not sure is why is he now (at the end of Section 2) making $X$ to be a matrix of $k$ largest eigenvectors from $A^TA$ and why can we now continue with that matrix?
Now in section 3 he uses Davis-Kahan theorem and gets that $$X_{k} = Y_{k}V$$(ignoring the error) where $V$ is $k-by-k$ orthogonal matrix.
Second (and more important) thing that I don't understand is now the next paragraph (from this last formula up to the beginning section 4). They say that the idea from section 3 can be implemented by using a column of $X^T_{k}$ with largest norm, orthogonalizing other columns against that column and then looking what vectors have small norm and so on. I am not sure why can we do that and how come did they get to that?
Also, how come it turns out that this can be boiled down to $QR$-decomposition and what are we doing now with this matrices $R$ and $\hat{R}$?