One can find good explanations of the link between:
- finding the subspace onto which the projected data cloud has maximal variance (or "inertia"), and
- finding the subspace for which the distance between the projected data cloud and the original data cloud is minimal (Pythagoras' theorem shows this equivalence, and it follows that the projection is orthogonal).
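This equivalence is easy to check numerically: over any collection of candidate subspaces, the subspace with the largest projected variance is exactly the one with the smallest projection distance. A minimal NumPy sketch (the dimensions, subspace dimension, and number of candidates are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))      # columns are data points in R^5
X = X - X.mean(axis=1, keepdims=True)  # center so projected variance is well defined

variances, errors = [], []
for _ in range(30):                    # 30 random 2-dimensional subspaces
    Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))  # orthonormal basis
    X_hat = Q @ Q.T @ X                # orthogonal projection of every column
    variances.append(np.linalg.norm(X_hat, "fro") ** 2)
    errors.append(np.linalg.norm(X - X_hat, "fro") ** 2)

# Maximal projected variance and minimal projection distance pick the same subspace
print(np.argmax(variances) == np.argmin(errors))  # True
```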
However, there is a 1936 result by Eckart and Young that states the following:
$$\sum_{k=1}^r d_k u_k v_k^T = \arg \min_{\hat{X} \in M(r)} \| X - \hat{X} \|_F^2,$$ where $M(r)$ is the set of matrices of rank at most $r$. In words: the first $r$ components of the SVD of $X$ give the best low-rank approximation of $X$, where "best" is measured by the squared Frobenius norm, i.e. the sum of the squared entries of a matrix.
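The theorem can be checked numerically with NumPy: the optimal error equals the sum of the discarded squared singular values, and any other low-rank matrix does at least as badly. A sketch (the matrix shape and rank $r = 2$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))

# Full SVD: X = U @ diag(d) @ Vt, singular values d in decreasing order
U, d, Vt = np.linalg.svd(X, full_matrices=False)

r = 2
# Truncated SVD: keep the first r singular triplets (d_k, u_k, v_k)
X_hat = U[:, :r] @ np.diag(d[:r]) @ Vt[:r, :]

# Eckart-Young: the optimal squared error is the sum of the discarded d_k^2
err = np.linalg.norm(X - X_hat, "fro") ** 2
print(np.isclose(err, np.sum(d[r:] ** 2)))  # True

# Any other rank-r matrix does at least as badly, e.g. a random rank-r one
B = rng.standard_normal((6, r)) @ rng.standard_normal((r, 4))
print(np.linalg.norm(X - B, "fro") ** 2 >= err)  # True
```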
This is a general result about matrices, and at first sight it has nothing to do with data sets or dimensionality reduction (though it is in fact related). So, I would really appreciate it if you could explain (and prove) the link with the above. Thanks.
Suppose that $X$ is the matrix whose columns are the "data", and let $\hat X$ be the matrix obtained by orthogonally projecting each column onto some subspace.
First, consider any column $x$ of $X$, and let $\hat x$ be the corresponding projected column. Because the projection is orthogonal, Pythagoras' theorem gives $$ \|x\|^2 = \|\hat x\|^2 + \|x - \hat x\|^2. $$

Since this holds for every column of $X$, summing over the columns yields $$ \|X\|_F^2 = \|\hat X\|_F^2 + \|X - \hat X\|_F^2. \tag{*} $$

Moreover, since the columns of $X$ have been appropriately centered, the projected data (which also have mean $0$) have total variance $\|\hat X\|_F^2$.

With all that said, the observation that maximizing the variance of the projected data is equivalent to minimizing the total distance from the original data to its projection is precisely the statement that $$ \arg \max_{\hat X} \|\hat X\|_F = \arg \min_{\hat X} \|X - \hat X\|_F, $$ where $\hat X$ ranges over orthogonal projections of $X$ onto $r$-dimensional subspaces. Since the left-hand side of $(^*)$ does not depend on the choice of subspace, this is immediate.
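The identity $(^*)$ itself can be verified numerically for any orthogonal projection (a NumPy sketch; the dimensions and the particular subspace are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 8))         # columns are data points in R^5
X = X - X.mean(axis=1, keepdims=True)   # center each variable (row) at mean 0

# Orthogonal projector onto an arbitrary 2-dimensional subspace
Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))  # orthonormal basis
P = Q @ Q.T                              # projection matrix
X_hat = P @ X

# Column-wise Pythagoras summed over columns:
# ||X||_F^2 = ||X_hat||_F^2 + ||X - X_hat||_F^2
lhs = np.linalg.norm(X, "fro") ** 2
rhs = np.linalg.norm(X_hat, "fro") ** 2 + np.linalg.norm(X - X_hat, "fro") ** 2
print(np.isclose(lhs, rhs))  # True
```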