In *Deep Learning* (Goodfellow et al.), the optimization objective of PCA is formulated as
$$D^* = \arg\min_D ||X - XDD^T||_F^2 \quad \text{s.t.} \quad D^T D=I$$
The book gives the proof of the $1$-dimension case, i.e.
$$\arg\min_{d} || X - X dd^T||_F^2 \quad \text{s.t.} \quad d^T d = 1$$
is solved by the eigenvector of $X^TX$ corresponding to the largest eigenvalue, and the authors state that the general case (when $D$ is an $m \times l$ matrix with $l > 1$) can be proved by induction.
Could anyone please show me how to prove this by induction?
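(Not part of the proof, but here is a minimal NumPy sanity check of the claim. The sizes `m`, `n`, `l` and the Gaussian data are arbitrary choices of mine for illustration; the script compares the reconstruction error of the top-$l$ eigenvectors of $X^TX$ against random matrices with orthonormal columns.)

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, l = 50, 8, 3                       # arbitrary demo sizes: X is m x n, D is n x l
X = rng.standard_normal((m, n))

# Top-l eigenvectors of X^T X (eigh returns eigenvalues in ascending order).
_, V = np.linalg.eigh(X.T @ X)
D_top = V[:, ::-1][:, :l]

def recon_error(D):
    """Frobenius reconstruction error ||X - X D D^T||_F^2."""
    return np.linalg.norm(X - X @ D @ D.T, 'fro') ** 2

# Random orthonormal candidates: the Q factor of a Gaussian matrix.
best_random = min(
    recon_error(np.linalg.qr(rng.standard_normal((n, l)))[0])
    for _ in range(2000)
)
print(recon_error(D_top) <= best_random)   # True: the eigenvector D does at least as well
```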
I know that when $D^T D = I$:
$$D^* = \arg\min_D ||X - XDD^T||_F^2 = \arg\max_D \operatorname{tr}\left(D^T X^T X D\right)$$
(the minimization turns into a maximization because $||X - XDD^T||_F^2 = \operatorname{tr}(X^T X) - \operatorname{tr}(D^T X^T X D)$ under the constraint),
and
$$\operatorname{tr}\left(D^T X^T X D\right) = \left(\sum_{i=1}^{l-1} \left(d^{(i)}\right)^T X^T X\, d^{(i)}\right) + \left(d^{(l)}\right)^T X^T X\, d^{(l)}$$
where $d^{(i)}$ denotes the $i$-th column of $D$. By the induction hypothesis, the first sum is maximized when each $d^{(i)}$ is the eigenvector of $X^T X$ corresponding to the $i$-th largest eigenvalue. But how can I be sure that the sum as a whole is then also maximal?
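(A quick numerical check of the two displays above, again with arbitrary made-up sizes: both the min/max equivalence and the column-wise split of the trace hold for any $D$ with orthonormal columns.)

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, l = 50, 8, 3                                  # arbitrary demo sizes
X = rng.standard_normal((m, n))
D = np.linalg.qr(rng.standard_normal((n, l)))[0]    # random D with D^T D = I

# ||X - X D D^T||_F^2 = tr(X^T X) - tr(D^T X^T X D) under the constraint,
# which is why minimizing the error means maximizing the trace.
err = np.linalg.norm(X - X @ D @ D.T, 'fro') ** 2
print(np.isclose(err, np.trace(X.T @ X) - np.trace(D.T @ X.T @ X @ D)))  # True

# The trace splits into one quadratic form per column d^(i).
col_sum = sum(D[:, i] @ X.T @ X @ D[:, i] for i in range(l))
print(np.isclose(np.trace(D.T @ X.T @ X @ D), col_sum))                  # True
```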
I am actually at the same point, and I was wondering whether you found an answer.
My idea was to use Fan's theorem, which says:
Let $A \in S^n$ be a symmetric matrix with eigenvalues $\lambda_1 \geqslant \cdots \geqslant \lambda_n$. Then
$$\lambda_1 + \cdots + \lambda_k = \max\left\{\operatorname{tr}\left(A D D^T\right) : D \in \mathbb{R}^{n \times k},\ D^T D = I_k\right\}.$$
(I write $A$ here instead of the theorem's usual $X$ to avoid a clash with the data matrix above; taking $A = X^T X$ and $k = l$ gives exactly $\operatorname{tr}(X^T X D D^T) = \operatorname{tr}(D^T X^T X D)$, the quantity we want to maximize.)
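To see the theorem in action, here is a minimal numerical sketch (sizes are arbitrary, and the random symmetric $A$ stands in for $X^T X$): the trace at any orthonormal $D$ stays below $\lambda_1 + \cdots + \lambda_k$, and the bound is attained by the top-$k$ eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 3                                # arbitrary demo sizes
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                          # random symmetric matrix

w, V = np.linalg.eigh(A)                   # ascending eigenvalues / eigenvectors
top_k_sum = w[::-1][:k].sum()              # lambda_1 + ... + lambda_k

# tr(A D D^T) over random orthonormal D never exceeds the top-k eigenvalue sum ...
traces = [
    np.trace(A @ D @ D.T)
    for D in (np.linalg.qr(rng.standard_normal((n, k)))[0] for _ in range(5000))
]
print(max(traces) <= top_k_sum + 1e-9)     # True

# ... and the bound is attained when D's columns are the top-k eigenvectors.
D_star = V[:, ::-1][:, :k]
print(np.isclose(np.trace(A @ D_star @ D_star.T), top_k_sum))  # True
```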