I have come across a theorem stating that the $d$-dimensional subspace found by PCA is the optimal approximation of a probability space by such a plane, in the sense that it minimises the expectation of the squared orthogonal distance.
I could not find a proof of that fact, either in the book on PCA by Jolliffe or in standard statistics textbooks. However, I found some sketches of a proof in a discrete setting, where finitely many points are to be approximated by a $d$-dimensional plane. I didn't see how to transfer these results to a general probability space, though.
I would highly appreciate any pointers to the literature or general comments on the version with probability spaces.
This is my version of the actual theorem:
Let $\Omega \subset R^n$ and $(\Omega, A, \mu)$ be a probability space. Then define the mean and covariance matrix
$$m = \int_{\Omega} x\ d\mu(x) = E[Id] \in R^n\\ cov = E[(Id-m)(Id-m)^T] \in R^{n,n}$$
Now consider the orthogonal eigen-decomposition $$ cov = \Phi \Sigma \Phi^T,$$ where $\Phi \in R^{n,n}$ is an orthogonal matrix and $\Sigma = \text{diag}(\lambda_1,...,\lambda_n)$ diagonal with $\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_n \ge 0$. Let $v_i \in R^n$ be the columns of $\Phi$. For $d \in N$, $d < n$ let $W$ be the $d$-dimensional affine subspace $$W := \text{span}(v_1,...,v_d) + m.$$
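Then the following holds: $$W \in \underset{\Pi}{\text{argmin}} \int_{\Omega} \| x-P_{\Pi}(x) \|^2 d\mu(x) = \underset{\Pi}{\text{argmin}} \, E[\| x-P_{\Pi}(x) \|^2],$$ where $\Pi$ ranges over all $d$-dimensional affine subspaces of $R^n$ and $P_{\Pi}$ denotes the orthogonal projection onto $\Pi$. (The minimiser is unique only if $\lambda_d > \lambda_{d+1}$, hence $\in$ rather than $=$.)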
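To make the statement concrete, here is a minimal numerical sanity check in the discrete setting, where $\mu$ is the empirical measure of finitely many equally weighted points. It is only a sketch and not a proof. The sample data, the sizes `n`, `d`, `N`, and the helper names `mean_sq_dist` and `random_subspace` are my own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: mu is the empirical measure of N equally weighted points in R^n.
n, d, N = 5, 2, 2000
A = rng.normal(size=(n, n))
X = rng.normal(size=(N, n)) @ A.T + rng.normal(size=n)

# Mean and covariance with respect to the empirical measure (normalised by N).
m = X.mean(axis=0)
cov = (X - m).T @ (X - m) / N

# Orthogonal eigen-decomposition, eigenvalues sorted in decreasing order.
lam, Phi = np.linalg.eigh(cov)
order = np.argsort(lam)[::-1]
lam, Phi = lam[order], Phi[:, order]


def mean_sq_dist(X, a, U):
    """Mean squared orthogonal distance from the rows of X to a + span(U).

    U is assumed to have orthonormal columns.
    """
    R = X - a
    R = R - (R @ U) @ U.T  # remove the component lying in span(U)
    return np.mean(np.sum(R**2, axis=1))


def random_subspace(rng, n, d):
    """Hypothetical helper: random orthonormal basis of a d-dimensional subspace."""
    Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
    return Q


# Error of the PCA subspace W = m + span(v_1, ..., v_d).
pca_err = mean_sq_dist(X, m, Phi[:, :d])

# Compare against random affine subspaces, both through m and through random offsets.
gaps = []
for _ in range(200):
    U = random_subspace(rng, n, d)
    gaps.append(mean_sq_dist(X, m, U) - pca_err)
    gaps.append(mean_sq_dist(X, rng.normal(size=n), U) - pca_err)
min_gap = min(gaps)

print("PCA error:", pca_err)
print("sum of discarded eigenvalues:", lam[d:].sum())
print("smallest gap to a random competitor:", min_gap)
```

In this sketch the PCA error should agree with $\lambda_{d+1} + \dots + \lambda_n$, and no random competitor should do better. That is consistent with the claim but of course says nothing about a general probability space.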
Thanks in advance!