I found an unclear part in the derivation of PCA in the lecture notes of A. Bandeira for the MIT Fall 2015 course 18.S096: Topics in Mathematics of Data Science.
In $\S 1.1.1$ the author derives PCA as the best $d$-dimensional affine fit as follows. We take data points $x_1,\ldots,x_n \in \mathbb{R}^p$ and search for their representation (as coordinates $\beta_k$) in a $d$-dimensional affine subspace, defined by a shift $\mu$ and an orthonormal basis $V=[v_1,\ldots,v_d]$, via the least-squares fit problem $$ \min\limits_{\mu,\beta_k,V:V^TV=I}\sum\limits_{k=1}^{n}\|x_k - (\mu + V\beta_k)\|^2_2. $$
First we optimize over $\mu$ using the first-order condition $$\nabla_\mu\sum\limits_{k=1}^{n}\|x_k - (\mu + V\beta_k)\|^2_2=0,$$ which holds if and only if $$\left(\sum\limits_{k=1}^{n}x_k\right) -\mu n - V\left(\sum\limits_{k=1}^{n}\beta_k\right)=0.$$
Question:
At this point the author asserts that $\sum\limits_{k=1}^{n}\beta_k=0$ and goes on with the proof (which is fine), but I could find no rigorous reason why this should be true. Simple examples suggest the fact (say, take two 2d points and fit a line to them), but I would appreciate an explanation.
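For what it's worth, here is the two-point example above checked numerically with numpy (my own sketch, not from the notes): taking $\mu$ to be the sample mean and $v$ the top principal direction, the coefficients $\beta_k = v^T(x_k-\mu)$ do sum to zero.

```python
import numpy as np

# Two points in R^2; fit a 1-dimensional affine subspace (a line).
x = np.array([[0.0, 0.0],
              [2.0, 4.0]])  # rows are the data points x_k

mu = x.mean(axis=0)                      # take mu = sample mean
centered = x - mu
# Best direction v: top right singular vector of the centered data
# (the first PCA direction).
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
v = Vt[0]                                # unit vector, d = 1

beta = centered @ v                      # beta_k = v^T (x_k - mu)
print(beta.sum())                        # ~0: the coefficients cancel
```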
I cannot comment (I don't have enough reputation), so I am posting this as an answer. I am also very interested to see how your lecturer concludes this, especially since it seems to be evident to him. It is not fully clear, but note what the notes say: we try to find the best $d$-dimensional affine subspace such that the projections of $x_1, \ldots, x_n$ onto it best approximate the original points $x_1, \ldots, x_n$. So for fixed $\mu$ and $V$, the optimal coefficients are the projection coordinates, $(\beta_k)_i = v_i^T (x_k - \mu)$, i.e. $\beta_k = V^T(x_k - \mu)$: this is the projection of $x_k-\mu$ onto $v_i$. The approximation is then $\widehat{x}_k=\mu+\sum_{i} (\beta_{k})_{i} v_i = \mu + VV^T(x_k-\mu)$. Summing over $k$ gives $\sum_k \beta_k = V^T\left(\sum_k x_k - n\mu\right)$. Moreover, the parametrization has a shift freedom, $\mu \mapsto \mu + Vc$, $\beta_k \mapsto \beta_k - c$, which leaves every $\mu + V\beta_k$ unchanged; so one may assume $\mu = \frac{1}{n}\sum_k x_k$ without loss of generality, and with that choice $\sum_k \beta_k = 0$ indeed holds.
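To back this up numerically, here is a small sketch (my own, with made-up random data) checking two things: that the projection formula $\beta_k = V^T(x_k-\mu)$ agrees with an explicit least-squares solve for each $\beta_k$, and that choosing $\mu$ as the sample mean makes the coefficients sum to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 50, 5, 2
X = rng.normal(size=(n, p))              # rows are data points x_k

mu = X.mean(axis=0)                      # gauge choice: mu = sample mean
# Orthonormal V from the top-d right singular vectors of the centered data
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
V = Vt[:d].T                             # p x d, satisfies V^T V = I

# Claimed optimal coefficients: projections of x_k - mu onto the columns of V
beta = (X - mu) @ V                      # n x d

# Cross-check each beta_k against an explicit least-squares solve
# min_b ||(x_k - mu) - V b||_2
beta_ls = np.linalg.lstsq(V, (X - mu).T, rcond=None)[0].T
print(np.allclose(beta, beta_ls))        # True
print(np.abs(beta.sum(axis=0)).max())    # ~0 since mu is the mean
```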