In the projected-variance-maximisation approach to PCA, we are given the orthogonal projection $P_U$ of unlabelled data onto the space spanned by the columns of $U$, such that the projection of a data point $x^{(i)}$ is given by $P_Ux^{(i)}=U(U^TU)^{-1}U^Tx^{(i)}$.
How can we prove that PCA leads to an optimisation problem which can be solved by finding the stationary points of the following Lagrangian:
$$tr(U^TSU) + tr(H(I-U^TU))?$$
And, in doing so, how can we show that $S = \frac{1}{n}X^TX$ and $[H]_{ij} = \begin{cases} \lambda^{[i,i]}, & i=j, \\ \frac{\lambda^{[i,j]}}{2}, & i \neq j? \end{cases}$
I understand that in the ordinary PCA setting, where we only require orthonormality $u^{[i]} \cdot u^{[j]} = \delta_{ij}$, we compute the sample variance of the projected data using:
$\displaystyle \frac{1}{n} \sum^n_{i=1} \left( u^{[j]} \cdot x^{(i)} - u^{[j]} \cdot \hat{x} \right)^2 = u^{[j]T}Su^{[j]},$
where $\hat x$ is the sample mean and $S$ is the sample covariance matrix $\frac{1}{n}X^TX$ (taking the rows of $X$ to be the centred data points).
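As a sanity check on this identity, here is a quick numerical sketch (with made-up random data) confirming that the sample variance of the projections onto a unit vector $u$ equals $u^TSu$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)       # centre the data so that S = (1/n) X^T X

S = X.T @ X / n              # sample covariance matrix

u = rng.normal(size=p)
u = u / np.linalg.norm(u)    # a unit direction u^[j]

proj = X @ u                 # projections u^[j] . x^(i) (data already centred)
var_proj = np.mean(proj**2)  # sample variance of the projected data

assert np.isclose(var_proj, u @ S @ u)
```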
With this we form the constrained problem $\displaystyle \operatorname{arg\,max}_{ \{ u^{[j]} \}^d_{j=1}} L$, where $ \displaystyle L = \sum^d_{j=1} u^{[j]T}Su^{[j]}$, subject to $ \{ u^{[j]} \}^d_{j=1}$ being orthonormal.
Obtaining the first part of the expression, $tr(U^TSU)$, therefore sort of makes sense, and showing that $S = \frac{1}{n}X^TX$ is not a problem either. I am confused about the second term, $tr(H(I-U^TU))$.
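What I have tried so far: if I build $H$ exactly as stated (diagonal entries $\lambda^{[i,i]}$, off-diagonal entries $\lambda^{[i,j]}/2$), then numerically the trace term does seem to collect each constraint $u^{[i]} \cdot u^{[j]} = \delta_{ij}$ exactly once, with the halved off-diagonal multipliers compensating for the fact that each off-diagonal position appears twice in the symmetric matrix $I - U^TU$. A sketch of that check (random $U$ and multipliers, no claim this is the intended derivation):

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 5, 3
U = rng.normal(size=(p, d))    # columns u^[1..d], not yet orthonormal

# Multipliers lambda^[i,j] for the d(d+1)/2 constraints u^[i] . u^[j] = delta_ij
lam = rng.normal(size=(d, d))
lam = (lam + lam.T) / 2        # symmetric choice of multipliers

# Build H as stated: diagonal lambda^[i,i], off-diagonal lambda^[i,j] / 2
H = lam / 2
np.fill_diagonal(H, np.diag(lam))

# Left side: the trace term from the Lagrangian
lhs = np.trace(H @ (np.eye(d) - U.T @ U))

# Right side: the usual sum of multiplier * constraint, each constraint once
G = U.T @ U
rhs = sum(lam[i, i] * (1 - G[i, i]) for i in range(d)) \
    + sum(lam[i, j] * (0 - G[i, j]) for i in range(d) for j in range(i + 1, d))

assert np.isclose(lhs, rhs)
```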
Could anyone clarify this please?