I'm currently working my way through a textbook showing the derivation of the principal component analysis algorithm. However, right at the end of the section, we arrive at the equation
$$ d^* = \underset{d}{\operatorname{arg\,max}} \operatorname{Tr}(d^TX^TXd) \quad \text{subject to} \quad d^Td = 1. $$
This is followed by:
This optimization problem may be solved using eigendecomposition. Specifically, the optimal $d$ is given by the eigenvector of $X^TX$ corresponding to the largest eigenvalue.
I don't follow why this is the case.
In case the matrix size matters here, $X$ is an $m \times n$ matrix and $d$ is an $n$-dimensional vector.
As given, one can write the Lagrangian for the optimization problem (dropping the trace, since $d^TX^TXd$ is a scalar) as
$$ \mathcal{L}(d, \lambda) = d^TX^TXd + \lambda(d^Td - 1) $$
By the usual KKT conditions (which are necessary, but not sufficient), we know that we must have $\nabla_d \mathcal{L} = 0$, that is
$$ \nabla_d \mathcal{L} = \nabla_d(d^TX^TXd + \lambda(d^Td - 1)) = 2X^TXd + 2\lambda d = 0, $$ or $$ X^TXd = -\lambda d = \lambda' d, $$ which is the usual eigenvalue equation (where $\lambda' = -\lambda$ is defined for convenience). Note that at any such stationary point the objective equals $d^TX^TXd = \lambda' d^Td = \lambda'$; the fact that the largest such $\lambda'$ therefore maximizes the original optimization problem is left as an exercise for the reader!
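If you want a numerical sanity check of this, here is a minimal sketch, assuming numpy and using an arbitrary random matrix as a stand-in for $X$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 5
X = rng.standard_normal((m, n))  # arbitrary stand-in for the data matrix
A = X.T @ X                      # n x n, symmetric positive semi-definite

# eigh returns eigenvalues in ascending order for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(A)
d = eigvecs[:, -1]               # unit eigenvector of the largest eigenvalue

# The eigenvalue equation X^T X d = lambda' d holds at the optimum
assert np.allclose(A @ d, eigvals[-1] * d)

# The objective at d equals the largest eigenvalue ...
assert np.isclose(d @ A @ d, eigvals[-1])

# ... and no random unit vector does better
V = rng.standard_normal((n, 1000))
V /= np.linalg.norm(V, axis=0)
assert (np.einsum('ij,ij->j', V, A @ V) <= eigvals[-1] + 1e-9).all()
```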
A second, more elementary and enlightening proof, which does not rely on KKT, is the following.
Note that $X^TX$ is symmetric positive semi-definite, hence its eigenvectors are (or, in the case of repeated eigenvalues, can be chosen to be) orthogonal and span the space. This means that for any unit vector $v$, writing $X^TXu_i = \lambda_i u_i$ with $u_i^Tu_i = 1$, we have
$$ v = \sum_i v^i u_i $$
and $\sum_i(v^i)^2 = 1$, where $v^i = u_i^Tv$. The idea is to express $v$ in the basis of eigenvectors of $X^TX$: in that basis $X^TX$ acts as $\text{diag}(\lambda_i)$, so the objective becomes separable over the coordinates $v^i$, which makes it clear that we should put all the weight on the eigenvector with the largest eigenvalue.
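Before carrying out that computation, here is a quick numerical check of the expansion itself (again a sketch assuming numpy, with an arbitrary random matrix standing in for the data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
A = X.T @ X

lam, U = np.linalg.eigh(A)   # columns of U are orthonormal eigenvectors u_i

v = rng.standard_normal(5)
v /= np.linalg.norm(v)       # a random unit vector

c = U.T @ v                  # coordinates v^i = u_i^T v in the eigenbasis

assert np.allclose(U @ c, v)          # v = sum_i v^i u_i
assert np.isclose(np.sum(c**2), 1.0)  # sum_i (v^i)^2 = 1
```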
Now, plugging this expansion into the original expression gives
$$ \begin{align} v^TX^TX v &= \left(\sum_i v^i u_i\right)^TX^TX\left(\sum_j v^j u_j\right) \\ &= \left(\sum_i v^i u_i\right)^T\left(\sum_j v^j X^TX u_j\right)\\ &= \left(\sum_i v^i u_i\right)^T\left(\sum_j \lambda_j v^j u_j\right) \\ &= \sum_{ij} v^i v^j \lambda_j u_i^Tu_j \end{align} $$
but since the $u_i$ are orthonormal, we have $u_i^Tu_j = 0$ for $i \neq j$ (so all cross-terms vanish) and $u_i^Tu_i = 1$, thus
$$ \sum_{ij} v^i v^j \lambda_j u_i^Tu_j = \sum_i (v^i)^2\lambda_i = v^TX^TX v. $$
This objective is clearly separable over the $v^i$, so it's maximized when we make the term containing the greatest $\lambda_i$ the maximal term: since the weights $(v^i)^2$ are non-negative and sum to one, the objective is a weighted average of the eigenvalues, hence at most $\max_i \lambda_i$, with equality when $v$ is a corresponding unit eigenvector. Of course, note that the constraint is non-linear, so making this into a formal proof requires a (relatively simple) argument which I leave for the reader.
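And a final numerical sanity check of this last step, under the same assumptions as the sketches above (numpy, arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 5))
A = X.T @ X
lam, U = np.linalg.eigh(A)

# For random unit vectors, the objective equals the lambda-weighted
# sum of squared eigenbasis coordinates, and is bounded by lam.max()
for _ in range(1000):
    v = rng.standard_normal(5)
    v /= np.linalg.norm(v)
    c = U.T @ v
    obj = v @ A @ v
    assert np.isclose(obj, np.sum(lam * c**2))
    assert obj <= lam.max() + 1e-9

# Equality is attained at the eigenvector of the largest eigenvalue
u_max = U[:, -1]
assert np.isclose(u_max @ A @ u_max, lam.max())
```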