Linear Algebra: Principal Component Analysis Optimization


I'm currently working my way through a textbook that derives the principal component analysis algorithm. Right at the end of the section, we arrive at the optimization problem

$d^* = \underset{d}{\operatorname{arg\,max}} \operatorname{Tr}(d^TX^TXd)$ subject to $d^Td = 1$.

It follows with:

This optimization problem may be solved using eigendecomposition. Specifically, the optimal d is given by the eigenvector of X$^T$X corresponding to the largest eigenvalue.

I don't follow why this is the case.

In case the matrix size matters here, X is an m by n matrix, and d is an n dimensional vector.


There are 2 solutions below.

BEST ANSWER

As given, one can write the Lagrangian for the optimization problem as

$$ \mathcal{L}(d, \lambda) = d^TX^TXd + \lambda(d^Td - 1) $$

By the usual KKT conditions (which are necessary, but not sufficient), we know that we must have $\nabla \mathcal{L} = 0$, that is

$$ \nabla \mathcal{L} = \nabla(d^TX^TXd + \lambda(d^Td - 1)) = 2X^TXd +2\lambda d = 0 $$ or that $$ X^TXd = - \lambda d = \lambda' d $$ which is the usual eigenvalue equation (where $\lambda' = -\lambda$ is defined for convenience). The fact that the largest such $\lambda'$ maximizes the original optimization problem is left as an exercise for the reader! (Hint: at any such critical point, $d^TX^TXd = \lambda' d^Td = \lambda'$, so the objective value equals the corresponding eigenvalue, and the largest eigenvalue gives the largest objective.)
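If it helps to see this numerically, here is a small NumPy sanity check. The data matrix `X` and its dimensions are made up purely for illustration; the point is that the objective evaluated at the top eigenvector of $X^TX$ equals the largest eigenvalue, and no random unit vector does better.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))   # illustrative m-by-n data, m = 50, n = 5
A = X.T @ X                        # symmetric positive semi-definite matrix

# eigh returns eigenvalues in ascending order with orthonormal eigenvectors
eigvals, eigvecs = np.linalg.eigh(A)
d = eigvecs[:, -1]                 # eigenvector for the largest eigenvalue

# Objective value at the top eigenvector equals the largest eigenvalue
top = d @ A @ d
assert np.isclose(top, eigvals[-1])

# No random unit vector attains a larger objective value
for _ in range(1000):
    v = rng.standard_normal(5)
    v /= np.linalg.norm(v)
    assert v @ A @ v <= top + 1e-9
```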


A second, more elementary and enlightening proof, not relying on KKT, is the following.

Note that $X^TX$ is symmetric (and positive semi-definite), hence its eigenvectors can be chosen orthonormal and span the space. This means that for any unit vector $v$ we can write, where $X^TXu_i = \lambda_i u_i$ and $u_i^Tu_i = 1$,

$$ v = \sum_i v^i u_i $$

and $\sum_i(v^i)^2 = 1$, where $v^i = u_i^Tv$. The idea is to express $v$ in the eigenbasis of $X^TX$: on that basis the map $X^TX$ acts simply by scaling with $\text{diag}(\lambda_i)$, so the objective becomes separable over the coordinates $v^i$, and it becomes clear that we should put all the weight on the largest eigenvalue.

Now, plugging this in to the original expression gives

$$ \begin{align} v^TX^TX v &= \left(\sum_i v^i u_i\right)^TX^TX\left(\sum_j v^j u_j\right) \\ &= \left(\sum_i v^i u_i\right)^T\left(\sum_j v^j X^TX u_j\right)\\ &= \left(\sum_i v^i u_i\right)^T\left(\sum_j \lambda_j v^j u_j\right) \\ &= \sum_{ij} v^i v^j \lambda_j u_i^Tu_j \end{align} $$

but since all the $u_i$ are orthonormal, the cross-terms vanish ($u_i^Tu_j = 0$ for $i \ne j$) and $u_i^Tu_i = 1$, thus

$$ \sum_{ij} v^i v^j \lambda_j u_i^Tu_j = \sum_i (v^i)^2\lambda_i = v^TX^TX v. $$

This objective is separable over the $v^i$, so it is maximized by putting all the weight on the largest $\lambda_i$, i.e. by taking $v$ to be the corresponding eigenvector. Of course, the constraint $\sum_i (v^i)^2 = 1$ is non-linear, so this last step requires a (relatively simple) proof, which I leave for the reader.
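The expansion above can also be checked numerically. In this sketch (again with an arbitrary random `X`, chosen only for illustration), the coordinates $v^i = u_i^Tv$ of a unit vector remain unit-norm, and the quadratic form reduces to $\sum_i (v^i)^2 \lambda_i$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
A = X.T @ X

lam, U = np.linalg.eigh(A)     # columns of U are orthonormal eigenvectors u_i

v = rng.standard_normal(4)
v /= np.linalg.norm(v)         # arbitrary unit vector

c = U.T @ v                    # coordinates v^i = u_i^T v in the eigenbasis
assert np.isclose(np.sum(c**2), 1.0)              # orthonormal change of basis preserves norm
assert np.isclose(v @ A @ v, np.sum(c**2 * lam))  # the separable form of the objective
```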

ANOTHER ANSWER

Without invoking KKT or Lagrangian Multipliers, you could simply recast the problem to an unconstrained form and solve it.

Consider a vector $y$ of length $\gamma={\sqrt {y^Ty}}$ and note how varying the vector affects its length $$\eqalign{ \gamma^2 &= y^Ty \cr \gamma\,d\gamma &= y^T\,dy \cr d\gamma &= \gamma^{-1}y^T\,dy \cr }$$ It's easy to construct a normalized vector: $n=\gamma^{-1}y$.
In order to write the objective function a bit more succinctly, let $w=Xy$

Find the differential and gradient of the objective function $$\eqalign{ \lambda &= \gamma^{-2}\,w^Tw \cr d\lambda &= 2\gamma^{-2}w^T\,dw - 2\gamma^{-3}w^Tw\,d\gamma \cr &= 2\gamma^{-2}w^T\,dw - 2\lambda\gamma^{-1}\,d\gamma \cr &= 2\gamma^{-2}y^TX^TX\,dy - 2\lambda\gamma^{-2}\,y^T\,dy \cr &= 2\gamma^{-2}\Big(X^TXy - \lambda y\Big)^T\,dy \cr \frac{\partial\lambda}{\partial y} &= 2\gamma^{-2} (X^TXy - \lambda y) \cr }$$ Setting the gradient to zero leaves you with an eigenvalue equation, which you can write in terms of either the normalized or denormalized vector $$\eqalign{ X^TX\,y = \lambda y \cr X^TX\,n = \lambda n \cr }$$ Note that at such a stationary point $\lambda = \gamma^{-2}w^Tw$ is exactly the objective value (the Rayleigh quotient), so the constrained maximum is attained at the eigenvector with the largest eigenvalue.
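As a sketch of this last calculation (with an arbitrary random `X`, not taken from the original post), one can check the closed-form gradient of the Rayleigh quotient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))
A = X.T @ X

def rayleigh(y):
    # lambda = gamma^{-2} w^T w with w = X y, i.e. (y^T A y) / (y^T y)
    return (y @ A @ y) / (y @ y)

y = rng.standard_normal(3)
lam = rayleigh(y)
grad = 2.0 / (y @ y) * (A @ y - lam * y)   # closed-form gradient derived above

# Central finite differences agree with the closed form
eps = 1e-6
fd = np.array([(rayleigh(y + eps * e) - rayleigh(y - eps * e)) / (2 * eps)
               for e in np.eye(3)])
assert np.allclose(grad, fd, atol=1e-5)
```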