I was wondering if I could get a mathematical description of the relationship between the singular value decomposition (SVD) and the principal component analysis (PCA).
To be more specific, there are some points that I don't understand very well, at least from a mathematical point of view.
What are the principal components (PCs), bearing in mind that we are using the SVD to compute the PCA?
Which part of the SVD becomes the PCs?
What is the relationship between the orthogonal matrices and the diagonal matrix from the SVD on the one hand, and the scores and loadings of the PCA on the other?
I have read that the principal components can describe big data sets with very few loading vectors. How is this (mathematically) possible? In what way can these few principal components tell me something about the variance of a big data set, and what does the SVD have to do with this process?
I have tried to be very specific in posing my questions; if something is still not clear, I apologize.
Thanks in advance for your help!
PS: I have done my homework and looked very closely at: What is the intuitive relationship between SVD and PCA? but I could not find the answers I am looking for. I believe mine are more related to the mathematical concepts than the practical ones. Anyway, if you believe this question is unnecessary (a duplicate) I will remove it.
Suppose we have a bunch of large vectors $x_1,\ldots,x_N$ stored as the columns of a matrix $X$. It would be nice if we could somehow find a small number of vectors $u_1,\ldots,u_s$ such that each vector $x_i$ is (to a good approximation) equal to a linear combination of the vectors $u_1,\ldots, u_s$. This would allow us to describe each of the (very large) vectors $x_i$ using just a small number of coefficients.
So we want to find vectors $u_1,\ldots, u_s$ such that for each $x_i$ we have \begin{equation} x_i \approx c_{i,1} u_1 + c_{i,2} u_2 + \cdots + c_{i,s} u_s \end{equation} for some coefficients $c_{i,1},\ldots, c_{i,s}$.
These $N$ equations ($i$ goes from $1$ to $N$) can be combined into one single matrix equation: \begin{equation} X \approx U C \end{equation} for some matrix $C$. (Here the columns of $U$ are $u_1,\ldots, u_s$.)
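To make the stacked matrix form concrete, here is a small illustration (assuming NumPy; the names $U$, $C$, $X$ follow the text): if every column $x_i$ is exactly a combination of $u_1,\ldots,u_s$, then $X = UC$, and column $i$ of $C$ holds the coefficients $c_{i,1},\ldots,c_{i,s}$.

```python
import numpy as np

rng = np.random.default_rng(2)
U = rng.standard_normal((1000, 3))   # s = 3 basis vectors of length 1000
C = rng.standard_normal((3, 8))      # coefficients for N = 8 data vectors
X = U @ C                            # each column of X lies in span(U)

# Each (large, length-1000) vector x_i is reproduced from just 3 coefficients.
x_0 = X[:, 0]
print(np.allclose(x_0, U @ C[:, 0]))  # True
```

So each length-1000 vector is described by only 3 numbers once $U$ is fixed; the interesting case is when the equality is only approximate.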
Note that the rank of $UC$ is less than or equal to $s$. So $UC$ is a low rank approximation of $X$.
Here is the key fact: the SVD gives us an optimal low rank approximation of $X$ (optimal in both the spectral and Frobenius norms, by the Eckart–Young theorem)! That is one of the basic facts about the SVD, and it is why the SVD can be used for image compression.
If the SVD of $X$ is expressed as \begin{equation} X = \sum_{i=1}^N \sigma_i u_i v_i^T, \end{equation} where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_N$, then an optimal approximation of $X$ of rank less than or equal to $s$ is obtained by keeping only the first $s$ terms: \begin{align} X &\approx \sum_{i=1}^s \sigma_i u_i v_i^T \\ &= U \Sigma V^T \\ &= U C \end{align} where $U$ is the matrix with columns $u_1,\ldots, u_s$, $\Sigma$ is the $s \times s$ diagonal matrix with entries $\sigma_1,\ldots,\sigma_s$, $V$ is the matrix with columns $v_1,\ldots,v_s$, and $C = \Sigma V^T$.
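This truncation can be sketched numerically (assuming NumPy; the names $X$, $U$, $C$, $s$ match the formulas above). The spectral-norm error of the rank-$s$ approximation comes out to exactly $\sigma_{s+1}$, which is the optimality guarantee mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))   # 20 data vectors of length 50, as columns

# Full (thin) SVD: X = U_full @ diag(sigma) @ Vt_full
U_full, sigma, Vt_full = np.linalg.svd(X, full_matrices=False)

s = 5
U = U_full[:, :s]                        # first s left singular vectors
C = np.diag(sigma[:s]) @ Vt_full[:s, :]  # C = Sigma V^T
X_s = U @ C                              # rank-s approximation of X

# The spectral-norm error equals the first discarded singular value.
err = np.linalg.norm(X - X_s, 2)
print(np.isclose(err, sigma[s]))  # True
```

No rank-$s$ matrix can do better than this error, which is why truncating the SVD is the right way to find $U$.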
Thus, the SVD finds an optimal $U$ for us.
PCA takes as input vectors $x_1,\ldots,x_N$ as well as a small positive integer $s$. PCA demeans the vectors and stores them in the columns of a matrix $X$, then simply computes the SVD $X = U \Sigma V^T$ and returns the first $s$ columns of $U$ as output.