The standard explanation of linear PCA begins with the covariance matrix.
That is, for a mean-centred dataset $D$ of dimension $N \times d$, the covariance matrix is $\Sigma = \frac{D^{T}D}{N}$, whose eigenvectors and corresponding eigenvalues are then determined.
Then $D' = DP$, where $P$ is the basis matrix whose columns are the eigenvectors of the covariance matrix $\Sigma$ of $D$.
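As a concrete sketch of these two steps (my own NumPy illustration, not from the text; the random data and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 3))
D = D - D.mean(axis=0)              # centre the data first

N = D.shape[0]
Sigma = D.T @ D / N                 # d x d covariance matrix

eigvals, P = np.linalg.eigh(Sigma)  # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]   # sort eigenvalues descending
eigvals, P = eigvals[order], P[:, order]

D_new = D @ P                       # project onto the eigenvector basis
```

After the projection, the variance of each column of `D_new` equals the corresponding eigenvalue of $\Sigma$.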
But to my understanding, the aforementioned $D^{T}D$ is the matrix product of $D^{T}$ and $D$.
The related $N \times N$ Gram matrix may be expressed as $S = DD^{T}$, with entries $S_{ij} = K(X_{i}, X_{j}) = \phi(X_{i}) \cdot \phi(X_{j})$, where the mapping function $\phi$ is generally not computed explicitly; for the linear kernel, $\phi$ is the identity and $S_{ij} = X_{i} \cdot X_{j}$. Indeed, linear PCA is just a special case of kernel PCA.
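To check that the two routes agree, here is a small NumPy sketch (my own illustration under assumed random data) comparing the embedding obtained from the $d \times d$ covariance matrix with the one obtained from the $N \times N$ Gram matrix $S = DD^{T}$:

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(50, 4))
D = D - D.mean(axis=0)
N, d = D.shape

# Covariance route: eigenvectors of (D^T D)/N, a d x d matrix
evals_c, P = np.linalg.eigh(D.T @ D / N)
proj_cov = D @ P[:, np.argsort(evals_c)[::-1]]

# Gram route: eigenvectors of S = D D^T, an N x N matrix
S = D @ D.T
evals_g, Q = np.linalg.eigh(S)
order = np.argsort(evals_g)[::-1]
evals_g, Q = evals_g[order], Q[:, order]
proj_gram = Q[:, :d] * np.sqrt(evals_g[:d])  # first d columns of Q·Omega
```

Each column of `proj_gram` matches the corresponding column of `proj_cov` up to sign (eigenvectors are only determined up to sign).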
The text I am reading says:
- Construct the positive semi-definite $N \times N$ kernel similarity matrix $S$ using an off-the-shelf kernel function or other similarity matrix computation.
Understanding: Am I correct to think that $S_{ij} = K(X_{i}, X_{j})$ where, say, $K(X_{i}, X_{j}) = (X_{i} \cdot X_{j} + c)^{h}$ is a polynomial kernel?
- Diagonalise the similarity matrix $S = Q \Omega^{2} Q^{T}$ and set the new $N \times k$ embedding $D'$ to the first $k$ columns of $Q\Omega$ corresponding to the largest eigenvalues. The default assumption is to set $k$ so that all eigenvectors of $S$ with non-zero eigenvalues are included. Note that $k$ can be larger than $d$ in the non-linear setting.
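If I have understood this step correctly, it can be sketched as below (my own NumPy illustration, not the author's code; the polynomial kernel with $c = 1$, $h = 2$, and the kernel-centring step $HSH$ are my assumptions):

```python
import numpy as np

def kernel_pca(D, kernel, k):
    """Embed N points into k dimensions from the N x N similarity matrix."""
    N = D.shape[0]
    S = kernel(D, D)                        # N x N kernel similarity matrix
    H = np.eye(N) - np.ones((N, N)) / N     # centre S in feature space
    S = H @ S @ H
    evals, Q = np.linalg.eigh(S)            # S = Q Omega^2 Q^T
    order = np.argsort(evals)[::-1][:k]     # k largest eigenvalues
    evals = np.clip(evals[order], 0, None)  # guard against tiny negatives
    return Q[:, order] * np.sqrt(evals)     # first k columns of Q·Omega

# assumed polynomial kernel (X_i . X_j + c)^h with c = 1, h = 2
poly = lambda A, B: (A @ B.T + 1.0) ** 2

rng = np.random.default_rng(2)
D = rng.normal(size=(30, 2))
emb = kernel_pca(D, poly, k=5)   # here k = 5 > d = 2, as the text allows
```

Note that the embedding is read directly off the eigendecomposition of $S$; the $d$-dimensional eigenvectors of a covariance matrix never appear, which is why $k$ can exceed $d$.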
Understanding: I am having difficulty following the author here; any help is appreciated.