Kernel PCA similarity matrix analogy


The standard explanation of linear PCA begins with the covariance matrix.

That is, for a mean-centered dataset $D$ of dimension $N \times d$, the covariance matrix is $\Sigma = \frac{D^{T}D}{N}$, whose eigenvectors and corresponding eigenvalues are then computed.

Then $D' = DP$, where $P$ is the basis matrix whose columns are the eigenvectors of the covariance matrix $\Sigma$ of $D$.
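To make the linear case concrete, here is a minimal numpy sketch of the recipe above (the variable names are my own, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
D = rng.normal(size=(N, d))
D = D - D.mean(axis=0)               # mean-center the data

Sigma = D.T @ D / N                  # d x d covariance matrix
eigvals, P = np.linalg.eigh(Sigma)   # columns of P are eigenvectors
order = np.argsort(eigvals)[::-1]    # sort by decreasing eigenvalue
eigvals, P = eigvals[order], P[:, order]

D_prime = D @ P                      # projected data D' = DP
```

In the new basis the features are decorrelated, i.e. the covariance of `D_prime` is the diagonal matrix of eigenvalues.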

But to my understanding, the aforementioned $D^{T}D$ is just the matrix product of $D^{T}$ and $D$.

The similarity matrix, on the other hand, is the $N \times N$ product $S = DD^{T}$, with entries $S_{ij} = K(X_{i}, X_{j}) = \phi(X_{i}) \cdot \phi(X_{j})$, which for the linear kernel reduces to $X_{i} \cdot X_{j}$; the mapping $\phi$ is generally never computed explicitly. In this sense, linear PCA is just a special case of kernel PCA.
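Note that $D^{T}D$ ($d \times d$) and $DD^{T}$ ($N \times N$) are different matrices, but they share the same non-zero eigenvalues, which is why the covariance route and the similarity-matrix route agree in the linear case. A quick numpy check of this (my own sketch, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
D = rng.normal(size=(N, d))
D = D - D.mean(axis=0)

# Covariance route: eigenvalues of D^T D (d x d)
cov_vals = np.linalg.eigvalsh(D.T @ D)

# Similarity route: eigenvalues of S = D D^T (N x N, linear kernel)
S = D @ D.T
gram_vals = np.linalg.eigvalsh(S)

# The d largest eigenvalues coincide; the remaining N - d are zero
top_cov = np.sort(cov_vals)[::-1][:d]
top_gram = np.sort(gram_vals)[::-1][:d]
```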

The text I am reading says

  1. Construct the positive semi-definite $N \times N$ kernel similarity matrix $S$ using an off-the-shelf kernel function or other similarity matrix computation.

Understanding: Am I correct in thinking that $S_{ij} = K(X_{i},X_{j})$ where, say, $K(X_{i},X_{j}) = (X_{i} \cdot X_{j}+c)^{h}$ is a polynomial kernel?
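If that reading is right, the polynomial kernel matrix could be formed like this (the values of $c$ and $h$ are illustrative choices of mine):

```python
import numpy as np

def poly_kernel_matrix(X, c=1.0, h=2):
    """S_ij = (X_i . X_j + c)^h over all pairs of rows of X."""
    return (X @ X.T + c) ** h

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))
S = poly_kernel_matrix(X)  # 10 x 10, symmetric, positive semi-definite
```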

  2. Diagonalise the similarity matrix $S = Q \Omega^{2}Q^{T}$ and set the new $N \times k$ embedding $D'$ to the first $k$ columns of $Q\Omega$, corresponding to the largest eigenvalues. The default assumption is to set $k$ so that all non-zero eigenvalues of $S$ are included. Note that $k$ can be larger than $d$ in the non-linear setting.

Understanding: I am having difficulty following what the author means in this step. Any help is appreciated.
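Here is my attempt at implementing this step in numpy, using a linear kernel so the result can be sanity-checked (a sketch under my own reading of the text, which may be where I go wrong):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 40, 3
D = rng.normal(size=(N, d))
D = D - D.mean(axis=0)

S = D @ D.T                          # linear-kernel similarity matrix
eigvals, Q = np.linalg.eigh(S)       # S = Q diag(eigvals) Q^T
order = np.argsort(eigvals)[::-1]    # largest eigenvalues first
eigvals, Q = eigvals[order], Q[:, order]

k = int(np.sum(eigvals > 1e-10))     # keep all non-zero eigenvalues
Omega = np.diag(np.sqrt(eigvals[:k]))
D_embed = Q[:, :k] @ Omega           # the N x k embedding D'
```

With the linear kernel, the embedding satisfies $D' D'^{T} = S$, i.e. the pairwise dot products of the embedded points reproduce the similarity matrix.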