PCA algorithm on low coverage features

Suppose I am using PCA on a traditional user-item matrix (one row per user, one column per feature), and I want to use PCA to reduce the feature dimension and use the compacted features for a two-class classification problem.

Suppose some features have very low coverage (very few users have the feature; for all others it is None), but these features have strong predictive power for classification (e.g. measured by mutual information). I am wondering whether the PCA algorithm will ignore such features because their coverage is very low. Thanks.

regards, Lin

There is 1 best solution below

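PCA will still give strong weights to these features, so long as they account for a large share of the variance in your data. Keep in mind that PCA is unsupervised: it looks only at how the features vary, not at the class labels.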

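To be more concrete, PCA is equivalent to the eigenvalue decomposition of the covariance matrix $C_{xx} = \frac{1}{n-1} X^T X$, where $X$ is the data matrix with one row per user ($n$ rows) and each column centered to zero mean.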

If your data varies strongly along certain directions, PCA will pick those directions out, providing you with a coordinate system whose axes are ordered by how much of the variance in your data each one explains. Your eigenvector matrix will provide your new basis.
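To make this concrete for a low-coverage feature, here is a small sketch on made-up data (not your data). It assumes the missing entries have been filled with 0, and the names `dense`, `present` and `top_loading` are just for illustration. It compares the weight the first principal component puts on a feature that is set for only 1% of users when its non-missing values are small versus large.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
dense = rng.normal(size=(n, 2))               # two features present for every user
present = (np.arange(n) < 10).astype(float)   # hypothetical feature set for only 1% of users

def top_loading(X):
    """Absolute loadings of the first principal component of X (rows = users)."""
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return np.abs(eigvecs[:, -1])             # eigh sorts ascending, so the last column is PC1

small = top_loading(np.column_stack([dense, present]))         # sparse values are 0 or 1
large = top_loading(np.column_stack([dense, 50.0 * present]))  # sparse values are 0 or 50
print(small[2], large[2])
```

In this setup the 0/1 version of the sparse feature has a variance of roughly 0.01 against roughly 1 for the dense features, so the first component should barely load on it, whereas the 0/50 version has the largest variance and should dominate the first component. In other words, what matters to PCA is variance, which low coverage tends to reduce.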

In MATLAB:

[V, D] = eig(cov(X)) 

should do the trick, where the columns of V are your new coordinate axes (the principal components) and the diagonal of D holds the variance along each of them.

I'm sure Numpy or Scipy has some equivalent functionality that I'm not aware of.
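For what it's worth, a rough NumPy counterpart of the MATLAB line might look like the sketch below. It assumes X has one row per user and one column per feature, with missing values already replaced by numbers; the helper name `pca_eig` and the random test matrix are made up for illustration.

```python
import numpy as np

def pca_eig(X):
    """Eigendecomposition of the feature covariance matrix (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # rowvar=False: rows are observations and columns are features, like MATLAB's cov(X).
    C = np.cov(X, rowvar=False)
    # eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(C)
    # Reorder so that the first column is the direction of largest variance.
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Made-up data: 200 users, 3 features with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])

eigvals, V = pca_eig(X)
scores = (X - X.mean(axis=0)) @ V   # the data expressed in the new basis
```

Keeping only the first k columns of `scores` gives the reduced feature set to pass to a classifier.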