This question is based on Multiclass Linear Discriminant Analysis (MLDA) as described in lecture slides by Olga Veksler, which is a generalization of Fisher's Linear Discriminant. My goal is to use MLDA to reduce dimensionality and transform multi-dimensional data into "well separated" clusters in lower dimensions.
A short review of the theory and mathematical formalism:
Suppose we have $N$ samples in $C$ classes (clusters). Class $\omega_c$ has $N_c$ samples, each $d$-dimensional, that is $x^{(c)}_{i} \in \mathbb{R}^d$ for $c = 1,\dots,C$ and $i = 1,\dots,N_c$. Each class $\omega_c$ has mean and covariance $$ \mu_c = \frac{1}{N_c} \sum_{x_i \in \omega_c} x_i, \qquad S_c = \frac{1}{N_c}\sum_{x_i \in \omega_c} (x_i - \mu_c) (x_i - \mu_c)^T $$ The "average" scatter within classes is defined as $$ S_w = \sum_{c=1}^C S_c \in \mathbb{R}^{d \times d} $$ and the scatter between classes is defined as $$ S_b = \sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)^T \in \mathbb{R}^{d \times d} $$ where $\mu = (1/N)\sum_{j=1}^N x_j$ is the mean over all the samples. To find the best separating dimensionality reduction, given by the projection $y = W^T x$, we maximize the objective function $$ J(W) = \frac{\det(W^T S_b W)}{\det(W^T S_w W)} $$ The idea is to maximize the scatter between classes while minimizing the scatter within each class. Differentiating the objective function with respect to $W$ yields the generalized eigenvalue problem $$ S_b W_c = J(W_c) S_w W_c = \lambda S_w W_c $$ where $J(W_c) \in \mathbb{R}$ is a scalar eigenvalue. From matrix-rank considerations ($S_b$ has rank at most $C-1$), there are at most $C-1$ eigenvectors with non-zero eigenvalues. The projection weights are the $C-1$ eigenvectors corresponding to the $C-1$ largest eigenvalues. Stacking the eigenvectors into a matrix gives the weight matrix $$ W = \left[ W_1 | \dots | W_{C-1} \right] \in \mathbb{R}^{d \times (C-1)} $$ Then the projection is $y = W^T x$ (dimension check: $(C-1) \times 1 = ((C-1) \times d) \cdot (d \times 1)$).
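The construction above can be sketched end-to-end in NumPy/SciPy (a minimal illustration on synthetic data; all variable names and the toy data are mine, not from the slides):

```python
import numpy as np
from scipy.linalg import eig

rng = np.random.default_rng(0)
# Synthetic data: C = 3 classes, d = 4 dimensions, 50 samples each
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(50, 4)) for m in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 50)

d = X.shape[1]
classes = np.unique(y)
C = len(classes)
mu = X.mean(axis=0)  # global mean

S_w = np.zeros((d, d))  # within-class scatter
S_b = np.zeros((d, d))  # between-class scatter
for c in classes:
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    S_w += (Xc - mu_c).T @ (Xc - mu_c) / len(Xc)       # class covariance S_c
    S_b += len(Xc) * np.outer(mu_c - mu, mu_c - mu)

# Generalized eigenvalue problem S_b w = lambda S_w w
D, V = eig(S_b, S_w)
idx = np.argsort(np.real(D))[::-1]
W = np.real(V[:, idx[:C - 1]])  # top C-1 eigenvectors

Y = X @ W  # projected samples, shape (N, C-1)
```

Since $S_b$ has rank at most $C-1$, only $C-1$ of the returned eigenvalues are (numerically) non-zero, which is why keeping the top $C-1$ eigenvectors loses nothing.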
This projection is called LDA (or MDA in the multiclass setting).
The new means after LDA are given by $\tilde{\mu}_c = W^T \mu_c$ and the covariances after LDA are given by $$ \tilde{S}_c = W^T S_c W \in \mathbb{R}^{(C-1) \times (C-1)} $$
Here I have a few questions:
- Is the after-LDA covariance matrix of each class, $\tilde{S}_c$, diagonal? The transformation $W^T S_c W$ looks like a whitening transformation, and if $W^T = W^{-1}$ it would be a diagonalization.
- How does one solve the generalized eigenvalue problem $S_b v = \lambda S_w v$? I tried different algorithms in Python, such as SciPy's `eig` and `eigh` functions with `D, V = eig(S_b, S_w)`; inverting $S_w$ and solving the ordinary problem $S_r v = S_w^{-1} S_b v = \lambda v$ with `D, V = eigh(S_r)`; and SVD with `V, D, Vh = svd(S_r)`. Each gives me different eigenvalues and eigenvectors.
- How can SVD be used to solve the generalized eigenvalue problem?
- If I use scikit-learn's `LinearDiscriminantAnalysis` class I also get different results, but the algorithm therein is more complicated and includes a shrinkage covariance estimator and prior terms.
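On the SVD question: since $S_w$ is symmetric positive semi-definite, its SVD coincides with its eigendecomposition, $S_w = U\,\mathrm{diag}(s)\,U^T$. One can therefore build a (pseudo-)inverse square root $S_w^{-1/2} = U\,\mathrm{diag}(s^{-1/2})\,U^T$, reduce the generalized problem to the ordinary symmetric problem $S_w^{-1/2} S_b S_w^{-1/2} v = \lambda v$, and map back with $w = S_w^{-1/2} v$. A sketch (the function name and tolerance are my own choices):

```python
import numpy as np
from scipy.linalg import svd, eigh

def gen_eig_via_svd(S_b, S_w, rtol=1e-10):
    """Solve S_b w = lam * S_w w via the SVD of S_w (a sketch).

    Singular values of S_w below rtol * s.max() are dropped, which also
    handles a singular within-class scatter (a pseudo-inverse square root).
    """
    U, s, _ = svd(S_w)                        # symmetric PSD: SVD == eigendecomposition
    keep = s > rtol * s.max()
    W_half = U[:, keep] / np.sqrt(s[keep])    # columns of S_w^{-1/2} (reduced)
    M = W_half.T @ S_b @ W_half               # symmetric ordinary eigenproblem
    lam, V = eigh(M)                          # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]            # reorder to descending
    return lam, W_half @ V                    # eigenvalues and back-mapped eigenvectors
```

As for the "different results" observation: eigenvectors are only defined up to scale (and ordering), so different solvers can legitimately return different-looking columns while spanning the same subspace; compare eigenvalues and subspaces, not raw eigenvector entries.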
The best results were obtained with

```python
import numpy as np
from scipy.linalg import pinv, eig

# Solve for eigenvectors by (pseudo-)inverting cov_within
inv_cov_within = pinv(cov_within)
# The product below is not symmetric in general, so eig (not eigh) is appropriate
D, V = eig(inv_cov_within @ cov_between)

# Sort according to eigenvalues in descending order
idx = np.argsort(np.real(D))[::-1]
D = D[idx]
W = np.real(V[:, idx])

# The weights are (n_features, n_classes - 1)
weights = W[:, : n_classes - 1]
```
I used `pinv` because the within-class scatter may not be invertible (e.g. when $d \gg N$, more dimensions than samples), so I use the pseudo-inverse. Using SVD may also solve this issue (how?).
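On the "how": the pseudo-inverse is itself computed from the SVD by inverting only the singular values above a cutoff, which is exactly what makes it usable when $S_w$ is rank-deficient. A quick sketch on a synthetic rank-deficient matrix (my own construction, with a cutoff convention similar to SciPy's):

```python
import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(3)
# A rank-3 "within" scatter in 5 dimensions (fewer effective samples than dims)
A = rng.normal(size=(5, 3))
S_w = A @ A.T                       # 5x5 but rank 3 -> not invertible

U, s, Vh = svd(S_w)
cutoff = max(S_w.shape) * np.finfo(float).eps * s.max()
s_inv = np.zeros_like(s)
s_inv[s > cutoff] = 1.0 / s[s > cutoff]   # invert only the "real" singular values
S_w_pinv = (Vh.T * s_inv) @ U.T     # Moore-Penrose pseudo-inverse from the SVD
```

The truncation of near-zero singular values is what replaces the impossible inversion; the result satisfies the Moore-Penrose identities even though $S_w^{-1}$ does not exist.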
Hypothesis for question 1: the after-LDA class covariances $\tilde{S}_c$ are diagonal only if $\forall c : S_c = S$, i.e. all the classes share the same covariance matrix.