Optimal $\mathbf{W}$'s columns are the generalized eigenvectors in $\mathbf{S}_\text{B}\mathbf{v}_i = \lambda_i \mathbf{S}_\text{W} \mathbf{v}_i$


I am trying to understand Linear Discriminant Analysis (a.k.a. Fisher's discriminant analysis).

I am reading this technical report and Christopher Bishop's Pattern Recognition and Machine Learning (section 4.1.4, p.187).

Some definitions first:

  • A data point $\mathbf{x}_n$ is defined as a column vector with $M$ components.
  • $\mathbf{X}$ is defined as a $(M, N)$ matrix where columns are the data points $\mathbf{x}_n$.
  • $K$ is the number of classes.
  • $\mathbf{S}_\text{B}$: between-class covariance of the data before projection.
  • $\mathbf{S}_\text{W}$: within-class covariance of the data before projection.
  • $\mathbf{s}_\text{B}$: between-class covariance of the projected data.
  • $\mathbf{s}_\text{W}$: within-class covariance of the projected data.

The general idea in Fisher's discriminant analysis is to project the data points in $\mathbf{X}$ onto a $(K - 1)$-dimensional space in a way that maximizes $\mathbf{s}_\text{B}$ and minimizes $\mathbf{s}_\text{W}$.

In the technical report, the discrimination criterion is defined as

$$J(\mathbf{W}) = \frac{|\mathbf{s}_\text{B}|}{|\mathbf{s}_\text{W}|} = \frac{|\mathbf{W}^\mathsf{T}\mathbf{S}_\text{B}\mathbf{W}|}{|\mathbf{W}^\mathsf{T}\mathbf{S}_\text{W}\mathbf{W}|} = \frac{\text{det}(\mathbf{W}^\mathsf{T}\mathbf{S}_\text{B}\mathbf{W})}{\text{det}(\mathbf{W}^\mathsf{T}\mathbf{S}_\text{W}\mathbf{W})}$$

Question $(1)$ arises here: why is maximizing the ratio of the determinants a good criterion for maximizing $\mathbf{s}_\text{B}$ and minimizing $\mathbf{s}_\text{W}$? (There seems to be a basic linear algebra notion I am missing.)

Now, right after defining this criterion, they say that the columns of an optimal $\mathbf{W}$ are the generalized eigenvectors $\mathbf{v}_i$ that correspond to the nonzero eigenvalues $\lambda_i$ in

$$\mathbf{S}_\text{B} \mathbf{v}_i = \lambda_i \mathbf{S}_\text{W} \mathbf{v}_i,$$

subject to the normalization constraint

$$\mathbf{v}_i^\mathsf{T}\mathbf{S}_\text{W}\mathbf{v}_i = 1.$$

Now arises question $(2)$: why is this true?
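(For concreteness, the claim can at least be checked numerically: `scipy.linalg.eigh` solves exactly this generalized eigenproblem and applies exactly this normalization. A minimal sketch with made-up covariance matrices, not taken from the report:)

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Made-up covariance matrices, purely illustrative:
M = 4
A = rng.normal(size=(M, M))
S_W = A @ A.T + M * np.eye(M)   # within-class covariance (positive definite)
B = rng.normal(size=(M, 2))
S_B = B @ B.T                   # between-class covariance, rank 2 (K = 3)

# scipy's eigh solves the generalized problem S_B v = lambda S_W v and
# returns eigenvectors normalized so that V^T S_W V = I.
eigvals, V = eigh(S_B, S_W)
assert np.allclose(V.T @ S_W @ V, np.eye(M))
assert np.allclose(S_B @ V, S_W @ V @ np.diag(eigvals))

# An optimal W keeps the eigenvectors with the largest (nonzero) eigenvalues.
W = V[:, np.argsort(eigvals)[::-1][:2]]
```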

To find the optimal $\mathbf{W}$, I guess we need to solve $\frac{\partial J(\mathbf{W})}{\partial\mathbf{W}} = 0$ for $\mathbf{W}$. I started by first noting (from the Matrix Cookbook) that

$$\frac{\partial\text{det}(\mathbf{W}^\mathsf{T}\mathbf{A}\mathbf{W})}{\partial \mathbf{W}} = 2 \text{det}(\mathbf{W}^\mathsf{T}\mathbf{A}\mathbf{W})\mathbf{A}\mathbf{W}(\mathbf{W}^\mathsf{T}\mathbf{A}\mathbf{W})^{-1}$$

for any symmetric matrix $\mathbf{A}$ ($\mathbf{S}_\text{B}$ and $\mathbf{S}_\text{W}$ are indeed symmetric, since they are covariance matrices).

This leads me to the following

$$\frac{\partial J(\mathbf{W})}{\partial\mathbf{W}} = \frac{2|\mathbf{W}^\mathsf{T} \mathbf{S}_\text{B}\mathbf{W}|\{ \mathbf{S}_\text{B}\mathbf{W}(\mathbf{W}^\mathsf{T} \mathbf{S}_\text{B}\mathbf{W})^{-1} - \mathbf{S}_\text{W}\mathbf{W}(\mathbf{W}^\mathsf{T} \mathbf{S}_\text{W}\mathbf{W})^{-1} \}} {|\mathbf{W}^\mathsf{T} \mathbf{S}_\text{W}\mathbf{W}|}$$

But then I'm not sure where I am going and would have liked some insight!
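For what it's worth, the derivative expression above can at least be validated against central finite differences; a quick numerical sketch with made-up symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
M, k = 4, 2
A = rng.normal(size=(M, M)); S_B = A @ A.T                  # symmetric PSD
B = rng.normal(size=(M, M)); S_W = B @ B.T + M * np.eye(M)  # symmetric PD
W = rng.normal(size=(M, k))

def J(W):
    return np.linalg.det(W.T @ S_B @ W) / np.linalg.det(W.T @ S_W @ W)

def grad_J(W):
    # The closed-form derivative from the quotient rule above.
    SB, SW = W.T @ S_B @ W, W.T @ S_W @ W
    return (2 * np.linalg.det(SB) / np.linalg.det(SW)
            * (S_B @ W @ np.linalg.inv(SB) - S_W @ W @ np.linalg.inv(SW)))

# Central finite differences, entry by entry.
eps, num = 1e-6, np.zeros_like(W)
for i in range(M):
    for j in range(k):
        E = np.zeros_like(W)
        E[i, j] = eps
        num[i, j] = (J(W + E) - J(W - E)) / (2 * eps)

assert np.linalg.norm(num - grad_J(W)) <= 1e-5 * np.linalg.norm(grad_J(W))
```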


There are 2 best solutions below


So I am back with answers to my questions, in case it can help someone who would stumble upon this. In the end, as I guessed in the question, it's all about linear algebra notions/intuitions I was missing.

  1. The determinant of an $(n \times n)$ matrix $\mathbf{A} = [\mathbf{a}_1 \enspace \mathbf{a}_2 \enspace \cdots \enspace \mathbf{a}_n]$, where $\mathbf{a}_i$ are its column vectors, has a geometric interpretation: its absolute value is the $n$-dimensional volume of the $n$-parallelotope spanned by the column vectors of $\mathbf{A}$.

    Plus, a known property of the determinant is

    $$\det(\mathbf{A}) = \prod_{i=1}^n \lambda_i$$ where $\lambda_i$ are the eigenvalues of $\mathbf{A}$.

    The determinant of a covariance matrix is often called the generalized variance, a useful real-valued statistic representing the dispersion of the data, as a function of the variances and covariances of the different features in the dataset. The bigger the variances, the bigger the determinant; the stronger the covariances (correlations between features), the smaller the determinant.

    Coming back to our initial problem: why does maximizing $\frac{\det( \mathbf{s}_\mathrm{B})}{\det(\mathbf{s}_\mathrm{W})}$ make sense? It is now clear that maximizing $\det(\mathbf{s}_\mathrm{B})$ spreads out the class means (the clusters formed by regrouping the points belonging to the same class are pushed apart), while minimizing $\det(\mathbf{s}_\mathrm{W})$ minimizes the dispersion of the points within each class. Note that we are talking about the variances and covariances of the projected data, and that $\mathbf{s}_\mathrm{B} = \mathbf{W}^\mathrm{T}\mathbf{S}_\mathrm{B}\mathbf{W}$ and $\mathbf{s}_\mathrm{W} = \mathbf{W}^\mathrm{T}\mathbf{S}_\mathrm{W}\mathbf{W}$.

  2. Work in progress
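To make the generalized-variance intuition in point 1 concrete, here is a small numerical sketch with made-up covariance matrices:

```python
import numpy as np

# Generalized variance: the determinant of a covariance matrix.
# With unit variances, the determinant shrinks as correlation grows.
for rho in (0.0, 0.5, 0.9):
    C = np.array([[1.0, rho], [rho, 1.0]])
    print(rho, np.linalg.det(C))  # det = 1 - rho**2

# Bigger variances give a bigger determinant.
assert np.linalg.det(np.diag([2.0, 3.0])) > np.linalg.det(np.eye(2))

# det(A) is the product of the eigenvalues.
C = np.array([[2.0, 0.3], [0.3, 1.0]])
assert np.isclose(np.linalg.det(C), np.prod(np.linalg.eigvalsh(C)))
```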


I was also typing this up for myself - sorry if some of the notations don't align perfectly
Let $S_w,S_b \in \mathcal{S}_+^m$, where $S_w$ is invertible and $\text{rank}\,S_b=K-1 \leq m$, so $S_b$ might not be invertible. For a fixed $k \leq K-1$, we'd like to solve \begin{equation} \text{arg}\max_{Q \in \mathbb{R}^{m \times k}}\frac{\det Q^\top S_bQ}{\det Q^\top S_wQ} \tag{1} \end{equation}

Note that WLOG we may use the substitutions $\tilde{Q}\leftarrow S_w^{1/2}Q$, $\tilde{S}_w \leftarrow I_m$, and $\tilde{S}_b \leftarrow S_w^{-1/2}S_bS_w^{-1/2}$. After these reassignments, we drop the tildes for convenience, and problem (1) becomes $\text{arg}\max_Q \det(Q^\top S_b Q)/ \det(Q^\top Q)$, or equivalently \begin{equation} \text{arg}\max_{Q \in \mathbb{R}^{m \times k}}\{ \log \circ \det(Q^\top S_bQ)- \log \circ \det(Q^\top Q) \} \tag{2} \end{equation}

For convenience, denote the objective $J(Q):= \log \circ \det(Q^\top S_bQ)- \log \circ \det(Q^\top Q)$.
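The whitening substitution above leaves the objective unchanged; a quick numerical check of that invariance (with made-up $S_w$, $S_b$, and $Q$):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
m, k = 5, 2
A = rng.normal(size=(m, m)); S_w = A @ A.T + m * np.eye(m)  # positive definite
B = rng.normal(size=(m, 2)); S_b = B @ B.T                  # rank 2
Q = rng.normal(size=(m, k))

S_w_half = np.real(sqrtm(S_w))                # symmetric square root of S_w
Q_t = S_w_half @ Q                            # Q-tilde
S_b_t = np.linalg.inv(S_w_half) @ S_b @ np.linalg.inv(S_w_half)  # S_b-tilde

# Original objective (1) vs. the whitened version: identical.
orig = np.linalg.det(Q.T @ S_b @ Q) / np.linalg.det(Q.T @ S_w @ Q)
white = np.linalg.det(Q_t.T @ S_b_t @ Q_t) / np.linalg.det(Q_t.T @ Q_t)
assert np.isclose(orig, white)
```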

Lemma: (Matrix derivatives) For $A \in \mathbb{R}^{m \times m}$, if $f(A):= \log \circ \det(A)$, then $f'(A)(H)= \text{tr}(A^{-1}H)$. For $Q \in \mathbb{R}^{m \times k}$, if $f(Q):=Q^\top AQ$, then $f'(Q)(H)=Q^\top AH+H^\top AQ$.

For a proof of this Lemma, see here. Using these derivatives and the Chain Rule, we can differentiate the objective of (2): $$\begin{split} J'(Q)(H)&=\text{tr}\{ (Q^\top S_bQ)^{-1}(Q^\top S_bH+H^\top S_bQ) \} -\text{tr}\{ (Q^\top Q)^{-1}(Q^\top H+H^\top Q) \} \\ &= 2 \text{tr}\{ (Q^\top S_bQ)^{-1}Q^\top S_bH- (Q^\top Q)^{-1}Q^\top H \} \end{split}$$ In the second line, we used that the trace is linear and that $\text{tr}(A^\top)= \text{tr}(A)$. To satisfy the first-order condition $J'(Q)(H)= 0$ for all $H$, it's enough to have \begin{equation} (Q^\top S_bQ)^{-1}Q^\top S_b=(Q^\top Q)^{-1}Q^\top \tag{3} \end{equation} Claim: To satisfy (3), it's enough to have the columns of $Q$ be unit eigenvectors of $S_b$ with positive eigenvalues.

To see this, say $Q=[q_1, \ldots,q_k]$, with eigenvalues $\lambda_1, \ldots, \lambda_k>0$. Then $q_i^\top S_bq_j= \lambda_i 1(i=j)$ and $q_i^\top q_j=1(i=j)$, so both sides of (3) simplify to $Q^\top$. At this value of $Q$, we have $J(Q)= \sum_{j=1}^k \log \lambda_j$. To maximize this, we should take the $k$ largest eigenvalues and their eigenvectors. However, since we made transformations at the start to go from problem (1) to (2) (add tildes where appropriate to make the distinction clear), a solution of (1) is $Q=S_w^{-1/2}\tilde{Q}$. Since $\tilde{S}_b \tilde{Q}= \tilde{Q}D$, where $D:= \text{diag}( \lambda_1, \ldots, \lambda_k)$, left-multiplying both sides by $S_w^{-1/2}$ shows precisely that $S_w^{-1}S_bQ=QD$. So to solve (1), we just need the eigendecomposition of $S_w^{-1}S_b$.
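Numerically, this conclusion can be checked by comparing the spectrum of $S_w^{-1}S_b$ with that of the generalized problem $S_b v = \lambda S_w v$; a small sketch with made-up matrices:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
m, K = 5, 3
A = rng.normal(size=(m, m)); S_w = A @ A.T + m * np.eye(m)  # positive definite
B = rng.normal(size=(m, K - 1)); S_b = B @ B.T              # rank K - 1

# Spectrum of S_w^{-1} S_b (real, since S_w is positive definite) ...
vals = np.sort(np.real(np.linalg.eigvals(np.linalg.inv(S_w) @ S_b)))

# ... matches that of the generalized problem S_b v = lambda S_w v.
gvals = np.sort(eigh(S_b, S_w, eigvals_only=True))
assert np.allclose(vals, gvals, atol=1e-8)
```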

Remark This matrix isn't symmetric in general, but it is diagonalizable (link). Interestingly, $A,B \in \mathcal{S}^m$ doesn't imply $AB$ is diagonalizable (link): one of them must be positive definite.