All, I am trying to understand the convex-optimization step from Eqn. 3 to Eqn. 4 in the kernel Fisher LDA paper. The authors state that the proof of equivalence is straightforward but too lengthy to present. Could someone derive it formally and present it here? Thank you very much!
Following @Lutz Lehmann's comment, I describe the variables and identify the key steps made in the paper below. I hope this makes it easier for us to derive the formal equivalence. Please feel free to correct any errors and make edits for clarity as you see fit.
- Training data: $\{x_i : i = 1,\dots,l\}$.
- Training labels: $\{y_i : i = 1,\dots,l\}$, $y_i \in \{-1,1\}$, corresponding to each of the two classes.
- $\textbf{1}$ is the vector of all ones; $\textbf{1}_{1}, \textbf{1}_{2}$ are binary $(0,1)$ indicator vectors for the two classes.
Standard Fisher LDA solves $\operatorname{argmax}_w \frac{w^T S_B w}{w^T S_W w}$. Here $S_B = (m_1 - m_2)(m_1 - m_2)^T$ is the between-class scatter matrix, with $m_k$ the mean feature vector of class $k$, and $S_W = \sum_{k=1,2} \sum_{i \in I_k} (x_i - m_k)(x_i - m_k)^T$ is the sum of the within-class scatter matrices, where $I_k$ indexes the training data of class $k$.
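To make the setup concrete, here is a small numpy sketch (toy Gaussian data of my own, not from the paper) that builds $S_W$ and checks the well-known closed form $w \propto S_W^{-1}(m_1 - m_2)$, which follows because $S_B$ has rank one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy Gaussian classes in R^2 (illustrative data, not from the paper).
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(40, 2))
X2 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(30, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter S_W = sum over classes of sum_i (x_i - m_k)(x_i - m_k)^T.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Since S_B = (m1 - m2)(m1 - m2)^T has rank one, the maximizer of the
# Rayleigh quotient has the closed form w ∝ S_W^{-1}(m1 - m2).
w = np.linalg.solve(S_W, m1 - m2)

def rayleigh(v):
    d = m1 - m2
    return (v @ d) ** 2 / (v @ S_W @ v)

# w should beat any other direction on the Fisher criterion.
for _ in range(100):
    v = rng.normal(size=2)
    assert rayleigh(w) >= rayleigh(v) - 1e-12
```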
The authors next use the representation $w = \sum_i \alpha_i \Phi(x_i)$, where $\Phi$ maps the data into feature space and the kernel function is defined as $K(x_i, x_j) := \Phi(x_i)^T \Phi(x_j)$. The projection of a test point $x$ onto the discriminant is then $w^T \Phi(x) = \sum_i \alpha_i K(x_i, x)$.
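As a quick sanity check that this representation works as stated, here is a sketch using the homogeneous quadratic kernel $K(x,z) = (x^T z)^2$ (my choice, because its feature map is known explicitly) to verify $w^T\Phi(x) = \sum_i \alpha_i K(x_i, x)$ numerically:

```python
import numpy as np

rng = np.random.default_rng(1)

# Homogeneous quadratic kernel on R^2; its explicit feature map
# Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) satisfies Phi(x)·Phi(z) = (x·z)^2.
def k(x, z):
    return (x @ z) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

X = rng.normal(size=(5, 2))       # "training" points x_i
alpha = rng.normal(size=5)        # arbitrary expansion coefficients
x_test = rng.normal(size=2)

w = sum(a * phi(xi) for a, xi in zip(alpha, X))   # w = sum_i alpha_i Phi(x_i)

proj_primal = w @ phi(x_test)                      # w^T Phi(x)
proj_kernel = sum(a * k(xi, x_test) for a, xi in zip(alpha, X))
assert np.isclose(proj_primal, proj_kernel)
```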
With this representation, the objective can be rewritten in terms of $\alpha$ as $J(\alpha) = \frac{\alpha^T M \alpha}{\alpha^T N \alpha}$, where $M = \mu \mu^T$ and $N = KK^T - \sum_{k=1,2} l_k \mu_k \mu_k^T$, with $\mu = \mu_2 - \mu_1$ and $\mu_k = \frac{1}{l_k} K \textbf{1}_k$. Here $K$ is the $l \times l$ kernel matrix with entries $K_{ij} = K(x_i, x_j)$, and $l_k = |I_k|$ is the number of training points in class $k$.
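The following sketch (toy data and an RBF kernel, both my own choices) confirms the structure of $M$ and $N$: for any $\alpha$, $\alpha^T N \alpha$ is exactly the within-class scatter of the projections $K\alpha$, and $\alpha^T M \alpha = (\alpha^T \mu)^2$, so $M$ has rank one:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data and an RBF kernel (any PSD kernel works for this identity check).
l1, l2 = 15, 10
X = np.vstack([rng.normal(0, 1, (l1, 2)), rng.normal(2, 1, (l2, 2))])
l = l1 + l2
one1 = np.r_[np.ones(l1), np.zeros(l2)]    # indicator of class 1
one2 = 1.0 - one1                          # indicator of class 2

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                      # kernel matrix K_ij = K(x_i, x_j)

mu1 = K @ one1 / l1
mu2 = K @ one2 / l2
mu = mu2 - mu1
M = np.outer(mu, mu)
N = K @ K.T - l1 * np.outer(mu1, mu1) - l2 * np.outer(mu2, mu2)

alpha = rng.normal(size=l)
out = K @ alpha                            # projections of the training points
# alpha^T N alpha is exactly the within-class scatter of the projections...
scatter = sum(((out[c == 1] - out[c == 1].mean()) ** 2).sum()
              for c in (one1, one2))
assert np.isclose(alpha @ N @ alpha, scatter)
# ...and alpha^T M alpha = (alpha^T mu)^2, consistent with M being rank one.
assert np.isclose(alpha @ M @ alpha, (alpha @ mu) ** 2)
```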
Since $M$ has rank one, $\alpha^T M \alpha = (\alpha^T \mu)^2$, so the scale of $\alpha$ is free; the authors fix it with the normalization $\alpha^T(\mu_2 - \mu_1) = 2$ and rewrite $\operatorname{argmax}_{\alpha} J(\alpha)$ equivalently as $\operatorname{argmin}_{\alpha} \alpha^T N \alpha$ subject to $\alpha^T(\mu_2 - \mu_1) = 2$. Eqn. 3 in the paper appends a regularizing term $C\,P(\alpha)$ to this objective, leading to:
$$\operatorname{argmin}_{\alpha}\; \alpha^T N \alpha + C\,P(\alpha) \quad \text{s.t.}\quad \alpha^T(\mu_2 - \mu_1) = 2$$
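For the specific choice $P(\alpha) = \|\alpha\|^2$ (my assumption; the paper allows other regularizers), Eqn. 3 is an equality-constrained quadratic program with a closed form: the Lagrangian gives $2(N + CI)\alpha = \lambda\mu$, and the constraint $\alpha^T\mu = 2$ fixes $\lambda$. The sketch below computes this solution and checks it against feasible perturbations:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data, RBF kernel, and P(alpha) = ||alpha||^2 (my choices, for illustration).
l1, l2 = 12, 8
X = np.vstack([rng.normal(0, 1, (l1, 2)), rng.normal(2, 1, (l2, 2))])
l = l1 + l2
one1 = np.r_[np.ones(l1), np.zeros(l2)]
one2 = 1.0 - one1
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
mu1, mu2 = K @ one1 / l1, K @ one2 / l2
mu = mu2 - mu1
N = K @ K.T - l1 * np.outer(mu1, mu1) - l2 * np.outer(mu2, mu2)

C = 0.1
# From the Lagrangian: alpha = (lambda/2)(N + C I)^{-1} mu; the constraint
# alpha^T mu = 2 determines lambda, giving the closed-form minimizer:
z = np.linalg.solve(N + C * np.eye(l), mu)
alpha = 2.0 * z / (mu @ z)
assert np.isclose(alpha @ mu, 2.0)          # feasibility

def objective(a):
    return a @ N @ a + C * (a @ a)

# Strict convexity: any feasible perturbation (delta orthogonal to mu)
# must not decrease the objective.
for _ in range(50):
    d = rng.normal(size=l)
    d -= (d @ mu) / (mu @ mu) * mu          # project onto the constraint surface
    assert objective(alpha + 0.1 * d) >= objective(alpha) - 1e-9
```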
- Things become less clear to me after this step. The authors write out a convex optimization problem in Eqn. 4 as:
$$\operatorname{argmin}_{\alpha,\xi,b}\; \xi^T\xi + C\,P(\alpha) \quad \text{s.t.}\quad K\alpha + b\textbf{1} = y + \xi,\quad \textbf{1}_k^T \xi = 0,\; k\in\{1,2\}$$
Here $b$ is a scalar bias, $y$ is the vector of labels, and $\xi \in \mathbb{R}^l$ is a vector of slack variables.
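Here is a numerical check of the claimed equivalence (again assuming $P(\alpha) = \|\alpha\|^2$): solve Eqn. 3 in closed form, solve Eqn. 4 through its KKT system, and compare the resulting $\alpha$'s. The key observations are that the constraints $\textbf{1}_k^T\xi = 0$ force each class-mean output to equal its label, hence $\alpha^T(\mu_2 - \mu_1) = 2$, and that under those constraints $\xi^T\xi$ equals the within-class scatter of $K\alpha$, i.e. $\alpha^T N \alpha$:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data, RBF kernel, P(alpha) = ||alpha||^2 (my choices; the paper states
# the equivalence for general convex P).
l1, l2 = 12, 8
X = np.vstack([rng.normal(0, 1, (l1, 2)), rng.normal(2, 1, (l2, 2))])
l = l1 + l2
one1 = np.r_[np.ones(l1), np.zeros(l2)]
one2 = 1.0 - one1
y = one2 - one1                        # class 1 -> -1, class 2 -> +1
ones = np.ones(l)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
mu1, mu2 = K @ one1 / l1, K @ one2 / l2
mu = mu2 - mu1
N = K @ K.T - l1 * np.outer(mu1, mu1) - l2 * np.outer(mu2, mu2)
C = 0.1

# Eqn. 3: min a^T N a + C a^T a  s.t.  a^T mu = 2  (closed form).
z = np.linalg.solve(N + C * np.eye(l), mu)
alpha3 = 2.0 * z / (mu @ z)

# Eqn. 4: min ||xi||^2 + C||a||^2  s.t.  K a + b 1 = y + xi, 1_k^T xi = 0.
# Eliminating xi leaves an equality-constrained QP in (a, b); its KKT
# conditions form one linear system in (a, b, lambda_1, lambda_2).
KKT = np.zeros((l + 3, l + 3))
rhs = np.zeros(l + 3)
KKT[:l, :l] = 2 * K @ K + 2 * C * np.eye(l)   # d/da of objective
KKT[:l, l] = 2 * K @ ones
KKT[:l, l + 1] = K @ one1                      # constraint gradients
KKT[:l, l + 2] = K @ one2
rhs[:l] = 2 * K @ y
KKT[l, :l] = 2 * ones @ K                      # d/db of objective
KKT[l, l], KKT[l, l + 1], KKT[l, l + 2] = 2 * l, l1, l2
rhs[l] = 2 * ones @ y
KKT[l + 1, :l], KKT[l + 1, l] = one1 @ K, l1   # 1_1^T(K a + b 1 - y) = 0
rhs[l + 1] = one1 @ y
KKT[l + 2, :l], KKT[l + 2, l] = one2 @ K, l2   # 1_2^T(K a + b 1 - y) = 0
rhs[l + 2] = one2 @ y
sol = np.linalg.solve(KKT, rhs)
alpha4, b = sol[:l], sol[l]

# The class-mean constraints force alpha^T(mu2 - mu1) = 2, and then
# xi^T xi = alpha^T N alpha, so the two minimizers coincide.
assert np.isclose(alpha4 @ mu, 2.0)
assert np.allclose(alpha3, alpha4, atol=1e-6)
```

Note how this mirrors the intuition in the paper: the two constraints pin the class-mean outputs to $\pm 1$, which both recovers the normalization of Eqn. 3 and turns the least-squares residual $\xi^T\xi$ into exactly the within-class scatter $\alpha^T N \alpha$.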
- Through Proposition 1, the authors claim that Eqns. 3 and 4 (restated above) are equivalent, and that the formal proof, while straightforward, is lengthy. They offer some intuition above and below Eqn. 4. I would greatly appreciate any insights you could share! Thanks a lot!