All, I am trying to understand the convex-optimization step from Eqn. 3 to Eqn. 4 in the kernel Fisher LDA paper. The authors state that the proof of equivalence is straightforward but too lengthy to present. Could someone derive it formally and present it here? Thank you very much!
Following @Lutz Lehmann's comment, I describe the variables and identify the key steps made in the paper below. I hope this makes it easier for us to derive the formal equivalence. Please feel free to correct any errors and make edits for clarity as you see fit.
- Training data: $\{x_i : i = 1,\dots,l\}$.
- Training labels: $\{y_i : i = 1,\dots,l\}$, $y_i \in \{-1,1\}$, corresponding to each of the two classes.
- $\textbf{1}$ is the vector of all ones; $\textbf{1}_{1}, \textbf{1}_{2}$ are binary $(0,1)$ indicator vectors for the two classes.
Standard Fisher LDA solves $\operatorname{argmax}_w \frac{w^T S_B w}{w^T S_W w}$. Here $S_B = (m_1 - m_2)(m_1 - m_2)^T$ is the between-class scatter matrix, with $m_k$ the mean feature vector of class $k$, and $S_W = \sum_{k=1,2} \sum_{i \in I_k} (x_i - m_k)(x_i - m_k)^T$ is the sum of the within-class scatter matrices, where $I_k$ indexes the training data of class $k$.
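To make the setup concrete, here is a small numpy sketch (toy Gaussian data of my own, not from the paper) that builds $S_W$ and checks the well-known closed form $w \propto S_W^{-1}(m_1 - m_2)$, which follows because $S_B$ has rank one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy Gaussian classes in R^2 (illustrative data, not from the paper).
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(40, 2))
X2 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(30, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter S_W = sum over classes of sum_i (x_i - m_k)(x_i - m_k)^T.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Since S_B = (m1 - m2)(m1 - m2)^T has rank one, the maximizer of the
# Rayleigh quotient has the closed form w ∝ S_W^{-1}(m1 - m2).
w = np.linalg.solve(S_W, m1 - m2)

def rayleigh(v):
    d = m1 - m2
    return (v @ d) ** 2 / (v @ S_W @ v)

# w should beat any other direction on the Fisher criterion.
for _ in range(100):
    v = rng.normal(size=2)
    assert rayleigh(w) >= rayleigh(v) - 1e-12
```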
The authors next use the representation $w = \sum_i \alpha_i \Phi(x_i)$, where $\Phi$ maps the data into feature space and the kernel function is defined as $K(x_i, x_j) := \Phi(x_i)^T \Phi(x_j)$. The projection of a test point $x$ onto the discriminant is then $w^T \Phi(x) = \sum_i \alpha_i K(x_i, x)$.
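As a quick sanity check that this representation works as stated, here is a sketch using the homogeneous quadratic kernel $K(x,z) = (x^T z)^2$ (my choice, because its feature map is known explicitly) to verify $w^T\Phi(x) = \sum_i \alpha_i K(x_i, x)$ numerically:

```python
import numpy as np

rng = np.random.default_rng(1)

# Homogeneous quadratic kernel on R^2; its explicit feature map
# Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) satisfies Phi(x)·Phi(z) = (x·z)^2.
def k(x, z):
    return (x @ z) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

X = rng.normal(size=(5, 2))       # "training" points x_i
alpha = rng.normal(size=5)        # arbitrary expansion coefficients
x_test = rng.normal(size=2)

w = sum(a * phi(xi) for a, xi in zip(alpha, X))   # w = sum_i alpha_i Phi(x_i)

proj_primal = w @ phi(x_test)                      # w^T Phi(x)
proj_kernel = sum(a * k(xi, x_test) for a, xi in zip(alpha, X))
assert np.isclose(proj_primal, proj_kernel)
```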
With this representation, the objective can be rewritten in terms of $\alpha$ as $J(\alpha) = \frac{\alpha^T M \alpha}{\alpha^T N \alpha}$, where $M = \mu \mu^T$ and $N = KK^T - \sum_{k=1,2} l_k \mu_k \mu_k^T$, with $\mu = \mu_2 - \mu_1$ and $\mu_k = \frac{1}{l_k} K \textbf{1}_k$. Here $K$ is the $l \times l$ kernel matrix with entries $K_{ij} = K(x_i, x_j)$, and $l_k = |I_k|$ is the number of training points in class $k$.
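The following sketch (toy data and an RBF kernel, both my own choices) confirms the structure of $M$ and $N$: for any $\alpha$, $\alpha^T N \alpha$ is exactly the within-class scatter of the projections $K\alpha$, and $\alpha^T M \alpha = (\alpha^T \mu)^2$, so $M$ has rank one:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data and an RBF kernel (any PSD kernel works for this identity check).
l1, l2 = 15, 10
X = np.vstack([rng.normal(0, 1, (l1, 2)), rng.normal(2, 1, (l2, 2))])
l = l1 + l2
one1 = np.r_[np.ones(l1), np.zeros(l2)]    # indicator of class 1
one2 = 1.0 - one1                          # indicator of class 2

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                      # kernel matrix K_ij = K(x_i, x_j)

mu1 = K @ one1 / l1
mu2 = K @ one2 / l2
mu = mu2 - mu1
M = np.outer(mu, mu)
N = K @ K.T - l1 * np.outer(mu1, mu1) - l2 * np.outer(mu2, mu2)

alpha = rng.normal(size=l)
out = K @ alpha                            # projections of the training points
# alpha^T N alpha is exactly the within-class scatter of the projections...
scatter = sum(((out[c == 1] - out[c == 1].mean()) ** 2).sum()
              for c in (one1, one2))
assert np.isclose(alpha @ N @ alpha, scatter)
# ...and alpha^T M alpha = (alpha^T mu)^2, consistent with M being rank one.
assert np.isclose(alpha @ M @ alpha, (alpha @ mu) ** 2)
```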
Since $M$ has rank one, $\alpha^T M \alpha = (\alpha^T \mu)^2$, so the scale of $\alpha$ is free; the authors fix it with the normalization $\alpha^T(\mu_2 - \mu_1) = 2$ and rewrite $\operatorname{argmax}_{\alpha} J(\alpha)$ equivalently as $\operatorname{argmin}_{\alpha} \alpha^T N \alpha$ subject to $\alpha^T(\mu_2 - \mu_1) = 2$. Eqn. 3 in the paper appends a regularizing term $C\,P(\alpha)$ to this objective, leading to:
$$\operatorname{argmin}_{\alpha}\; \alpha^T N \alpha + C\,P(\alpha) \quad \text{s.t.}\quad \alpha^T(\mu_2 - \mu_1) = 2$$
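For the specific choice $P(\alpha) = \|\alpha\|^2$ (my assumption; the paper allows other regularizers), Eqn. 3 is an equality-constrained quadratic program with a closed form: the Lagrangian gives $2(N + CI)\alpha = \lambda\mu$, and the constraint $\alpha^T\mu = 2$ fixes $\lambda$. The sketch below computes this solution and checks it against feasible perturbations:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data, RBF kernel, and P(alpha) = ||alpha||^2 (my choices, for illustration).
l1, l2 = 12, 8
X = np.vstack([rng.normal(0, 1, (l1, 2)), rng.normal(2, 1, (l2, 2))])
l = l1 + l2
one1 = np.r_[np.ones(l1), np.zeros(l2)]
one2 = 1.0 - one1
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
mu1, mu2 = K @ one1 / l1, K @ one2 / l2
mu = mu2 - mu1
N = K @ K.T - l1 * np.outer(mu1, mu1) - l2 * np.outer(mu2, mu2)

C = 0.1
# From the Lagrangian: alpha = (lambda/2)(N + C I)^{-1} mu; the constraint
# alpha^T mu = 2 determines lambda, giving the closed-form minimizer:
z = np.linalg.solve(N + C * np.eye(l), mu)
alpha = 2.0 * z / (mu @ z)
assert np.isclose(alpha @ mu, 2.0)          # feasibility

def objective(a):
    return a @ N @ a + C * (a @ a)

# Strict convexity: any feasible perturbation (delta orthogonal to mu)
# must not decrease the objective.
for _ in range(50):
    d = rng.normal(size=l)
    d -= (d @ mu) / (mu @ mu) * mu          # project onto the constraint surface
    assert objective(alpha + 0.1 * d) >= objective(alpha) - 1e-9
```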
- Things become less clear to me after this step. The authors write out a convex optimization problem in Eqn. 4 as:
$$\operatorname{argmin}_{\alpha,\xi,b}\; \xi^T\xi + C\,P(\alpha) \quad \text{s.t.}\quad K\alpha + b\textbf{1} = y + \xi,\quad \textbf{1}_k^T \xi = 0,\; k\in\{1,2\}$$
Here $b$ is a scalar bias, $y$ is the vector of labels, and $\xi \in \mathbb{R}^l$ is a vector of slack variables.
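Here is a numerical check of the claimed equivalence (again assuming $P(\alpha) = \|\alpha\|^2$): solve Eqn. 3 in closed form, solve Eqn. 4 through its KKT system, and compare the resulting $\alpha$'s. The key observations are that the constraints $\textbf{1}_k^T\xi = 0$ force each class-mean output to equal its label, hence $\alpha^T(\mu_2 - \mu_1) = 2$, and that under those constraints $\xi^T\xi$ equals the within-class scatter of $K\alpha$, i.e. $\alpha^T N \alpha$:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data, RBF kernel, P(alpha) = ||alpha||^2 (my choices; the paper states
# the equivalence for general convex P).
l1, l2 = 12, 8
X = np.vstack([rng.normal(0, 1, (l1, 2)), rng.normal(2, 1, (l2, 2))])
l = l1 + l2
one1 = np.r_[np.ones(l1), np.zeros(l2)]
one2 = 1.0 - one1
y = one2 - one1                        # class 1 -> -1, class 2 -> +1
ones = np.ones(l)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
mu1, mu2 = K @ one1 / l1, K @ one2 / l2
mu = mu2 - mu1
N = K @ K.T - l1 * np.outer(mu1, mu1) - l2 * np.outer(mu2, mu2)
C = 0.1

# Eqn. 3: min a^T N a + C a^T a  s.t.  a^T mu = 2  (closed form).
z = np.linalg.solve(N + C * np.eye(l), mu)
alpha3 = 2.0 * z / (mu @ z)

# Eqn. 4: min ||xi||^2 + C||a||^2  s.t.  K a + b 1 = y + xi, 1_k^T xi = 0.
# Eliminating xi leaves an equality-constrained QP in (a, b); its KKT
# conditions form one linear system in (a, b, lambda_1, lambda_2).
KKT = np.zeros((l + 3, l + 3))
rhs = np.zeros(l + 3)
KKT[:l, :l] = 2 * K @ K + 2 * C * np.eye(l)   # d/da of objective
KKT[:l, l] = 2 * K @ ones
KKT[:l, l + 1] = K @ one1                      # constraint gradients
KKT[:l, l + 2] = K @ one2
rhs[:l] = 2 * K @ y
KKT[l, :l] = 2 * ones @ K                      # d/db of objective
KKT[l, l], KKT[l, l + 1], KKT[l, l + 2] = 2 * l, l1, l2
rhs[l] = 2 * ones @ y
KKT[l + 1, :l], KKT[l + 1, l] = one1 @ K, l1   # 1_1^T(K a + b 1 - y) = 0
rhs[l + 1] = one1 @ y
KKT[l + 2, :l], KKT[l + 2, l] = one2 @ K, l2   # 1_2^T(K a + b 1 - y) = 0
rhs[l + 2] = one2 @ y
sol = np.linalg.solve(KKT, rhs)
alpha4, b = sol[:l], sol[l]

# The class-mean constraints force alpha^T(mu2 - mu1) = 2, and then
# xi^T xi = alpha^T N alpha, so the two minimizers coincide.
assert np.isclose(alpha4 @ mu, 2.0)
assert np.allclose(alpha3, alpha4, atol=1e-6)
```

Note how this mirrors the intuition in the paper: the two constraints pin the class-mean outputs to $\pm 1$, which both recovers the normalization of Eqn. 3 and turns the least-squares residual $\xi^T\xi$ into exactly the within-class scatter $\alpha^T N \alpha$.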
- Through Proposition 1, the authors claim that Eqns. 3 and 4 (restated above) are equivalent, and that the formal proof, while straightforward, is lengthy. They offer some intuition above and below Eqn. 4. I would greatly appreciate any insights you could share! Thanks a lot!