I've seen this a couple of times in machine learning lectures, for example in the context of LDA, when looking at the Fisher criterion. It can be expressed in two ways:
$$J(w) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$
And, formulated in the input space:
$$ J(w) = \frac{w^TS_Bw}{w^TS_Ww}$$ where $S_B$ is the between-class covariance matrix and $S_W$ is the within-class covariance matrix. I understand that we can project a point $p$ into our new space like this: $w^Tp$, and I guess something similar happens here. But what exactly is the intuition behind this calculation: $w^TS_Bw$? It looks like it yields large values when $w$ is perpendicular to the first eigenvector of our covariance matrix. But why? I'm struggling to fully understand how to reconstruct this equation.
Thanks!
Let $\mu_1$ and $\mu_2$ be the means of the classes in the original space and let $m_1=w^T\mu_1$ and $m_2=w^T\mu_2$ be the means in the projected space.
$$S_b=N_1(\mu_1 - \mu)(\mu_1-\mu)^T+N_2(\mu_2-\mu)(\mu_2-\mu)^T$$
where the mean is $\mu=\frac{N_1\mu_1+N_2\mu_2}{N_1+N_2}$
Hence $\mu_1-\mu=\frac{N_2}{N_1+N_2}(\mu_1-\mu_2)$ and $\mu_2-\mu=\frac{N_1}{N_1+N_2}(\mu_2-\mu_1)$.
Substituting these back gives $$S_b=\frac{N_1N_2}{N_1+N_2}(\mu_1-\mu_2)(\mu_1-\mu_2)^T.$$ The scalar factor $\frac{N_1N_2}{N_1+N_2}$ does not change which $w$ maximizes $J(w)$, so we can drop it and take $$S_b=(\mu_1-\mu_2)(\mu_1-\mu_2)^T.$$ Then
$$w^TS_bw=(w^T\mu_1-w^T\mu_2)^2=(m_1-m_2)^2$$
Similarly, the within-class scatter is $$S_w=\sum_{j=1}^2\sum_{i=1}^{N_j}(x_{ij}-\mu_j)(x_{ij}-\mu_j)^T,$$ so $$w^TS_ww=\sum_{j=1}^2\sum_{i=1}^{N_j}(w^Tx_{ij}-m_j)^2=s_1^2+s_2^2,$$ where $s_j^2$ is the scatter of class $j$ after projection.
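If it helps, you can verify both identities numerically. Here is a small NumPy sketch with made-up two-class data and an arbitrary projection direction $w$ (all names and values are just for illustration):

```python
import numpy as np

# Hypothetical 2-D data for two classes, plus an arbitrary direction w.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(40, 2))  # class 1 samples
X2 = rng.normal(loc=[3.0, 1.0], scale=1.5, size=(60, 2))  # class 2 samples
w = np.array([1.0, -0.5])                                 # projection direction

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Between-class scatter (constant factor dropped, as above)
Sb = np.outer(mu1 - mu2, mu1 - mu2)

# Within-class scatter: sum of per-class scatter matrices
Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

# Projected means and projected per-class scatters
m1, m2 = w @ mu1, w @ mu2
s1_sq = np.sum((X1 @ w - m1) ** 2)
s2_sq = np.sum((X2 @ w - m2) ** 2)

print(np.isclose(w @ Sb @ w, (m1 - m2) ** 2))  # True
print(np.isclose(w @ Sw @ w, s1_sq + s2_sq))   # True
```

Both checks hold for any $w$, since $w^TS_bw=(w^T(\mu_1-\mu_2))^2$ and $w^TS_ww$ is the sum of squared deviations of the projected points from their projected class means.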