How can maximizing $x^T A x$ where $A$ is positive semi-definite be reduced to maximizing $\operatorname{tr}(x^T A x)$?


Suppose $A$ is a given matrix of shape $n \times n$, and $x$ is some unknown matrix of shape $n \times m$, the objective is

$$ \begin{array}{rl} & \max_x x^T A x \\ \text{subject to } & \lambda_i(A) \ge 0 \text{ for all } 1 \le i \le n \\ & \|x_i\|_2 = 1 \text{ where $x_i$ is the $i^{\text{th}}$ column of $x$} \end{array} $$

where the objective is an $m \times m$ matrix.

My question is: can it be reduced to

$$ \begin{array}{rl} & \max_x \operatorname{tr}(x^T A x) \\ \text{subject to } & \lambda_i(A) \ge 0 \text{ for all } 1 \le i \le n \\ & \|x_i\|_2 = 1 \text{ where $x_i$ is the $i^{\text{th}}$ column of $x$} \end{array} $$

where the objective is now a scalar.
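For what it's worth, the trace version alone is easy to analyze: $\operatorname{tr}(x^T A x) = \sum_{i=1}^m x_i^T A x_i$, and under $\|x_i\|_2 = 1$ each summand is at most $\lambda_{\max}(A)$. A quick numpy check of this (my own sketch; it doesn't by itself settle the matrix-vs-trace question):

```python
# Numerical check (my own sketch): with only the unit-column constraint,
# tr(x^T A x) = sum_i x_i^T A x_i, each summand <= lambda_max(A), so the
# maximum is m * lambda_max(A), attained by putting the top eigenvector
# of A in every column of x.
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3

B = rng.standard_normal((n, n))
A = B @ B.T                                # positive semi-definite

eigvals, eigvecs = np.linalg.eigh(A)       # eigenvalues in ascending order
lam_max = eigvals[-1]
x = np.tile(eigvecs[:, [-1]], (1, m))      # every column = top eigenvector

assert np.isclose(np.trace(x.T @ A @ x), m * lam_max)

# No random unit-column matrix beats this value
for _ in range(500):
    z = rng.standard_normal((n, m))
    z /= np.linalg.norm(z, axis=0)         # normalize each column
    assert np.trace(z.T @ A @ z) <= m * lam_max + 1e-9
```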

This question came up as I was learning Fisher's discriminant analysis in the multi-class setting, where the objective is to find a matrix $W$ that projects $X$ onto a low-dimensional space so that the between-class covariance is maximized and the within-class covariance is minimized. Most books give the following criterion,

$$ \max_W \frac{tr(W^T S_b W)}{tr(W^T S_w W)} $$

where $S_b$ is the between-class scatter matrix and $S_w$ is the within-class scatter matrix. I believe the scatter matrices are used here instead of the covariance estimates because they differ only by a positive constant factor.
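As an aside on how this criterion is typically handled in practice: rather than maximizing the trace ratio directly, most treatments reduce it to the generalized eigenproblem $S_b w = \lambda S_w w$. A minimal numpy sketch of that approach, whitening $S_w$ first (the synthetic data and variable names are my own):

```python
# Sketch (my own): multi-class Fisher directions from the generalized
# eigenproblem S_b w = lambda * S_w w, solved by whitening S_w.
import numpy as np

rng = np.random.default_rng(1)
d, k, n_per = 4, 3, 50                      # feature dim, classes, samples/class
means = 3.0 * rng.standard_normal((k, d))
X = np.vstack([mu + rng.standard_normal((n_per, d)) for mu in means])
y = np.repeat(np.arange(k), n_per)

overall = X.mean(axis=0)
Sw = np.zeros((d, d))                       # within-class scatter
Sb = np.zeros((d, d))                       # between-class scatter
for c in range(k):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    Sb += len(Xc) * np.outer(mc - overall, mc - overall)

# Whiten S_w, then diagonalize the whitened S_b
w_vals, U = np.linalg.eigh(Sw)
Sw_inv_sqrt = U @ np.diag(w_vals ** -0.5) @ U.T
lams, P = np.linalg.eigh(Sw_inv_sqrt @ Sb @ Sw_inv_sqrt)
W = Sw_inv_sqrt @ P[:, -(k - 1):]           # top k-1 discriminant directions
```

The columns of $W$ are the usual LDA directions; at most $k-1$ of them carry signal, since $S_b$ has rank at most $k-1$.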

But I don't understand why we can formulate the criterion using a trace when what we want to maximize is a matrix. In other words, I don't know the proof of this reduction.

UPDATE: What I've tried: in the context of Fisher's discriminant analysis, I think maximizing a covariance matrix directly amounts to maximizing the Frobenius norm of that matrix: since we want everything to covary as much as possible, every element of that matrix should be large in absolute value. But this is not rigorous enough for my liking, and even if it's correct, it only leads to

$$ \max_x x^T A x \iff \max_x \|x^T A x\|_F \iff \max_x \sqrt{tr(x^T A^T x x^T A x)} $$
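At least the last equivalence is just the definition $\|M\|_F = \sqrt{\operatorname{tr}(M^T M)}$, which is easy to verify numerically (my own sketch):

```python
# Numerical check (my own) of ||M||_F = sqrt(tr(M^T M)) with M = x^T A x.
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 2
B = rng.standard_normal((n, n))
A = B @ B.T                                # positive semi-definite
x = rng.standard_normal((n, m))
x /= np.linalg.norm(x, axis=0)             # unit-norm columns

M = x.T @ A @ x
lhs = np.linalg.norm(M, 'fro')
rhs = np.sqrt(np.trace(x.T @ A.T @ x @ x.T @ A @ x))
assert np.isclose(lhs, rhs)
```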

1 Answer

I just came across a similar problem of maximizing a Frobenius norm, as follows (with one more constraint, that the data is zero-mean):

Given a set of zero-mean data $\boldsymbol{X} \in \mathbb{R}^{N \times D}$ with covariance matrix $\boldsymbol{\Gamma}_{\boldsymbol{X}} \in \mathbb{R}^{D \times D}$, we are allowed to add two outliers $\boldsymbol{x}_{1}$ and $\boldsymbol{x}_{2}$ to $\boldsymbol{X}$, forming a new data matrix $\widehat{\boldsymbol{X}} \in \mathbb{R}^{(N+2) \times D}$. The two outliers satisfy $\left\|\boldsymbol{x}_{1}\right\|_{2}=\left\|\boldsymbol{x}_{2}\right\|_{2}=1$. The task is to keep $\widehat{\boldsymbol{X}}$ zero-mean while maximizing the difference between its covariance matrix $\boldsymbol{\Gamma}_{\widehat{\boldsymbol{X}}}$ and the original $\boldsymbol{\Gamma}_{\boldsymbol{X}}$, i.e., $\max \left\|\boldsymbol{\Gamma}_{\widehat{\boldsymbol{X}}}-\boldsymbol{\Gamma}_{\boldsymbol{X}}\right\|_{F}^{2}$.

What I've thought about is:

The zero-mean condition forces the two outliers to be $\boldsymbol{x}$ and $-\boldsymbol{x}$ for some row vector $\boldsymbol{x}$, so $$ \hat{\boldsymbol{X}}^{T}=\left(\boldsymbol{X}^{T}, \boldsymbol{x}^{T},-\boldsymbol{x}^{T}\right), $$ and the unit-norm constraint reads $$ \|\boldsymbol{x}\|_{2}^{2}=\boldsymbol{x} \boldsymbol{x}^{T}=1. $$ We conclude that $$ \begin{aligned} \Gamma_{\hat{\boldsymbol{X}}} &=\frac{1}{N+1} \hat{\boldsymbol{X}}^{T} \hat{\boldsymbol{X}} \\ &=\frac{1}{N+1}\left(\boldsymbol{X}^{T} \boldsymbol{X}+2 \boldsymbol{x}^{T} \boldsymbol{x}\right) \\ &=\frac{1}{N+1}\left[(N-1) \Gamma_{\boldsymbol{X}}+2 \boldsymbol{x}^{T} \boldsymbol{x}\right], \end{aligned} $$ so $\Gamma_{\hat{\boldsymbol{X}}}-\Gamma_{\boldsymbol{X}}=\frac{2}{N+1}\left(\boldsymbol{x}^{T} \boldsymbol{x}-\Gamma_{\boldsymbol{X}}\right)$ and $$ \begin{aligned} \underset{\|\boldsymbol{x}\|_{2}=1}{\arg \max }\left\|\Gamma_{\hat{\boldsymbol{X}}}-\Gamma_{\boldsymbol{X}}\right\|_{F}^{2} &=\underset{\|\boldsymbol{x}\|_{2}=1}{\arg \max }\left\|\frac{2}{N+1}\left(\boldsymbol{x}^{T} \boldsymbol{x}-\Gamma_{\boldsymbol{X}}\right)\right\|_{F}^{2} \\ &=\underset{\|\boldsymbol{x}\|_{2}=1}{\arg \max }\left\|\boldsymbol{x}^{T} \boldsymbol{x}-\Gamma_{\boldsymbol{X}}\right\|_{F}^{2}. \end{aligned} $$
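The covariance update implied by these lines, $\Gamma_{\hat{\boldsymbol{X}}}-\Gamma_{\boldsymbol{X}}=\frac{2}{N+1}(\boldsymbol{x}^{T}\boldsymbol{x}-\Gamma_{\boldsymbol{X}})$, can be checked numerically (my own sketch, using the unbiased estimator $\Gamma = X^T X / (\#\text{samples}-1)$):

```python
# Numerical check (my own sketch): appending x and -x to zero-mean X gives
# Gamma_hat - Gamma_X = (2 / (N + 1)) * (x^T x - Gamma_X),
# with the unbiased estimator Gamma = X^T X / (#samples - 1).
import numpy as np

rng = np.random.default_rng(3)
N, D = 40, 3
X = rng.standard_normal((N, D))
X -= X.mean(axis=0)                        # enforce zero mean
Gamma_X = X.T @ X / (N - 1)

x = rng.standard_normal(D)
x /= np.linalg.norm(x)                     # the outlier, ||x||_2 = 1
X_hat = np.vstack([X, x, -x])              # still zero-mean, N + 2 samples
Gamma_hat = X_hat.T @ X_hat / (N + 1)      # N + 2 samples, divisor N + 1

assert np.allclose(Gamma_hat - Gamma_X,
                   (2.0 / (N + 1)) * (np.outer(x, x) - Gamma_X))
```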

In fact, by the symmetry of $\Gamma_{\boldsymbol{X}}$, we know

$$ \begin{aligned} &\underset{\|\boldsymbol{x}\|_{2}=1}{\arg \max }\left\|\boldsymbol{x}^{T} \boldsymbol{x}-\Gamma_{\boldsymbol{X}}\right\|_{F}^{2} \\ =&\underset{\|\boldsymbol{x}\|_{2}=1}{\arg \max }\left\{\operatorname{tr}\left(\left(\boldsymbol{x}^{T} \boldsymbol{x}-\Gamma_{\boldsymbol{X}}\right)^{T}\left(\boldsymbol{x}^{T} \boldsymbol{x}-\Gamma_{\boldsymbol{X}}\right)\right)\right\} \\ =&\underset{\|\boldsymbol{x}\|_{2}=1}{\arg \max }\left\{\operatorname{tr}\left(\boldsymbol{x}^{T} \boldsymbol{x}\right)-\operatorname{tr}\left(\Gamma_{\boldsymbol{X}} \boldsymbol{x}^{T} \boldsymbol{x}\right)-\operatorname{tr}\left(\boldsymbol{x}^{T} \boldsymbol{x} \Gamma_{\boldsymbol{X}}\right)\right\}, \end{aligned} $$ where the constant term $\operatorname{tr}\left(\Gamma_{\boldsymbol{X}}^{T} \Gamma_{\boldsymbol{X}}\right)$ has been dropped, and $\operatorname{tr}\left(\boldsymbol{x}^{T} \boldsymbol{x}\, \boldsymbol{x}^{T} \boldsymbol{x}\right)=\operatorname{tr}\left(\boldsymbol{x}^{T} \boldsymbol{x}\right)$ because $\boldsymbol{x} \boldsymbol{x}^{T}=1$.

$$ \begin{aligned} &\underset{\|\boldsymbol{x}\|_{2}=1}{\arg \max }\left\{\operatorname{tr}\left(\boldsymbol{x}^{T} \boldsymbol{x}\right)-\operatorname{tr}\left(\Gamma_{\boldsymbol{X}} \boldsymbol{x}^{T} \boldsymbol{x}\right)-\operatorname{tr}\left(\boldsymbol{x}^{T} \boldsymbol{x} \Gamma_{\boldsymbol{X}}\right)\right\} \\ &=\underset{\|\boldsymbol{x}\|_{2}=1}{\arg \max }\left\{\operatorname{tr}\left(\boldsymbol{x} \boldsymbol{x}^{T}\right)-2 \operatorname{tr}\left(\boldsymbol{x} \Gamma_{\boldsymbol{X}} \boldsymbol{x}^{T}\right)\right\} \\ &=\underset{\|\boldsymbol{x}\|_{2}=1}{\arg \max }\left\{1-2\, \boldsymbol{x} \Gamma_{\boldsymbol{X}} \boldsymbol{x}^{T}\right\} \\ &=\underset{\|\boldsymbol{x}\|_{2}=1}{\arg \min }\left\{\boldsymbol{x} \Gamma_{\boldsymbol{X}} \boldsymbol{x}^{T}\right\}. \end{aligned} $$ Because $\Gamma_{\boldsymbol{X}}$ is positive definite, apply the eigendecomposition $$ \Gamma_{\boldsymbol{X}}=\boldsymbol{V} \boldsymbol{\Lambda} \boldsymbol{V}^{T}, \qquad \boldsymbol{V} \boldsymbol{V}^{T}=\boldsymbol{I}. $$ So $$ \underset{\|\boldsymbol{x}\|_{2}=1}{\arg \min }\left\{\boldsymbol{x} \Gamma_{\boldsymbol{X}} \boldsymbol{x}^{T}\right\}=\underset{\|\boldsymbol{x}\|_{2}=1}{\arg \min }\left\{\boldsymbol{x} \boldsymbol{V} \boldsymbol{\Lambda} \boldsymbol{V}^{T} \boldsymbol{x}^{T}\right\}. $$ Let $\boldsymbol{y}=\boldsymbol{x} \boldsymbol{V}$; then $$ \|\boldsymbol{y}\|_{2}^{2}=\boldsymbol{y} \boldsymbol{y}^{T}=\boldsymbol{x} \boldsymbol{V} \boldsymbol{V}^{T} \boldsymbol{x}^{T}=1. $$
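The endpoint of this chain, that minimizing $\boldsymbol{x} \Gamma_{\boldsymbol{X}} \boldsymbol{x}^{T}$ over unit vectors picks out the smallest eigenvalue and its eigenvector, can also be verified numerically (my own sketch):

```python
# Numerical check (my own sketch): min over unit x of x Gamma x^T equals
# the smallest eigenvalue of Gamma, attained at its eigenvector.
import numpy as np

rng = np.random.default_rng(4)
D = 4
B = rng.standard_normal((D, D))
Gamma = B @ B.T + np.eye(D)                # positive definite

vals, V = np.linalg.eigh(Gamma)            # eigenvalues in ascending order
sigma, v_min = vals[0], V[:, 0]

assert np.isclose(v_min @ Gamma @ v_min, sigma)

# Random unit vectors never go below sigma
for _ in range(500):
    u = rng.standard_normal(D)
    u /= np.linalg.norm(u)
    assert u @ Gamma @ u >= sigma - 1e-9
```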

Let $\sigma$ be the smallest eigenvalue in $\boldsymbol{\Lambda}$. For any unit-norm $\boldsymbol{y}$, $$ \boldsymbol{y} \boldsymbol{\Lambda} \boldsymbol{y}^{T}=\sum_{i=1}^{D} \lambda_{i} \boldsymbol{y}_{i}^{2} \geqslant \sum_{i=1}^{D} \sigma \boldsymbol{y}_{i}^{2}=\sigma, $$ so $$ \underset{\|\boldsymbol{x}\|_{2}=1}{\arg \min }\left\{\boldsymbol{x} \boldsymbol{V} \boldsymbol{\Lambda} \boldsymbol{V}^{T} \boldsymbol{x}^{T}\right\}=\underset{\|\boldsymbol{y}\|_{2}=1}{\arg \min }\left\{\boldsymbol{y} \boldsymbol{\Lambda} \boldsymbol{y}^{T}\right\} $$ with minimum value $\sigma$. Assuming that $\sigma=\lambda_{k}$, the optimal solution is:

$$ \boldsymbol{y}_{i}= \begin{cases}1 & i=k \\ 0 & i \neq k\end{cases} $$ and $$ \boldsymbol{x}=\boldsymbol{y} \boldsymbol{V}^{-1}=\boldsymbol{y} \boldsymbol{V}^{T}. $$

So, maybe you could simplify your optimization objective by exploiting the unit 2-norm constraint on the columns of $\boldsymbol{x}$, or by converting the core expression to a trace, as above?