Reading through a paper on matrix factorizations, the authors use the following inequality (attributed to Von Neumann) to prove an intermediate result:
For all $X, Y ∈ S_d^{++}$ (d-dimensional positive definite matrices), $tr(XY ) ≥ \sum_i \lambda_i \mu_i$, where where $\lambda_i$ and $\mu_i$ are the eigenvalues of $X$ and $Y$ in nonincreasing and nondecreasing order, respectively. The equality is attained whenever $X = UDiag(\lambda)U^T , Y = UDiag(\mu)U^T$ for some orthonormal $U$.
The authors reference this paper as proof A. W. Marshall and I. Olkin. Inequalities: Theory of Majorization and its Applications . Academic Press, 1979. I don't have access to this publication - and it's not clear to me why this should be true.
I would appreciate if anyone could provide a proof for this statement.
Thanks.
I apologize for the piece of the following that is a little informal, but I suspect you will get the drift. The two positive definite matrices can always be written as $$\overline {\underline {\bf{X}} } = \underline {\overline {\bf{U}} } \,\underline {\overline {\bf{\Lambda }} } \,{\underline {\overline {\bf{U}} } ^T}{\rm{ and }}\overline {\underline {\bf{Y}} } = \underline {\overline {\bf{V}} } \,\underline {\overline {\bf{{\rm M}}} } \,{\underline {\overline {\bf{V}} } ^T}$$ where $\underline {\overline {\bf{U}} }$ and $\underline {\overline {\bf{V}} }$ are orthonormal, while $$\underline {\overline {\bf{\Lambda }} } = \left[ {\begin{array}{*{20}{c}} {{\lambda _1}}& \cdots &0\\ \vdots & \ddots & \vdots \\ 0& \cdots &{{\lambda _D}} \end{array}} \right]{\rm{ and }}\underline {\overline {\bf{{\rm M}}} } = \left[ {\begin{array}{*{20}{c}} {{\mu _1}}& \cdots &0\\ \vdots & \ddots & \vdots \\ 0& \cdots &{{\mu _D}} \end{array}} \right]$$ with all ${\lambda _i}$ and ${\mu _i}$ positive and arranged in non-increasing order. Then the properties of the trace operation permit one to write $$\begin{array}{l} {\rm{tr}}\left( {\overline {\underline {\bf{X}} } \,\overline {\underline {\bf{Y}} } } \right) = {\rm{tr}}\left( {\underline {\overline {\bf{U}} } \,\underline {\overline {\bf{\Lambda }} } \,{{\underline {\overline {\bf{U}} } }^T}\underline {\overline {\bf{V}} } \,\underline {\overline {\bf{{\rm M}}} } \,{{\underline {\overline {\bf{V}} } }^T}} \right) = {\rm{tr}}\left( {{{\underline {\overline {\bf{V}} } }^T}\underline {\overline {\bf{U}} } \,\underline {\overline {\bf{\Lambda }} } \,{{\underline {\overline {\bf{U}} } }^T}\underline {\overline {\bf{V}} } \,\underline {\overline {\bf{{\rm M}}} } } \right)\\ {\rm{ }} = {\rm{tr}}\left( {{{\underline {\overline {\bf{W}} } }^T}\underline {\overline {\bf{\Lambda }} } \,\underline {\overline {\bf{W}} } \,\underline {\overline {\bf{{\rm M}}} } } \right) = {\rm{tr}}\left( {{{\left( {\underline {\overline {\bf{\Lambda }} } \,\underline {\overline {\bf{W}} } } \right)}^T}\underline {\overline {\bf{W}} } \,\underline {\overline {\bf{{\rm M}}} } } \right) \end{array}$$ where $$\underline {\overline {\bf{W}} } = {\underline {\overline {\bf{U}} } ^T}\underline {\overline {\bf{V}} } = \left[ {\begin{array}{*{20}{c}} {{{\underline {\bf{w}} }_1}}& \cdots &{{{\underline {\bf{w}} }_D}} \end{array}} \right]$$ remains orthonormal. A further property of the trace operation is $${\rm{tr}}\left( {{{\overline {\underline {\bf{A}} } }^T}\overline {\underline {\bf{B}} } } \right) = {\rm{vec}}{\left( {\overline {\underline {\bf{B}} } } \right)^T}{\rm{vec}}\left( {\overline {\underline {\bf{A}} } } \right)$$ where ${\rm{vec}}\left( \bullet \right)$ denotes the operation of rearranging the $D\,x\,D$ matrix into a ${D^2}\,x\,1$ vector by concatenating matrix columns, so that $${\rm{tr}}\left( {\overline {\underline {\bf{X}} } \,\overline {\underline {\bf{Y}} } } \right) = {\left[ {\begin{array}{*{20}{c}} {{\mu _1}{{\underline {\bf{w}} }_1}}\\ \vdots \\ {{\mu _D}{{\underline {\bf{w}} }_D}} \end{array}} \right]^T}\left[ {\begin{array}{*{20}{c}} {\underline {\overline {\bf{\Lambda }} } \,{{\underline {\bf{w}} }_1}}\\ \vdots \\ {\underline {\overline {\bf{\Lambda }} } \,{{\underline {\bf{w}} }_D}} \end{array}} \right] = \sum\limits_{i = 1}^D {{\mu _i}\underline {\bf{w}} _i^T\underline {\overline {\bf{\Lambda }} } \,{{\underline {\bf{w}} }_i}}$$ Now, remembering that each ${\underline {\bf{w}} _i}$ is a column of an orthonormal matrix, it is not terribly difficult to see that the largest possible value of the trace is when the ${\underline {\bf{w}} _i}$ are chosen to pair the value of ${\mu _i}$ completely with the corresponding ${\lambda _i}$, since they are both in non-increasing order. Similarly, after some thought, it is clear that the smallest possible value of the trace occurs when the ${\underline {\bf{w}} _i}$ are chosen to pair the value of ${\mu _i}$ completely with the corresponding ${\lambda _{D - i}}$ (i.e., in reverse order), so that $$\sum\limits_{i = 1}^D {{\mu _i}{\lambda _{D - i}}} \le {\rm{tr}}\left( {\overline {\underline {\bf{X}} } \,\overline {\underline {\bf{Y}} } } \right) \le \sum\limits_{i = 1}^D {{\mu _i}{\lambda _i}}$$ The maximum result occurs when $$\underline {\overline {\bf{W}} } = \underline {\overline {\bf{I}} } {\rm{ }} \Rightarrow {\rm{ }}\underline {\overline {\bf{V}} } = \underline {\overline {\bf{U}} } {\rm{ }} \Rightarrow {\rm{ }}\overline {\underline {\bf{X}} } = \underline {\overline {\bf{U}} } \,\underline {\overline {\bf{\Lambda }} } \,{\underline {\overline {\bf{U}} } ^T}{\rm{ and }}\overline {\underline {\bf{Y}} } = \underline {\overline {\bf{U}} } \,\underline {\overline {\bf{{\rm M}}} } \,{\underline {\overline {\bf{U}} } ^T}$$ The minimum result occurs when $$\underline {\overline {\bf{W}} } = \overline {\underline {\bf{J}} } = \left[ {\begin{array}{*{20}{c}} 0& \cdots &1\\ \vdots & {\mathinner{\mkern2mu\raise1pt\hbox{.}\mkern2mu \raise4pt\hbox{.}\mkern2mu\raise7pt\hbox{.}\mkern1mu}} & \vdots \\ 1& \cdots &0 \end{array}} \right]{\rm{ }} \Rightarrow {\rm{ }}\underline {\overline {\bf{V}} } = \underline {\overline {\bf{U}} } \,\overline {\underline {\bf{J}} } {\rm{ }} \Rightarrow {\rm{ }}\overline {\underline {\bf{X}} } = \underline {\overline {\bf{U}} } \,\underline {\overline {\bf{\Lambda }} } \,{\underline {\overline {\bf{U}} } ^T}$$ but that $$\overline {\underline {\bf{Y}} } = \underline {\overline {\bf{U}} } \left( {\overline {\underline {\bf{J}} } \,\underline {\overline {\bf{{\rm M}}} } \,\overline {\underline {\bf{J}} } } \right){\underline {\overline {\bf{U}} } ^T} = \underline {\overline {\bf{U}} } \,{\underline {\overline {\bf{{\rm M}}} } _r}{\underline {\overline {\bf{U}} } ^T}{\rm{ where }}{\underline {\overline {\bf{{\rm M}}} } _r} = \left[ {\begin{array}{*{20}{c}} {{\mu _D}}& \cdots &0\\ \vdots & \ddots & \vdots \\ 0& \cdots &{{\mu _1}} \end{array}} \right]$$