Derivative of $\varphi({\bf X}) = \sum_{i=1}^n \lambda_i({\bf X}) \log \lambda_i({\bf X})$

132 Views Asked by At

Let $\mathbb{S}_+^n$ denote the set of $n \times n$ symmetric positive definite matrices. Let scalar field $\varphi : \mathbb{S}_+^n \to \Bbb R$ be defined by

$$\varphi({\bf X}) := \sum_{i=1}^n \lambda_i({\bf X}) \log \lambda_i({\bf X})$$

where $\lambda_1({\bf X}), \ldots, \lambda_n({\bf X})$ are the eigenvalues of the matrix ${\bf X} \in \mathbb{S}_+^n$. What is the gradient of $\varphi$?

3

There are 3 best solutions below

0
On BEST ANSWER

$ \def\P{P^{-1}} \def\PT{P^{-T}} \def\p{\partial} \def\l{\lambda} \def\v{\varphi} $The Matrix Cookbook contains a particularly simple rule for the gradient of the trace of a scalar function applied to a matrix argument $$\eqalign{ \v &= \operatorname{Tr}(f(X)) \quad\implies\quad \frac{\p\v}{\p X} &= \Big[f'(X)\Big]^T }$$ Consider the following function (and its derivative) $$f(z)=z\,\log(z),\qquad f'(z) = 1+\log(z)$$ If the eigenvalues of $X$ are $\l_k,$ then the eigenvalues of the matrix $f(X)$ are $f(\l_k)$
Expanding the trace of the proposed function yields $$\eqalign{ \operatorname{Tr}(f(X)) = \sum_k f(\l_k) = \sum_k \l_k\log(\l_k) \\ }$$ which is indeed the objective function.
Therefore the required gradient is $$\eqalign{ \frac{\p\v}{\p X} &= \Big[I+\log(X)\Big]^T \\ }$$

1
On

$ \def\Sk{\sum_{k=1}^n} \def\l{\lambda} \def\o{{\tt1}} \def\LR#1{\left(#1\right)} \def\BR#1{\Big(#1\Big)} \def\op#1{\operatorname{#1}} \def\Diag#1{\op{Diag}\LR{#1}} \def\diag#1{\op{diag}\LR{#1}} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} $Assuming that all eigenvalues are distinct, and the left and right eigenvectors are $$\eqalign{ Xu_k &= \l_k u_k,\quad X^Tw_k &= \l_k w_k \qiq w_k^TX &= \l_k w_k^T \\ }$$ Then the derivatives of the eigenvalues wrt $X$ can be calculated as $$\eqalign{ &Xu_k = \l_k u_k \\ &dX\:u_k + X\:du_k = d\l_k\:u_k + \l_k\:du_k \\ &w_k^T\BR{dX\:u_k + X\:du_k} = w_k^T\BR{d\l_k\:u_k + \l_k\:du_k} \\ &{w_k^TdX\:u_k + \c{\l_k\:w_k^Tdu_k}} = {d\l_k\:w_k^Tu_k + \c{\l_k\:w_k^Tdu_k}} \\ &w_k^TdX\:u_k = d\l_k\LR{w_k^Tu_k} \qiq \grad{\l_k}{X} = \fracLR{w_ku_k^T}{w_k^Tu_k} \\ }$$ Differentiate the given function wrt $\l_k$ $$\eqalign{ \grad{\psi}{\l_k} &= {\o+\log(\l_k)} \\ }$$ and then wrt $X$ $$\eqalign{ \grad{\psi}{X} &= \Sk\BR{\o+\log(\l_k)}\gradLR{\l_k}{X} \\ &= \Sk\BR{\o+\log(\l_k)}\fracLR{w_ku_k^T}{w_k^Tu_k} \\ }$$

0
On

Let $f(x) = x\log x$. Then applying the functional calculus to the definition of $\varphi$ leads to

$$ \varphi(\mathbf{X}) = \operatorname{Tr}(f(\mathbf{X})) $$

for each $\mathbf{X} \in \mathbb{S}^n_+$. Then the derivative of $\varphi$ is a linear functional $\ell$ on the space $\mathbb{S}^n$ of all $n\times n$ symmetric real matrices such that

$$ \varphi(\mathbf{X}+\mathbf{H}) = \varphi(\mathbf{X}) + \ell(\mathbf{H}) + o(\mathbf{H}) \qquad\text{as}\quad \mathbf{H} \to 0.$$

Our goal is to identify the linear functional $\ell$ if possible. In doing so, we may endow $\mathbf{S}^n$ with the inner product $\langle \mathbf{X}, \mathbf{Y} \rangle = \operatorname{Tr}(\mathbf{X}^{\top}\mathbf{Y})$ and utilize the Riesz representation theorem to identify $\ell$ with an element $\mathbf{L} \in \mathbb{S}^n$ via the relation

$$ \ell(\mathbf{H}) = \operatorname{Tr}(\mathbf{L}^{\top}\mathbf{H}) .$$

We claim the following:

Claim. $\mathbf{L} = f'(\mathbf{X})^{\top} = f'(\mathbf{X})$.

To this end, let $\mathbf{X}_t$ be a smooth path in $\mathbb{S}^n_+$. Then $\mathbf{L}$ and $\mathbf{X}_t$ are related by

$$ \operatorname{Tr}(\mathbf{L}^{\top}\mathbf{X}_0') = \frac{\mathrm{d}}{\mathrm{d}t}\biggr|_{t=0} \varphi(\mathbf{X}_t). $$

To facilitate the computation of the right-hand side, we represent $\mathbf{X}_t$ in the form

$$ \mathbf{X}_t = \mathbf{U}_t^{\top} \mathbf{D}_t \mathbf{U}_t, $$

where $t \mapsto \mathbf{U}_t$ is a smooth path of orthogonal matrices and $t \mapsto \mathbf{D}_t$ is a smooth path of positive diagonal matrices. Then

\begin{align*} \frac{\mathrm{d}}{\mathrm{d}t}\biggr|_{t=0} \varphi(\mathbf{X}_t) = \frac{\mathrm{d}}{\mathrm{d}t}\biggr|_{t=0} \operatorname{Tr}(f(\mathbf{D}_t)) = \operatorname{Tr}(f'(\mathbf{D}_0)\mathbf{D}_0') = \operatorname{Tr}(f'(\mathbf{X}_0)\mathbf{U}_0^{\top}\mathbf{D}_0'\mathbf{U}_0). \end{align*}

On the other hand,

\begin{align*} \mathbf{X}'_0 &= \mathbf{U}_0^{\top}\mathbf{D}_0'\mathbf{U}_0 + (\mathbf{U}_0^{\top})'\mathbf{D}_0\mathbf{U}_0 + \mathbf{U}_0^{\top}\mathbf{D}_0\mathbf{U}_0' \\ &= \mathbf{U}_0^{\top}\mathbf{D}_0'\mathbf{U}_0 - \mathbf{U}_0^{\top}\mathbf{U}_0'\mathbf{X}_0 + \mathbf{X}_0\mathbf{U}_0^{\top}\mathbf{U}_0'. \end{align*}

Plugging this back, it follows that

\begin{align*} \frac{\mathrm{d}}{\mathrm{d}t}\biggr|_{t=0} \varphi(\mathbf{X}_t) &= \operatorname{Tr}(f'(\mathbf{X}_0)(\mathbf{X}'_0 + \mathbf{U}_0^{\top}\mathbf{U}_0'\mathbf{X}_0 - \mathbf{X}_0\mathbf{U}_0^{\top}\mathbf{U}_0')) \\ &= \operatorname{Tr}(f'(\mathbf{X}_0)\mathbf{X}'_0) + \operatorname{Tr}(f'(\mathbf{X}_0)\mathbf{U}_0^{\top}\mathbf{U}_0'\mathbf{X}_0) - \operatorname{Tr}(f'(\mathbf{X}_0)\mathbf{X}_0\mathbf{U}_0^{\top}\mathbf{U}_0'). \end{align*}

In the last line, it turns out that the last two terms cancel out each other, since

$$ \operatorname{Tr}(f'(\mathbf{X}_0)\mathbf{U}_0^{\top}\mathbf{U}_0'\mathbf{X}_0) = \operatorname{Tr}(\mathbf{X}_0f'(\mathbf{X}_0)\mathbf{U}_0^{\top}\mathbf{U}_0') = \operatorname{Tr}(f'(\mathbf{X}_0)\mathbf{X}_0\mathbf{U}_0^{\top}\mathbf{U}_0'). $$

Therefore it follows that

\begin{align*} \frac{\mathrm{d}}{\mathrm{d}t}\biggr|_{t=0} \varphi(\mathbf{X}_t) &= \operatorname{Tr}(f'(\mathbf{X}_0)\mathbf{X}'_0) \end{align*}

and we are done.