Gradient of $A \mapsto \sigma_i (A)$


Let $ A $ be an $m \times n$ matrix of rank $ k \le \min(m,n) $. Then we decompose $ A = USV^T $, where:

  • $U$ is an $m \times k$ semi-orthogonal matrix.

  • $S$ is a $k \times k$ diagonal matrix whose diagonal entries are the singular values of $ A $; we denote them by $ \sigma_i = S_{ii} $.

  • $V$ is an $n \times k$ semi-orthogonal matrix.

  • Definition: a semi-orthogonal matrix $ Q $ is a (possibly non-square) matrix with orthonormal columns, i.e. $ Q^{T}Q=I $.

This is the singular value decomposition (SVD) of the matrix $ A $. We define a function $ f_i: \mathbb R^{ m \times n} \to \mathbb R $ by $ f_i (A) = \sigma_i (A) $. I am interested in finding the gradient of $ f_i $ in order to practice matrix differentiation.


I hope you can help me get started. Here are the hints I was given for finding the solution; feel free to use them:

  1. Use the product rule of differentials to compute $ dA $, where $ A $ is considered as a function of $ U $, $ S $ and $ V $.
  2. The diagonal entries of an anti-symmetric matrix are all zero.
  3. The Hadamard product of two matrices $ A, B $ of the same size is defined entrywise by $$ (A \circ B )_{ij} = A_{ij} \cdot B_{ij} $$
  4. Use the cyclic property of the trace operator (a quick numerical check of this identity appears right after this list). That is:

    $$\mbox{Tr}(ABC) = \mbox{Tr}(CAB) = \mbox{Tr}(BCA)$$

  5. The trace of a scalar is the scalar itself. That is, given $ a \in \mathbb R $:

    $$ \mbox{Tr}(a) = a $$
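As a quick numerical sanity check of the trace identity in hint 4 (a minimal sketch with numpy; the matrix shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 3))

# Cyclic property of the trace: Tr(ABC) = Tr(CAB) = Tr(BCA)
t1 = np.trace(A @ B @ C)
t2 = np.trace(C @ A @ B)
t3 = np.trace(B @ C @ A)
print(np.allclose(t1, t2), np.allclose(t1, t3))  # True True
```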

I am stuck right at the beginning. I found that the product rule gives:

$$ dA = dUSV^{T} + UdSV^{T} + USdV^{T} $$

Also, I have tried to compute $ A^{T}A $, looking for a useful manipulation, and I found that it equals $ VS^{T}SV^{T} = VS^{2}V^{T} $. First of all, is this what they meant by the product rule? And second, how do I continue from here?
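As a quick numerical sanity check of this computation (a sketch with numpy, using the thin SVD returned by `np.linalg.svd`; the shapes are arbitrary), $A = USV^T$ and $A^TA = VS^2V^T$ indeed hold:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3
A = rng.standard_normal((m, n))

# Thin SVD: U is m x k, s holds the singular values, V is n x k (here k = min(m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)
V = Vt.T

print(np.allclose(A, U @ S @ V.T))           # A = U S V^T
print(np.allclose(A.T @ A, V @ S**2 @ V.T))  # A^T A = V S^T S V^T = V S^2 V^T
```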

There are 3 answers below.

BEST ANSWER

Let $\{e_i\}$ denote the standard basis vectors. Then $q_i=Qe_i$ is the $i^{th}$ column of $Q$. The definition of semi-orthogonality says that the columns of $Q$ are orthonormal, i.e. $$\eqalign{ I &= Q^TQ \\ e_i^T(I)e_j &= e_i^T(Q^TQ)e_j \\ \delta_{ij} &= q_i^Tq_j \\ }$$

Multiply the SVD by the $i^{th}$ columns of $(U,V)$ to isolate the $i^{th}$ singular value. $$\eqalign{ A &= \sum_{j=1}^k \sigma_j u_j v_j^T \\ u_i^TAv_i &= \sum_{j=1}^k \sigma_j (u_i^Tu_j)(v_j^Tv_i) = \sum_{j=1}^k \sigma_j\,\delta_{ij}^2 \;=\; \sigma_i \\ }$$

Rearrange this result with the help of the trace/Frobenius product $\Big(A\!:\!B={\rm Tr}\!\left(A^TB\right)\Big)$, then calculate the differential and gradient. $$\eqalign{ \sigma_i &= u_iv_i^T:A \\ d\sigma_i &= u_iv_i^T:dA \\ \frac{\partial\sigma_i}{\partial A} &= u_iv_i^T \\ }$$

Similarly, the singular vectors also vary with $A$. $$\eqalign{ \sigma_i u_i &= Av_i \\ \sigma_i u_i &= \left(v_i^T\otimes I_m\right){\rm vec}(A) \\ \sigma_i\,du_i &= \left(v_i^T\otimes I_m\right){\rm vec}(dA) \\ \frac{\partial u_i}{\partial{\rm vec}(A)} &= \frac{v_i^T\otimes I_m}{\sigma_i} \\ \\ \\ \sigma_i v_i^T &= u_i^TA \\ \sigma_i v_i &= \left(I_n\otimes u_i^T\right){\rm vec}(A) \\ \sigma_i\,dv_i &= \left(I_n\otimes u_i^T\right){\rm vec}(dA) \\ \frac{\partial v_i}{\partial{\rm vec}(A)} &= \frac{I_n\otimes u_i^T}{\sigma_i} \\ \\ }$$
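A finite-difference check of the result $\partial\sigma_i/\partial A = u_iv_i^T$ (a numerical sketch with numpy; the test matrix and the index $i$ are arbitrary, and the formula assumes $\sigma_i$ is a simple singular value, which holds for a generic random matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, i = 5, 4, 1                    # check the gradient of the 2nd singular value
A = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
G = np.outer(U[:, i], Vt[i, :])      # claimed gradient u_i v_i^T

# Forward differences: (sigma_i(A + eps*E_pq) - sigma_i(A)) / eps ~ G[p, q]
eps = 1e-6
num = np.zeros_like(A)
for p in range(m):
    for q in range(n):
        E = np.zeros_like(A)
        E[p, q] = eps
        num[p, q] = (np.linalg.svd(A + E, compute_uv=False)[i] - s[i]) / eps

print(np.max(np.abs(num - G)))       # small, of order eps
```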

ANSWER

Here we consider the eigenvalues of $B=A^TA$, a symmetric positive semi-definite matrix, with $\mathrm{spectrum}(B)=\{\sigma_1\geq \sigma_2\geq\cdots\}$ (these are the squares of the singular values of $A$). If the $\sigma_i$ are distinct, then they are differentiable, locally and even globally, with respect to the parameters. More precisely,

let $t\in(a,b)\mapsto B(t)\in sym_n$ be a smooth function. If, for every $t$, the eigenvalues of $B(t)$ are simple, then there are smooth local parametrizations of the spectrum: $\sigma_1(t),\cdots,\sigma_n(t)$.

$(*)$ More generally, this property still holds when the multiplicities of the eigenvalues are locally constant, and it remains valid even for non-symmetric matrices.

This is no longer the case when the eigenvalues may be multiple. There are counter-examples (due to Rellich, 1955) of smooth functions $B(t)$ with multiple eigenvalues such that one eigenvalue is only Lipschitz continuous (and not differentiable) and the associated eigenvector is not even continuous!

Yet, when $B(t)$ is analytic, we can do better.

$\textbf{Proposition.}$ Assume that $t\in\mathbb{R}\mapsto B(t)\in sym_n$ is analytic. Then there is a numbering of the eigenvalues $(\lambda_i)_{i\leq n}$ and an ordered basis of (unit-length) eigenvectors (associated to the $(\lambda_i)$) which are globally analytically parametrizable, even if the eigenvalues have multiplicities (i.e. their paths cross).

Note that the natural ordering of the eigenvalues is not necessarily preserved; for example, take

$B(t)=\mathrm{diag}(t+2,\,2t+2)$; when $t$ passes through $0$, $\sigma_1$ and $\sigma_2$ are exchanged. In particular, the ordered eigenvalues $\sigma_1,\sigma_2$ are not differentiable at $t=0$. However, the eigenvalue functions $\lambda_1=t+2$ and $\lambda_2=2t+2$ are differentiable everywhere.
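A small numerical illustration of this example (a sketch with numpy; the grid of $t$ values is arbitrary): the ordered eigenvalues swap at $t=0$, while the analytic branches $\lambda_1,\lambda_2$ stay smooth.

```python
import numpy as np

# B(t) = diag(t + 2, 2t + 2): the analytic branches are l1 = t + 2 and l2 = 2t + 2,
# while the *ordered* eigenvalues sigma_1 >= sigma_2 have a kink at t = 0.
for t in np.linspace(-0.5, 0.5, 5):
    B = np.diag([t + 2.0, 2.0 * t + 2.0])
    ordered = np.sort(np.linalg.eigvalsh(B))[::-1]
    branches = (t + 2.0, 2.0 * t + 2.0)
    print(f"t = {t:+.2f}   ordered = {ordered}   branches = {branches}")
```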

$\textbf{Remark 1}$. The above results hold only when $B$ depends on a single parameter $t$; if $B$ depends on $\geq 2$ parameters, or if $B$ is only a normal matrix, then the situation is much more complicated, cf. [4].

$\textbf{Remark 2}$. In general, $\sigma_i$ is Lipschitz and differentiable almost everywhere; when $\sigma_i(t_0)$ is a multiple eigenvalue, it has a derivative at $t_0$ if, in the setting of the above Proposition, there is a $j$ such that $\sigma_i=\lambda_j$ (at least locally). Note that, in general, this need not happen.

[1] Rellich: https://archive.org/details/perturbationtheo00rell/mode/2up

[2] Kazdan: https://arxiv.org/pdf/1903.00785.pdf

[3] About the roots of a polynomial, Michor: http://www.mat.univie.ac.at/~michor/roots.pdf

[4] Rainer: https://arxiv.org/pdf/1111.4475v2.pdf

ANSWER

Here is an answer using some basic convex analysis, with no coordinate computations and no use of the canonical basis. It is more convenient to work with $$f_r(A) = \sum_{i\le r} \sigma_i(A)$$ (the sum of the $r$ largest singular values). Of course, to deduce the derivative of the $r$-th singular value from the following argument, one can use $\sigma_r(A) = f_r(A)-f_{r-1}(A)$, which gives the derivative of $\sigma_r(A)$ whenever both $f_r$ and $f_{r-1}$ are differentiable at $A$.

We will gain much by noting that $f_r$ is convex, which follows from

$$f_r(A) = \sup_{M: \|M\|_{op}\le 1, rank(M)\le r} trace[A^TM].$$

Why is this supremum expression for $f_r(A)$ true? Writing the SVD $A=\sum_i u_i s_i v_i^T$, define $\tilde A = \sum_{i\le r} (s_i -s_r) u_i v_i^T$, so that $$ trace[A^TM] - trace[\tilde A^TM] = trace[(A-\tilde A)^T M] \le r s_r $$ because $(A-\tilde A)^T M$ has rank at most $r$ and operator norm at most $s_r$. Furthermore, $ trace[\tilde A^TM] = \sum_{i\le r}(s_i - s_r) u_i^T M v_i \le \sum_{i\le r}(s_i - s_r) $ because $M$ has operator norm at most 1 and $\|u_i\|=\|v_i\|=1$. Adding the two bounds proves $trace[A^TM] \le r s_r + \sum_{i\le r}(s_i - s_r) = \sum_{i\le r} s_i$, and we have equality for $M=\sum_{i\le r} u_i v_i^T$.
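This variational formula can be checked numerically (a sketch with numpy; the matrix sizes, the value of $r$, and the random feasible candidates $M$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 6, 5, 2
A = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

f_r = s[:r].sum()                                # sum of the r largest singular values
M_star = U[:, :r] @ Vt[:r, :]                    # maximizer M = sum_{i<=r} u_i v_i^T
print(np.isclose(f_r, np.trace(A.T @ M_star)))   # equality is attained at M_star

# Random feasible M with rank(M) <= r and ||M||_op <= 1 never beat f_r
for _ in range(1000):
    X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank <= r
    M = X / np.linalg.norm(X, 2)                 # operator norm exactly 1
    assert np.trace(A.T @ M) <= f_r + 1e-10
print("no feasible M exceeded f_r")
```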


Now that we have the supremum expression for $f_r(A)$, we know that $f_r$ is a convex function (because a function defined as the supremum of linear functions is always convex).

Next, we can use the fact that

a convex function $f_r:\mathbb R^{m\times n}\to \mathbb R$ is differentiable at $A$ if and only if the subdifferential of $f_r$ at $A$ is a singleton, and in this case the gradient of $f_r$ at $A$ is the unique element of the subdifferential.

(We do not prove this fact here, but any convex analysis book will give a proof.) Here the subdifferential $\partial f_r(A)$ is the set of all matrices $G$ of the same size as $A$ such that for any matrix $H$ of the same size as $A$, $$ f_r(A+H) \ge f_r(A) + trace[H^TG]. $$

If there is a jump between the $r$-th and $(r+1)$-th singular values, then for $H= - t M$ with $M=\sum_{i\le r} u_i v_i^T$ we have $$ f_r(A+H) = \sum_{i\le r} \sigma_i(A+H) = \sum_{i\le r} (s_i -t) $$ for all $t>0$ such that $s_r-t>s_{r+1}$. Then for $t$ small enough, any matrix $G$ in the subdifferential of $f_r$ at $A$ satisfies $$ \sum_{i\le r} (s_i - t) \ge \sum_{i\le r} s_i - t ~\text{trace}[M^TG], $$ or $0 \ge r - \text{trace}[M^TG]$.

Applying $f_r(A) + f_r(H) \ge f_r(A+H) \ge f_r(A) + trace[H^TG]$ to $H=G$ further gives $f_r(G) \ge \|G\|_F^2$, or in terms of singular values, $$ \sum_{i\le r}\sigma_i(G) \ge \sum_{i\ge 1} \sigma_i(G)^2. $$ By the Cauchy-Schwarz inequality, it follows that $\|G\|_F^2 \le r$. Also, $\|M\|_F^2 = r$. Thus the inequality $0 \ge r - trace[M^TG]$ implies $$0\ge \|G\|_F^2 + \|M\|_F^2 - 2\,trace[M^TG] = \|G-M\|_F^2,$$ so that $M=\sum_{i\le r} u_i v_i^T$ is the unique element in the subdifferential of $f_r$ at $A$.

By the above basic fact of convex analysis, it follows that the function $f_r$ is differentiable at $A$ whenever $\sigma_r(A)>\sigma_{r+1}(A)$, and that for $M=\sum_{i\le r}u_i v_i^T$ as above we have, for any $H$, $$ f_r(A+H) = f_r(A) + trace[M^TH] + o(\|H\|). $$
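A finite-difference check of this conclusion (a sketch with numpy; the matrix, the value of $r$, and the direction $H$ are arbitrary, and a generic random $A$ has $\sigma_r(A)>\sigma_{r+1}(A)$):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 5, 4, 2
A = rng.standard_normal((m, n))          # generic A, so sigma_r(A) > sigma_{r+1}(A)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
M = U[:, :r] @ Vt[:r, :]                 # claimed gradient of f_r at A

def f_r(X):
    return np.linalg.svd(X, compute_uv=False)[:r].sum()

# Directional derivative of f_r at A along a random H vs. trace(M^T H)
H = rng.standard_normal((m, n))
eps = 1e-6
dd = (f_r(A + eps * H) - f_r(A)) / eps
print(dd, np.trace(M.T @ H))             # agree up to O(eps)
```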


We can also prove that $f_r$ is not differentiable at $A$ if $\sigma_r(A)=\sigma_{r+1}(A)$. Indeed, in this case the subdifferential contains both $$ M = \sum_{i\le r} u_i v_i^T \quad\text{ and }\quad M' = \sum_{i\le r-1} u_i v_i^T + u_{r+1} v_{r+1}^T. $$ By the characterization of differentiability for convex functions, the subdifferential at $A$ is not a singleton and the function $f_r$ is not differentiable at $A$.