Changing from differentiation w.r.t. a matrix to differentiation w.r.t. the inverse of the matrix, for symmetric matrices


For the rule below:

$$ \frac{\partial J}{\partial \mathbf{A}}= -\mathbf{A}^{-T} \frac{\partial J}{\partial \mathbf{W}} \mathbf{A}^{-T} $$

where $\mathbf{A}$ is an invertible square matrix, $\mathbf{W} = \mathbf{A}^{-1}$, and $J$ is a scalar function (see the end of Section 2.2 of The Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf).

Does this rule hold if $\mathbf{A}$ is a symmetric matrix?


There are 2 answers below.

BEST ANSWER

If $A$ is symmetric but not invertible, the rule cannot hold, since the inverse of $A$ is not even defined.

And if $A$ is symmetric and invertible... then it is invertible, and the formula holds just as it does for any invertible matrix.
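A quick numerical sanity check, sketched in NumPy with an arbitrary test function $J(W) = {\rm Tr}(C^TW)$, so that $\partial J/\partial W = C$ (the choice of $J$, the matrix size, and the random data are all just assumptions for the test). Note that the finite differences perturb every entry of $A$ independently, i.e. this checks the formula in the unstructured sense:

```python
import numpy as np

# Check dJ/dA = -A^{-T} (dJ/dW) A^{-T}  with  W = inv(A)
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M + M.T + 8 * np.eye(4)                 # symmetric and safely invertible
C = rng.standard_normal((4, 4))             # J(W) = Tr(C^T W), so dJ/dW = C

J = lambda A: np.sum(C * np.linalg.inv(A))  # J viewed as a function of A

Ainv = np.linalg.inv(A)
formula = -Ainv.T @ C @ Ainv.T              # the Cookbook rule

# Finite differences, perturbing each entry of A independently
# (this is the *unstructured* gradient -- symmetry is not enforced)
eps, fd = 1e-6, np.zeros_like(A)
for i in range(4):
    for j in range(4):
        dA = np.zeros_like(A)
        dA[i, j] = eps
        fd[i, j] = (J(A + dA) - J(A - dA)) / (2 * eps)

print(np.allclose(fd, formula, atol=1e-5))  # True
```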

SECOND ANSWER

$ \def\b{\bullet} \def\e{\varepsilon} \def\m#1{\left[\begin{array}{c}#1\end{array}\right]} \def\p#1#2{\frac{\partial #1}{\partial #2}} $I really like The Matrix Cookbook, but the section on structured matrices is not very good, so here's a different approach to the subject.

Given a vector of parameters $\{p\}$ and matrix basis $\{B_i\}$ $$\eqalign{ p &= \m{\alpha \\ \beta},\qquad B_1 = \m{1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0},\qquad B_2 = \m{0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0} \\ }$$ create a structured matrix $\{A\}$ and cost function $\{\phi\}$ $$\eqalign{ A &= \sum_{i=1}^2\;p_iB_i \;=\; \m{\alpha & \alpha & 0 & 0 \\ 0 & \beta & 0 & 0 \\ 0 & \beta & \beta & 0},\qquad &\phi = \tfrac 12\Big\|AX-Y\Big\|_F^2 \\ }$$ Note that $(\alpha,\beta)$ are the only independent variables in the entire problem.
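This construction is easy to play with numerically; here is a minimal NumPy sketch of the map $p \mapsto A$ (the basis matrices are the ones above, the parameter values are arbitrary):

```python
import numpy as np

# Basis for the structured 3x4 matrices of the example
B1 = np.array([[1, 1, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]], dtype=float)
B2 = np.array([[0, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 1, 1, 0]], dtype=float)

def assemble(p, basis):
    """A = sum_i p_i B_i : map the parameter vector to the structured matrix."""
    return sum(pi * Bi for pi, Bi in zip(p, basis))

p = np.array([2.0, -3.0])            # (alpha, beta), arbitrary values
print(assemble(p, [B1, B2]))
# [[ 2.  2.  0.  0.]
#  [ 0. -3.  0.  0.]
#  [ 0. -3. -3.  0.]]
```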

When $A$ is unconstrained it's easy to calculate the gradient/differential of the cost $$\eqalign{ G = \p{\phi}{A} = (AX-Y)X^T \quad\implies\quad d\phi = G\b dA \\ }$$ where the bullet denotes the matrix inner product, i.e. $$\eqalign{ G\b dA &= \sum_{i=1}^3\sum_{j=1}^4 G_{ij}\;dA_{ij} \;=\; {\rm Tr}(G^TdA) \\ }$$ Because of the structure which was imposed on $A$, its differential is also structured $$dA = \sum_{i=1}^2 B_i\,dp_i$$ Substituting this expression leads to the parametric gradient $$\eqalign{ d\phi &= \sum_{i=1}^2\;G\b(B_i\,dp_i) = \sum_{i=1}^2\left(\p{\phi}{p_i}\right)dp_i \quad\implies\quad \p{\phi}{p_i} = G\b B_i \\ }$$ At this point, one would do all further calculations in terms of the $p$-vector.
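This is straightforward to verify numerically; a short NumPy sketch (the random $X$, $Y$ and the parameter values are made up purely for the test) that checks $\p{\phi}{p_i} = G\b B_i$ against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)

# Same structured basis as above
B1 = np.zeros((3, 4)); B1[0, 0] = B1[0, 1] = 1.0
B2 = np.zeros((3, 4)); B2[1, 1] = B2[2, 1] = B2[2, 2] = 1.0
basis = [B1, B2]
X, Y = rng.standard_normal((4, 5)), rng.standard_normal((3, 5))  # test data

def phi(p):
    """Cost phi = (1/2) ||A X - Y||_F^2 with A = p[0] B1 + p[1] B2."""
    A = p[0] * B1 + p[1] * B2
    return 0.5 * np.linalg.norm(A @ X - Y, 'fro') ** 2

p = np.array([2.0, -3.0])
A = p[0] * B1 + p[1] * B2
G = (A @ X - Y) @ X.T                 # unstructured gradient dphi/dA

# Parametric gradient: dphi/dp_i = G • B_i (Frobenius inner product)
grad_p = np.array([np.sum(G * Bi) for Bi in basis])

# Central finite differences on the parameter vector
eps = 1e-6
fd = np.array([(phi(p + eps * e) - phi(p - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(np.allclose(grad_p, fd))        # True
```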

Now comes the weird part...

Every basis $\{B_i\}$ has a dual basis $\{B_i^\delta\}$ which spans the same subspace $\cal S$, but is biorthogonal to it with respect to the inner product $$B_i\b B_j^\delta \;=\; \delta_{ij}$$ Some bases are self-dual, such as the canonical vector basis $\{\e_i\}$, but in general determining the dual basis requires a pseudoinverse calculation $$\eqalign{ &\;b_k = {\rm vec}(B_k) \qquad &\;b_k^\delta = {\rm vec}(B_k^\delta) \\ &\m{b_1 & b_2 &\ldots & b_p}^+ = &\m{b_1^\delta & b_2^\delta &\ldots & b_p^\delta}^T \\ }$$ In the vector case, the gradient with respect to the $p$-vector can be written as the sum of each component multiplied by the corresponding vector from the dual basis, i.e. $$\eqalign{ \p{\phi}{p} &= \sum_{i=1}^2 \left(\p{\phi}{p_i}\right)\e_i \\ }$$ Many authors extend this idea and define the structured gradient as the matrix $$\eqalign{ \left(\p{\phi}{A}\right)_S &= \sum_{i=1}^2\left( \p{\phi}{p_i} \right) B_i^\delta \\ &= \sum_{i=1}^2\left(G\b B_i\right) B_i^\delta \\ &= G\b\left(\sum_{i=1}^2 B_i\otimes B_i^\delta \right) \\ &= G\b{\cal B} \\ }$$ where $\cal B$ is a fourth-order tensor with components $${\cal B}_{jk\ell m} = \sum_{i=1}^2\;\left(B_i\right)_{jk}\,\left(B_i^\delta\right)_{\ell m}$$ The $\cal B$ tensor is a projector onto the subspace $\big(\,{\cal B}\b X\in{\cal S}\;\;{\rm for}\;X\in{\mathbb R}^{3\times 4}\big)$, where it also acts as an identity tensor for the subspace $\big({\cal B}\b M=M\b{\cal B} = M\;\;{\rm for}\;M\in{\cal S}\big)$.
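The pseudoinverse recipe is easy to carry out numerically; a NumPy sketch using the $3\times 4$ basis from above (here `ravel` stands in for ${\rm vec}$; row-major versus column-major ordering doesn't matter as long as it's used consistently):

```python
import numpy as np

# The 3x4 basis from the example
B1 = np.zeros((3, 4)); B1[0, 0] = B1[0, 1] = 1.0
B2 = np.zeros((3, 4)); B2[1, 1] = B2[2, 1] = B2[2, 2] = 1.0
basis = [B1, B2]

# Stack vec(B_k) as columns, pseudoinvert, read dual vectors off the rows
Bmat = np.column_stack([Bi.ravel() for Bi in basis])        # 12 x 2
dual = [row.reshape(3, 4) for row in np.linalg.pinv(Bmat)]  # rows of the 2 x 12 pinv

# Biorthogonality: B_i • B_j^delta = delta_ij
gram = np.array([[np.sum(Bi * Dj) for Dj in dual] for Bi in basis])
print(np.allclose(gram, np.eye(2)))                         # True
```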

If the basis spans the whole space $\,{\cal S}\equiv{\mathbb R}^{3\times 4}\,$ then $\cal B$ becomes the true identity tensor $\cal I$, and the structured gradient is identical to the full unstructured gradient $G$ (as expected). $$\eqalign{ {\cal B}_{jk\ell m} \;&\to\; {\cal I}_{jk\ell m} = \delta_{j\ell}\delta_{km} \\ (G\b{\cal B}) \;&\to\; (G\b{\cal I}) = G \\ }$$
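A quick confirmation with the canonical (self-dual) basis of ${\mathbb R}^{2\times 2}$, where the structured gradient should reproduce $G$ exactly:

```python
import numpy as np

# Canonical (self-dual) basis of R^{2x2}: the structured gradient is G itself
basis = [np.eye(1, 4, k).reshape(2, 2) for k in range(4)]
G = np.array([[1., 2.], [3., 4.]])
G_S = sum(np.sum(G * B) * B for B in basis)   # dual basis equals the basis here
print(np.allclose(G_S, G))                    # True
```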


As a concrete example, let's examine a symmetrically constrained $2\times 2$ matrix. $$\eqalign{ p &= \m{\alpha \\ \beta \\ \lambda},\qquad B_1 = \m{1 & 0 \\ 0 & 0},\qquad B_2 = \m{0 & 0 \\ 0 & 1},\qquad B_3 = \m{0 & 1 \\ 1 & 0} \\ A &= \m{\alpha & \lambda \\ \lambda & \beta} \quad=\quad \alpha B_1 + \beta B_2 + \lambda B_3,\qquad B_k^\delta = \frac{B_k}{B_k\b B_k} \\ }$$ The structured gradient calculation then goes as follows $$\eqalign{ \left(\p{\phi}{A}\right)_S &= \frac{(G\b B_1)B_1}{B_1\b B_1} + \frac{(G\b B_2)B_2}{B_2\b B_2} + \frac{(G\b B_3)B_3}{B_3\b B_3} \\ &= G_{11}\,B_1 +G_{22}\,B_2 +\tfrac 12(G_{12}+G_{21})\,B_3 \\ &= \m{G_{11} & \tfrac 12(G_{12}+G_{21}) \\ \tfrac 12(G_{12}+G_{21}) & G_{22}} \\ &= \left(\frac{G+G^T}{2}\right) \;\doteq\; {\rm Sym}(G) \\ }$$ But The Matrix Cookbook uses the regular basis instead of the dual basis, which results in the following miscalculation $$\eqalign{ \left(\p{\phi}{A}\right)_{S^*} &= \left(G\b B_1\right)B_1 + \left(G\b B_2\right)B_2 + \left(G\b B_3\right)B_3 \\ &= G_{11}\,B_1 +G_{22}\,B_2 +(G_{12}+G_{21})\,B_3 \\ &= \m{G_{11} & (G_{12}+G_{21}) \\ (G_{12}+G_{21}) & G_{22}} \\ &= G+G^T-{\rm Diag}(G) \\ }$$
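The practical difference between the two conventions shows up when you test the pairing property $\left(\p{\phi}{A}\right)_S\b B_i = \p{\phi}{p_i}$ against finite differences. In the NumPy sketch below (random test data assumed), the dual-basis gradient passes and the regular-basis one fails on the off-diagonal parameter:

```python
import numpy as np

rng = np.random.default_rng(2)
B = [np.array([[1., 0.], [0., 0.]]),   # B1
     np.array([[0., 0.], [0., 1.]]),   # B2
     np.array([[0., 1.], [1., 0.]])]   # B3
X, Y = rng.standard_normal((2, 3)), rng.standard_normal((2, 3))  # test data

def phi(p):
    A = sum(pi * Bi for pi, Bi in zip(p, B))
    return 0.5 * np.linalg.norm(A @ X - Y, 'fro') ** 2

p = rng.standard_normal(3)                       # (alpha, beta, lambda)
G = (sum(pi * Bi for pi, Bi in zip(p, B)) @ X - Y) @ X.T

sym = 0.5 * (G + G.T)                            # dual-basis result, Sym(G)
cookbook = G + G.T - np.diag(np.diag(G))         # regular-basis result

# dphi/dp_i by central finite differences
eps = 1e-6
fd = np.array([(phi(p + eps * e) - phi(p - eps * e)) / (2 * eps)
               for e in np.eye(3)])

print(np.allclose([np.sum(sym * Bi) for Bi in B], fd))       # True
print(np.allclose([np.sum(cookbook * Bi) for Bi in B], fd))  # False: off-diagonal doubled
```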

The skew-symmetric case is similar but is seldom mentioned.
There is only one parameter and one matrix in the basis $$\eqalign{ p &= \m{\alpha},\qquad B = \m{0 & 1 \\ -1 & 0},\qquad B^\delta = \frac{B}{B\b B} \\ A &= \m{0 & \alpha \\ -\alpha & 0} \;\;=\;\; \alpha B \\ }$$ and the structured gradient is $$\eqalign{ \left(\p{\phi}{A}\right)_S &= \frac{(G\b B)B}{B\b B} \\ &= \tfrac 12(G_{12}-G_{21})\,B \\ &= \m{0 & \tfrac 12(G_{12}-G_{21}) \\ \tfrac 12(G_{21}-G_{12}) & 0} \\ &= \left(\frac{G-G^T}{2}\right) \;\doteq\; {\rm Skew}(G) \\ }$$ If you use $B$ instead of $B^\delta$ in this case, the gradient has the right direction but the wrong length, i.e. $$\eqalign{ \left(\p{\phi}{A}\right)_{S^*} &= \left(G-G^T\right) \;=\; 2\;{\rm Skew}(G) \\ }$$
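The same finite-difference test exposes the factor of two (again a NumPy sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(3)
B = np.array([[0., 1.], [-1., 0.]])
X, Y = rng.standard_normal((2, 3)), rng.standard_normal((2, 3))  # test data

phi = lambda a: 0.5 * np.linalg.norm((a * B) @ X - Y, 'fro') ** 2

a = 1.3                                          # arbitrary parameter value
G = ((a * B) @ X - Y) @ X.T

skew = np.sum(G * B) / np.sum(B * B) * B         # dual basis: Skew(G)
doubled = np.sum(G * B) * B                      # regular basis: 2 Skew(G)

fd = (phi(a + 1e-6) - phi(a - 1e-6)) / 2e-6      # dphi/dalpha
print(np.isclose(np.sum(skew * B), fd))          # True
print(np.isclose(np.sum(doubled * B), fd))       # False: twice the length
```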