Derivative of projection's norm squared with respect to a matrix


Background: Let $M^{n\times k}(\mathbb{R})$ denote the $n\times k$ matrices with real entries. For any smooth function $f: M^{n\times k}(\mathbb{R}) \to \mathbb{R}$, define the derivative $\frac{\partial f}{\partial A}$ by $$\left(\frac{\partial f}{\partial A}\right)_{i,j}:= \frac{\partial f}{\partial a_{ij}},$$

where $a_{i,j}$ is the entry of $A$ at the $i$-th row and $j$-th column. For any smooth curve $c:(-\epsilon,\epsilon)\to M^{n\times k}(\mathbb{R})$ with $c(0) = A, \dot c(0) = v$, $\frac{\partial f}{\partial A}$ satisfies: $$\frac{d}{dt}|_{t=0}f(c(t)) = \text{trace}\left(v^T \frac{\partial f}{\partial A}\right).$$
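As a quick sanity check of this definition, the trace identity can be verified numerically for a simple function such as $f(A)=\|A\|_F^2$, whose gradient is $2A$. A minimal sketch in NumPy (the dimensions and random seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3
A = rng.standard_normal((n, k))
v = rng.standard_normal((n, k))

# Test function f(A) = ||A||_F^2 = trace(A^T A), whose gradient is df/dA = 2A.
def f(X):
    return np.sum(X * X)

grad = 2 * A

# Directional derivative along the curve c(t) = A + t v, via central differences.
t = 1e-6
numeric = (f(A + t * v) - f(A - t * v)) / (2 * t)
exact = np.trace(v.T @ grad)  # trace(v^T df/dA)
```

The two numbers agree to within finite-difference error, as the identity predicts.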

Question: Let $f: M^{n\times k}(\mathbb{R}) \to \mathbb{R}$ be defined by $$f(A):= \|(I - A(A^TA)^{-1}A^T) z\|^2,$$ where $z \in \mathbb{R}^n$ is a fixed vector and $\|\cdot\|$ denotes the Euclidean norm. I would like to compute $\frac{\partial f}{\partial A}$, at least at any $A$ which is full rank. Are there any tricks that can simplify this process, or is the easiest thing simply to compute each $\frac{\partial f}{\partial a_{i,j}}$ by "brute force"?

My attempt at a simpler solution: Let $c:(-\epsilon,\epsilon)\to M^{n\times k}(\mathbb{R})$ be a smooth curve with $c(0) = A, \dot c(0) = v$. I will assume that $A$ is full rank.

By the chain and product rules, we have $$\frac{d}{dt}|_{t=0}f(c(t)) = 2 z^T(I-A(A^TA)^{-1}A^T)\left[-v (A^TA)^{-1}A^T + A(A^TA)^{-1}(v^TA + A^Tv)(A^TA)^{-1}A^T - A(A^TA)^{-1}v^T\right]z, $$ where I have used the fact that for any smooth curve $b:(-\epsilon,\epsilon)\to GL(n,\mathbb{R})$, $\frac{d}{dt}|_{t=0}b^{-1}(t) = -b^{-1}(0)\dot b(0) b^{-1}(0)$.

To simplify notation, let $A^\dagger:= (A^T A)^{-1} A^T$ denote the Moore-Penrose pseudoinverse of $A$ and let $\Pi_{A_\perp}:= I - AA^\dagger$ be the orthogonal projection onto the orthogonal complement of the range of $A$. Substitution of this new notation yields: $$\frac{d}{dt}|_{t=0}f(c(t)) = 2 z^T\Pi_{A_\perp}\left[-v A^\dagger + (A^\dagger)^T(v^TA + A^Tv)A^\dagger - (A^\dagger)^T v^T\right]z. $$ Now, the kernel of $A^\dagger$ is equal to the orthogonal complement to the range of $A$, so the range of $(A^\dagger)^T$ is equal to the range of $A$. It follows that $\Pi_{A_\perp} (A^\dagger)^T = 0$. Thus we now have a simpler expression:

$$\frac{d}{dt}|_{t=0}f(c(t)) = -2 z^T\Pi_{A_\perp}v A^\dagger z = -2\text{trace}\left(z^T(A^\dagger)^Tv^T\Pi_{A_\perp}z\right) = -2\text{trace}\left(v^T \Pi_{A_{\perp}}zz^T(A^\dagger)^T\right).$$

Since $v \in M^{n\times k}(\mathbb{R})$ was arbitrary, it follows that

$$\frac{\partial f}{\partial A}(A) = -2 \Pi_{A_\perp} zz^T(A^\dagger)^T.$$
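This closed form can be sanity-checked against entrywise finite differences; a sketch in NumPy (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 3
A = rng.standard_normal((n, k))
z = rng.standard_normal(n)

def f(X):
    # f(X) = ||(I - X (X^T X)^{-1} X^T) z||^2
    r = z - X @ np.linalg.solve(X.T @ X, X.T @ z)
    return r @ r

# Claimed gradient: -2 * Pi_perp z z^T (A^dagger)^T,
# with A^dagger = (A^T A)^{-1} A^T and Pi_perp = I - A A^dagger.
Adag = np.linalg.solve(A.T @ A, A.T)           # k x n
Pi_perp = np.eye(n) - A @ Adag
grad = -2 * Pi_perp @ np.outer(z, z) @ Adag.T  # n x k

# Compare with central finite differences, entry by entry.
num = np.zeros_like(A)
h = 1e-6
for i in range(n):
    for j in range(k):
        E = np.zeros_like(A)
        E[i, j] = h
        num[i, j] = (f(A + E) - f(A - E)) / (2 * h)

err = np.max(np.abs(num - grad))
```

The maximum entrywise discrepancy is at the level of finite-difference noise.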

There are 3 answers below.

Answer 1:

For convenience, define a new variable and its differential (borrowing a result on the differential of a projector from Harville's "Matrix Algebra From a Statistician's Perspective"):
$$\eqalign{ M &= I-AA^\dagger \cr dM &= -d(AA^\dagger) \cr &= -M\,dA\,A^\dagger - (A^\dagger)^T\,dA^T\,M^T \cr &= -2\,{\rm sym}(M\,dA\,A^\dagger). \cr }$$

Then write the function in terms of the Frobenius inner product (denoted $:$) and find its differential:
$$\eqalign{ f &= Mz:Mz \cr df &= 2\,Mz:dM\,z \cr &= 2\,Mzz^T:dM \cr &= -2\,Mzz^T:2\,{\rm sym}(M\,dA\,A^\dagger) \cr &= -2\,(2\,{\rm sym}(Mzz^T)):M\,dA\,A^\dagger \cr &= -2\,(Mzz^T+zz^TM^T):M\,dA\,A^\dagger \cr &= -2\,M^T(Mzz^T+zz^TM^T)(A^\dagger)^T:dA. \cr }$$

Since $df=\big(\frac{\partial f}{\partial A}:dA\big)$, the gradient must be
$$\frac{\partial f}{\partial A} = -2\,M^T(Mzz^T+zz^TM^T)(A^\dagger)^T.$$

Since $M$ is an orthogonal projector, we know several additional facts,
$$\eqalign{ M^2 &= M \cr M^T &= M \cr M(A^\dagger)^T &= 0 \cr MA &= 0, \cr }$$
which can be used to simplify the result to
$$\frac{\partial f}{\partial A} = -2\,Mzz^T(A^\dagger)^T.$$
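The key differential identity in this answer, $dM = -2\,{\rm sym}(M\,dA\,A^\dagger)$, can itself be checked numerically by comparing the predicted first-order change in the projector against the actual change under a small perturbation. A sketch in NumPy (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 3
A = rng.standard_normal((n, k))
dA = 1e-6 * rng.standard_normal((n, k))  # small perturbation

def proj_perp(X):
    # M(X) = I - X (X^T X)^{-1} X^T, the projector onto range(X)^perp
    return np.eye(X.shape[0]) - X @ np.linalg.solve(X.T @ X, X.T)

M = proj_perp(A)
Adag = np.linalg.solve(A.T @ A, A.T)  # pseudoinverse of full-rank A

# First-order prediction: dM = -2 sym(M dA A^dagger) = -(S + S^T)
S = M @ dA @ Adag
dM_pred = -(S + S.T)

# Actual change in the projector
dM_true = proj_perp(A + dA) - M

rel_err = np.max(np.abs(dM_true - dM_pred)) / np.max(np.abs(dM_pred))
```

The relative error is on the order of the perturbation size, consistent with a first-order identity.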

Answer 2:

What follows is a geometric proof. Let $y=\Pi_{A_\perp} z$; then $y=z-AA^\dagger z$, where $AA^\dagger z$ is the orthogonal projection of $z$ onto $\operatorname{im}(A)$. Since $f(A)=y^Ty$, $$Df_A(H)=2y^T\,Dy(H)=2y^T\,D(z-AA^\dagger z)(H)=-2y^T\,D(AA^\dagger z)(H)=-2y^T\big(HA^\dagger z+A\,D(A^\dagger z)(H)\big),$$ and since $y^TA=0$ (because $y$ is orthogonal to the range of $A$), the second term vanishes, so finally

$Df_A(H)=-2y^THA^\dagger z=-2tr(H^Tyz^T(A^\dagger)^T)$.

Thus the gradient of $f$ is $\nabla f(A)=-2y(A^\dagger z)^T$. Since $A$ is one-to-one (full rank), $x=A^\dagger z$ is the unique $x$ such that the orthogonal decomposition $z=y+Ax$ holds. Thus $\nabla f(A)=-2yx^T$.

EDIT. Let $H$ be a small matrix, let $\tilde{y}$ denote the residual associated with $A+H$ (so that $f(A+H)=\|\tilde{y}\|^2$), and assume that $y\not= 0$.

Then $||\tilde{y}||^2-||y||^2\approx -2y^THx$, which implies $||\tilde{y}||-||y||\approx \dfrac{-y^THx}{||y||}$.
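This first-order approximation can be checked numerically: for a small perturbation $H$, the change in $\|y\|$ should match $-y^THx/\|y\|$ up to second-order terms. A sketch in NumPy (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 3
A = rng.standard_normal((n, k))
z = rng.standard_normal(n)
H = 1e-6 * rng.standard_normal((n, k))  # small perturbation

def residual(X):
    # y(X) = z - X (X^T X)^{-1} X^T z, the residual of projecting z onto range(X)
    return z - X @ np.linalg.solve(X.T @ X, X.T @ z)

y = residual(A)
x = np.linalg.solve(A.T @ A, A.T @ z)   # x = A^dagger z
y_tilde = residual(A + H)

actual = np.linalg.norm(y_tilde) - np.linalg.norm(y)
predicted = -(y @ H @ x) / np.linalg.norm(y)
```

The two quantities agree up to terms quadratic in the size of $H$.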

Answer 3:

You can pretty much do what @frank suggested, but without the need for the pseudoinverse.

Let $$\eqalign{ B &= A^TA \cr M &= I - AB^{-1}A^T \cr }$$ both of which are symmetric. Notice that $A^TM=0$, which will be useful later.

Now we can express the function, its differential, and the gradient as
$$\eqalign{ f &= Mz:Mz \cr\cr df &= 2\,Mz:dM\,z \cr &= -2\,Mzz^T:\big(dA\,B^{-1}A^T + A\,dB^{-1}A^T + AB^{-1}dA^T\big) \cr &= -\big(2\,Mzz^TAB^{-1}:dA\big) - \big(2\,A^TMzz^TA:dB^{-1}\big) - \big(2\,B^{-1}A^TMzz^TA:dA^T\big) \cr &= -2\,Mzz^TAB^{-1}:dA \,-\, (0) \,-\, (0), \cr }$$
where the last two terms vanish because $A^TM=0$. Therefore
$$\eqalign{ \frac{\partial f}{\partial A} &= -2\,Mzz^TAB^{-1} \cr &= 2\,\Big(A(A^TA)^{-1}A^T-I\Big)zz^TA(A^TA)^{-1}. \cr }$$
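Since $(A^\dagger)^T = A(A^TA)^{-1} = AB^{-1}$ when $A$ has full column rank, this gradient coincides with the pseudoinverse form obtained in the earlier answers; a quick NumPy check (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 3
A = rng.standard_normal((n, k))
z = rng.standard_normal(n)

B = A.T @ A
M = np.eye(n) - A @ np.linalg.solve(B, A.T)

# Gradient without the pseudoinverse: -2 M z z^T A B^{-1}
grad_B = -2 * M @ np.outer(z, z) @ A @ np.linalg.inv(B)

# Pseudoinverse form from the earlier answers: -2 M z z^T (A^dagger)^T
Adag = np.linalg.solve(B, A.T)
grad_dag = -2 * M @ np.outer(z, z) @ Adag.T

diff = np.max(np.abs(grad_B - grad_dag))
```

The two expressions differ only by floating-point rounding.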