Matrix Derivative of Tichonov Regularization Operator

282 Views Asked by At

I'm not very familiar with matrix derivative and was wondering what are the first two derivatives of the map $$ X\mapsto (X^TX + \lambda I)^{-1}X^Ty, $$ should be; where $y$ is a fixed vector and $X$ is a vector also.

2

There are 2 best solutions below

8
On BEST ANSWER

Let's say $X$ is an $n\times p$ matrix. Then the $X\mapsto (X^T X+\lambda I)^{-1}X^T y$ mapping has an $p$-dimensional image and an $n\times p$ dimensional domain, which makes the derivatives quite complicated (a $n\times n\times p$ tensor).

Here I derive the partial derivative of $(X^TX +\lambda I)^{-1} X^T y$ with respect to $X_{ij}$, where $i\in[n]$ and $j\in[p]$. This partial derivative is a $p$-dimensional vector. You can assemble the $n\times p$ partial derivatives to get the full answer to your question. The derivation relies heavily on the matrix cookbook.

First, using Eq. (59) in the matrix cookbook, we have $$ \frac{\partial (X^T X+\lambda I)^{-1}}{\partial X_{ij}} = -(X^TX+\lambda I)^{-1}\frac{\partial(X^T X+\lambda I)}{\partial X_{ij}}(X^TX + \lambda I)^{-1}. $$ Using Eq. (80) in the matrix cookbook, we have $$ \frac{\partial(X^T X+\lambda I)}{\partial X_{ij}} = \frac{\partial X^T X}{\partial X_{ij}} = X^T J^{ij} + (J^{ij})^TX, $$ where $J^{ij}$ is an $n\times p$ matrix defined as $J^{ij}_{k\ell} = \delta_{ik}\delta_{j\ell}$, where $\delta$ is the Kronecker's delta function.

In addition, following simple definition we have that $$ \frac{\partial X^T y}{\partial X_{ij}} = y_ie_j, $$ where $e_j$ is an $p$-dimensional vector with the $j$th element being one, and all the other elements being zero.

We are now ready to give the partial derivative of $(X^T X+\lambda I)^{-1}X^T y$ with respect to $X_{ij}$ using chain rule: \begin{equation} \begin{aligned} & \frac{\partial(X^TX+\lambda I)^{-1}X^T y}{\partial X_{ij}}\\ = & \frac{\partial (X^TX+\lambda I)^{-1}}{\partial X_{ij}}X^T y + (X^T X+\lambda I)^{-1}\frac{\partial X^T y}{\partial X_{ij}}\\ =& -(X^T X+\lambda I)^{-1}(X^TJ^{ij}+(J^{ij})^TX)(X^TX+\lambda I)^{-1}X^T y + (X^T X+\lambda I)^{-1}(y_i\cdot e_j)\\ =&-(X^T X+\lambda I)^{-1}\left[(X^TJ^{ij}+(J^{ij})^TX)(X^TX+\lambda I)^{-1}X^T y - (y_i\cdot e_j)\right] . \end{aligned} \end{equation}

To obtain the second-order derivative (i.e., Hessian), we should re-use some of the terms in the first-order derivative expression. Define $\phi_{ij} := \partial(X^T X+\lambda I)^{-1}/\partial X_{ij}$ and $\psi_{ij} := \partial(X^T X+\lambda I)^{-1}X^T y/\partial X_{ij}$, which are $p$-dimensional vector. By chain rule, we have that \begin{align} &\frac{\partial^2(X^T X+\lambda I)^{-1}}{\partial X_{ij}\partial X_{k\ell}}\\ &= -\phi_{k\ell}^T\left[(X^TJ^{ij}+(J^{ij})^TX)(X^TX+\lambda I)^{-1}X^T y - (y_i\cdot e_j)\right]\\ &- (X^T X+\lambda I)^{-1}\left[((J^{k\ell})^TJ^{ij} + (J^{ij})^TJ^{k\ell})(X^T X+\lambda I)^{-1}X^T y + (X^TJ^{ij}+(J^{ij})^T X)\psi_{k\ell}\right]. \end{align}

1
On

Let's define some matrix variables $$\eqalign{ M &= M^T = X^TX + \lambda I \cr W &= W^T = M^{-1} \cr Q &= WX^T \cr }$$ some vector variables $$\eqalign{ q &= Qy \cr z &= y - Xq \cr }$$ and three isotropic 4th-order tensors $({\mathcal A, B, C})$ with components $$\eqalign{ {\mathcal A}_{ijkl} &= \delta_{ik}\,\delta_{jl} \cr {\mathcal B}_{ijkl} &= \delta_{il}\,\delta_{jk} \cr {\mathcal C}_{ijkl} &= \delta_{ij}\,\delta_{kl} \cr }$$ We also need to define two special matrix products. The double-dot product performs a double-contraction on the right-most pair of indices of the tensor on the left with the leftmost indices of the tensor to the right, i.e. $$\eqalign{ C &= A:B \cr C_{ijm} &= \sum_{j,k} A_{ijk}\,B_{jkm} \cr }$$ The Dyadic product forms a higher-order tensor without any contractions $$\eqalign{ C &= A\star B \cr C_{ijkmnp} &= A_{ijk}\,B_{mnp} \cr }$$ For completeness, juxtaposition of tensors implies a single-contraction (or dot product) of the tensors on the rightmost+leftmost index, i.e. $$\eqalign{ C &= AB \cr C_{ijmn} &= \sum_{k} A_{ijk}\,B_{kmn} \cr\cr }$$ With the notation out of they way, the function we want to differentiate can be expressed as $$\eqalign{ q &= Qy = WX^Ty \cr }$$ Find the differential $$\eqalign{ dq &= W\,dX^T\,y + dW\,X^Ty \cr &= W\,dX^T\,y - W\,dM\,W\,X^Ty \cr &= W\,dX^T\,y - W\,(dX^T\,X+X^T\,dX)\,q \cr &= W\,dX^T\,(y-Xq) - Q\,dX\,q \cr &= \Big(W\,{\mathcal C}\,z - Q\,{\mathcal A}\,q\Big):dX \cr &= \Big(W\star z - Q\,{\mathcal A}\,q\Big):dX \cr }$$ So the gradient is $$\eqalign{ G &= \frac{\partial q}{\partial X} &= W\star z - Q\,{\mathcal A}\,q \cr }$$ If you wish to find the derivative with respect to the scalar components of $X$, take the double-dot product with the corresponding single-entry matrix $$\eqalign{ \frac{\partial q}{\partial X_{ij}} &= G:E_{ij} \cr &= W\,E_{ij}^T\,z - Q\,E_{ij}\,q \cr &= W\,e_{j}e_{i}^T\,z - Q\,e_{i}e_{j}^T\,q \cr\cr }$$ Some interesting properties of the products of the isotropic tensors with matrices $(X,Y,Z)$ and vectors $(v,w)$ are $$\eqalign{ {\mathcal A}:X &= X:{\mathcal A} = X \cr {\mathcal B}:X &= X:{\mathcal B} = X^T \cr X\,{\mathcal C}\,Y &= X\star Y \cr X\,Y\,Z &= (X\,{\mathcal A}\,Z^T) : Y\cr (X\,{\mathcal A}\,w):{\mathcal B} &= X\,{\mathcal C}\,w = X\star w \cr\cr }$$
Update

Now you want to find the gradient of the gradient. This will be a 5th order tensor, which you will find is useless for computational (or any other) purposes. Nonetheless, let's start by finding the differential of the gradient $$\eqalign{ dG &= dW\,{\mathcal C}\,z + W\,{\mathcal C}\,dz - dQ\,{\mathcal A}\,q - Q\,{\mathcal A}\,dq \cr }$$ so it starts off simple enough, but soon expands into the complicated mess. Despite the mess, the hessian $H$, satisfies a very simple equation $$\eqalign{ dG &= H:dX \cr\cr }$$ The expression for the hessian is $$\eqalign{ H_a = ({\mathcal B}:(W{\mathcal A}Q^TX^T)){\mathcal A}q + Q^T{\mathcal A}Q{\mathcal A}q - ({\mathcal B}:(W{\mathcal A}Q^T)){\mathcal C}z - Q^T{\mathcal A}W{\mathcal C}z -({\mathcal B}:(W{\mathcal A})){\mathcal A}q \cr H_b = W{\mathcal C}Q^T{\mathcal C}Xq + W{\mathcal C}Q^TX{\mathcal A}q + Q{\mathcal A}W{\mathcal C}Xq + Q{\mathcal A}Q{\mathcal A}q - Q{\mathcal A}W{\mathcal C}y - W{\mathcal C}Q^T{\mathcal C}y - W{\mathcal C}{\mathcal A}q \cr\cr }$$ We need to add these two parts together, but the indices on the first part need to be cyclically shifted 2 places to the left. There is no standard notation for this so let's use the following $$\eqalign{ M_{ijklm}^{\leftarrow} = M_{klmij} }$$ Then the hessian is given by $$\eqalign{ H &= H_a^{\leftarrow} + H_b }$$