I'm not very familiar with matrix derivatives and was wondering what the first two derivatives of the map $$ X\mapsto (X^TX + \lambda I)^{-1}X^Ty $$ should be, where $y$ is a fixed vector and $X$ is a matrix.
Matrix Derivative of Tikhonov Regularization Operator
282 Views. Asked by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail). There are 2 answers below.
Let's define some matrix variables
$$\eqalign{
M &= M^T = X^TX + \lambda I \cr
W &= W^T = M^{-1} \cr
Q &= WX^T \cr
}$$
some vector variables
$$\eqalign{
q &= Qy \cr
z &= y - Xq \cr
}$$
and three isotropic 4th-order tensors $({\mathcal A, B, C})$ with components
$$\eqalign{
{\mathcal A}_{ijkl} &= \delta_{ik}\,\delta_{jl} \cr
{\mathcal B}_{ijkl} &= \delta_{il}\,\delta_{jk} \cr
{\mathcal C}_{ijkl} &= \delta_{ij}\,\delta_{kl} \cr
}$$
We also need to define two special matrix products. The double-dot product performs a double contraction of the rightmost pair of indices of the tensor on the left with the leftmost pair of indices of the tensor on the right, i.e.
$$\eqalign{
C &= A:B \cr
C_{im} &= \sum_{j,k} A_{ijk}\,B_{jkm} \cr
}$$
The dyadic product forms a higher-order tensor without any contractions
$$\eqalign{
C &= A\star B \cr
C_{ijkmnp} &= A_{ijk}\,B_{mnp} \cr
}$$
For completeness, juxtaposition of tensors implies a single contraction (dot product) of the rightmost index of the left tensor with the leftmost index of the right tensor, i.e.
$$\eqalign{
C &= AB \cr
C_{ijmn} &= \sum_{k} A_{ijk}\,B_{kmn} \cr\cr
}$$
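These three products are just index contractions, so they can be spelled out with `numpy.einsum` (a sketch; the tensor shapes below are arbitrary illustrative choices, not from the post):

```python
import numpy as np

# The three products above, written as einsum contractions.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))
B = rng.standard_normal((3, 4, 5))   # for the double-dot A:B
D = rng.standard_normal((4, 5, 6))   # for the single contraction AD

ddot = np.einsum('ijk,jkm->im', A, B)       # A:B,  C_im = sum_jk A_ijk B_jkm
dyad = np.einsum('ijk,mnp->ijkmnp', A, B)   # dyadic A*B, no contraction
dot  = np.einsum('ijk,kmn->ijmn', A, D)     # AD,   C_ijmn = sum_k A_ijk D_kmn
print(ddot.shape, dyad.shape, dot.shape)    # (2, 5) (2, 3, 4, 3, 4, 5) (2, 3, 5, 6)
```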
With the notation out of the way, the function we want to differentiate can be expressed as
$$\eqalign{
q &= Qy = WX^Ty \cr
}$$
Find the differential
$$\eqalign{
dq
&= W\,dX^T\,y + dW\,X^Ty \cr
&= W\,dX^T\,y - W\,dM\,W\,X^Ty \cr
&= W\,dX^T\,y - W\,(dX^T\,X+X^T\,dX)\,q \cr
&= W\,dX^T\,(y-Xq) - Q\,dX\,q \cr
&= \Big(W\,{\mathcal C}\,z - Q\,{\mathcal A}\,q\Big):dX \cr
&= \Big(W\star z - Q\,{\mathcal A}\,q\Big):dX \cr
}$$
So the gradient is
$$\eqalign{
G &= \frac{\partial q}{\partial X} = W\star z - Q\,{\mathcal A}\,q \cr
}$$
If you wish to find the derivative with respect to the scalar components of $X$, take the double-dot product with the corresponding single-entry matrix
$$\eqalign{
\frac{\partial q}{\partial X_{ij}}
&= G:E_{ij} \cr
&= W\,E_{ij}^T\,z - Q\,E_{ij}\,q \cr
&= W\,e_{j}e_{i}^T\,z - Q\,e_{i}e_{j}^T\,q \cr\cr
}$$
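The component formula in the last display is easy to sanity-check against a central finite difference (a NumPy sketch; the sizes $n,p$, the value of $\lambda$, and the random data are arbitrary choices, not from the post):

```python
import numpy as np

# Finite-difference check of  dq/dX_ij = W e_j e_i^T z - Q e_i e_j^T q.
rng = np.random.default_rng(0)
n, p, lam = 5, 3, 0.1
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def q_of(X):
    # q = (X^T X + lam I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

W = np.linalg.inv(X.T @ X + lam * np.eye(p))
Q = W @ X.T
q = Q @ y
z = y - X @ q

i, j, h = 2, 1, 1e-6
analytic = W[:, j] * z[i] - Q[:, i] * q[j]   # W e_j (e_i^T z) - Q e_i (e_j^T q)
dX = np.zeros((n, p)); dX[i, j] = h
numeric = (q_of(X + dX) - q_of(X - dX)) / (2 * h)
print(np.max(np.abs(analytic - numeric)))    # ~0 up to finite-difference error
```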
Some interesting properties of the products of the isotropic tensors with matrices $(X,Y,Z)$ and vectors $(v,w)$ are
$$\eqalign{
{\mathcal A}:X &= X:{\mathcal A} = X \cr
{\mathcal B}:X &= X:{\mathcal B} = X^T \cr
X\,{\mathcal C}\,Y &= X\star Y \cr
X\,Y\,Z &= (X\,{\mathcal A}\,Z^T) : Y\cr
(X\,{\mathcal A}\,w):{\mathcal B} &= X\,{\mathcal C}\,w = X\star w \cr\cr
}$$
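The identities above can also be verified numerically by building $\mathcal{A},\mathcal{B},\mathcal{C}$ explicitly (a sketch; $n$ and the random data are arbitrary choices):

```python
import numpy as np

# Numeric check of the isotropic-tensor identities above.
n = 3
I = np.eye(n)
A = np.einsum('ik,jl->ijkl', I, I)   # A_ijkl = d_ik d_jl
B = np.einsum('il,jk->ijkl', I, I)   # B_ijkl = d_il d_jk
C = np.einsum('ij,kl->ijkl', I, I)   # C_ijkl = d_ij d_kl

rng = np.random.default_rng(0)
X = rng.standard_normal((n, n))
w = rng.standard_normal(n)

AX  = np.einsum('ijkl,kl->ij', A, X)         # A:X
BX  = np.einsum('ijkl,kl->ij', B, X)         # B:X
XCw = np.einsum('ia,abkl,l->ibk', X, C, w)   # X C w (single contractions)
Xw  = np.einsum('ij,k->ijk', X, w)           # X * w (dyadic)
print(np.allclose(AX, X), np.allclose(BX, X.T), np.allclose(XCw, Xw))
# prints: True True True
```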
Update
Now you want to find the gradient of the gradient. This will be a 5th order tensor, which you will find is useless for computational (or any other) purposes. Nonetheless, let's start by finding the differential of the gradient
$$\eqalign{
dG &= dW\,{\mathcal C}\,z + W\,{\mathcal C}\,dz - dQ\,{\mathcal A}\,q - Q\,{\mathcal A}\,dq \cr
}$$
so it starts off simple enough, but soon expands into a complicated mess. Despite the mess, the Hessian $H$ satisfies a very simple equation
$$\eqalign{
dG &= H:dX \cr\cr
}$$
The expression for the Hessian is
$$\eqalign{
H_a = ({\mathcal B}:(W{\mathcal A}Q^TX^T)){\mathcal A}q
+ Q^T{\mathcal A}Q{\mathcal A}q
- ({\mathcal B}:(W{\mathcal A}Q^T)){\mathcal C}z
- Q^T{\mathcal A}W{\mathcal C}z
-({\mathcal B}:(W{\mathcal A})){\mathcal A}q
\cr
H_b = W{\mathcal C}Q^T{\mathcal C}Xq
+ W{\mathcal C}Q^TX{\mathcal A}q
+ Q{\mathcal A}W{\mathcal C}Xq
+ Q{\mathcal A}Q{\mathcal A}q
- Q{\mathcal A}W{\mathcal C}y
- W{\mathcal C}Q^T{\mathcal C}y
- W{\mathcal C}{\mathcal A}q
\cr\cr
}$$
We need to add these two parts together, but the indices of the first part need to be cyclically shifted two places to the left. There is no standard notation for this, so let's use the following
$$\eqalign{
M_{ijklm}^{\leftarrow} = M_{klmij}
}$$
Then the Hessian is given by
$$\eqalign{
H &= H_a^{\leftarrow} + H_b
}$$
Let's say $X$ is an $n\times p$ matrix. Then the map $X\mapsto (X^T X+\lambda I)^{-1}X^T y$ has a $p$-dimensional image and an $n\times p$-dimensional domain, which makes the derivatives quite complicated (the first derivative is a $p\times n\times p$ tensor).
Here I derive the partial derivative of $(X^TX +\lambda I)^{-1} X^T y$ with respect to $X_{ij}$, where $i\in[n]$ and $j\in[p]$. This partial derivative is a $p$-dimensional vector. You can assemble the $n\times p$ partial derivatives to get the full answer to your question. The derivation relies heavily on the matrix cookbook.
First, using Eq. (59) in the matrix cookbook, we have $$ \frac{\partial (X^T X+\lambda I)^{-1}}{\partial X_{ij}} = -(X^TX+\lambda I)^{-1}\frac{\partial(X^T X+\lambda I)}{\partial X_{ij}}(X^TX + \lambda I)^{-1}. $$ Using Eq. (80) in the matrix cookbook, we have $$ \frac{\partial(X^T X+\lambda I)}{\partial X_{ij}} = \frac{\partial X^T X}{\partial X_{ij}} = X^T J^{ij} + (J^{ij})^TX, $$ where $J^{ij}$ is an $n\times p$ matrix defined as $J^{ij}_{k\ell} = \delta_{ik}\delta_{j\ell}$, with $\delta$ the Kronecker delta.
In addition, directly from the definition we have that $$ \frac{\partial X^T y}{\partial X_{ij}} = y_ie_j, $$ where $e_j$ is a $p$-dimensional vector whose $j$th element is one and all other elements are zero.
We are now ready to give the partial derivative of $(X^T X+\lambda I)^{-1}X^T y$ with respect to $X_{ij}$ using the product rule: \begin{equation} \begin{aligned} & \frac{\partial(X^TX+\lambda I)^{-1}X^T y}{\partial X_{ij}}\\ = & \frac{\partial (X^TX+\lambda I)^{-1}}{\partial X_{ij}}X^T y + (X^T X+\lambda I)^{-1}\frac{\partial X^T y}{\partial X_{ij}}\\ =& -(X^T X+\lambda I)^{-1}(X^TJ^{ij}+(J^{ij})^TX)(X^TX+\lambda I)^{-1}X^T y + (X^T X+\lambda I)^{-1}(y_i\cdot e_j)\\ =&-(X^T X+\lambda I)^{-1}\left[(X^TJ^{ij}+(J^{ij})^TX)(X^TX+\lambda I)^{-1}X^T y - (y_i\cdot e_j)\right] . \end{aligned} \end{equation}
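This partial-derivative formula can be checked numerically (a NumPy sketch; $n$, $p$, $\lambda$, and the random data are arbitrary choices, not from the post):

```python
import numpy as np

# Finite-difference check of
#   d[(X^T X + lam I)^{-1} X^T y]/dX_ij
#     = -W [ (X^T J^ij + (J^ij)^T X) W X^T y - y_i e_j ],  W = (X^T X + lam I)^{-1}
rng = np.random.default_rng(1)
n, p, lam = 4, 3, 0.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
Ip = np.eye(p)

def q_of(X):
    return np.linalg.solve(X.T @ X + lam * Ip, X.T @ y)

def single_entry(i, j):
    Jij = np.zeros((n, p)); Jij[i, j] = 1.0   # the matrix J^{ij}
    return Jij

i, j, h = 1, 2, 1e-6
W = np.linalg.inv(X.T @ X + lam * Ip)
A = X.T @ single_entry(i, j) + single_entry(i, j).T @ X
analytic = -W @ (A @ W @ X.T @ y - y[i] * Ip[:, j])

dX = np.zeros((n, p)); dX[i, j] = h
numeric = (q_of(X + dX) - q_of(X - dX)) / (2 * h)
print(np.max(np.abs(analytic - numeric)))   # ~0 up to finite-difference error
```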
To obtain the second-order derivative (i.e., the Hessian), we can reuse terms from the first-order expression. Define $\phi_{ij} := \partial(X^T X+\lambda I)^{-1}/\partial X_{ij}$, which is a $p\times p$ matrix, and $\psi_{ij} := \partial\left[(X^T X+\lambda I)^{-1}X^T y\right]/\partial X_{ij}$, which is a $p$-dimensional vector. By the product rule, we have that \begin{align} &\frac{\partial^2(X^T X+\lambda I)^{-1}X^T y}{\partial X_{ij}\partial X_{k\ell}}\\ &= -\phi_{k\ell}^T\left[(X^TJ^{ij}+(J^{ij})^TX)(X^TX+\lambda I)^{-1}X^T y - (y_i\cdot e_j)\right]\\ &- (X^T X+\lambda I)^{-1}\left[((J^{k\ell})^TJ^{ij} + (J^{ij})^TJ^{k\ell})(X^T X+\lambda I)^{-1}X^T y + (X^TJ^{ij}+(J^{ij})^T X)\psi_{k\ell}\right]. \end{align}
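The second-order formula can be checked the same way, by differencing the first-order partial derivative $\psi$ with respect to $X_{k\ell}$ (again a sketch with arbitrary sizes and random data; $\phi_{k\ell} = -W(X^TJ^{k\ell}+(J^{k\ell})^TX)W$ follows from the first display above):

```python
import numpy as np

# Check the second-order partial derivative of (X^T X + lam I)^{-1} X^T y
# w.r.t. X_ij and X_kl by differencing the first-order formula psi.
rng = np.random.default_rng(2)
n, p, lam = 4, 3, 0.5
X0 = rng.standard_normal((n, p))
y = rng.standard_normal(n)
Ip = np.eye(p)

def single_entry(i, j):
    Jij = np.zeros((n, p)); Jij[i, j] = 1.0   # the matrix J^{ij}
    return Jij

def psi(X, i, j):
    # first-order partial derivative psi_ij
    W = np.linalg.inv(X.T @ X + lam * Ip)
    A = X.T @ single_entry(i, j) + single_entry(i, j).T @ X
    return -W @ (A @ W @ X.T @ y - y[i] * Ip[:, j])

i, j, k, l, h = 0, 1, 2, 2, 1e-6
W = np.linalg.inv(X0.T @ X0 + lam * Ip)
Jij, Jkl = single_entry(i, j), single_entry(k, l)
Aij = X0.T @ Jij + Jij.T @ X0
Akl = X0.T @ Jkl + Jkl.T @ X0
phi_kl = -W @ Akl @ W                       # dW/dX_kl
q = W @ X0.T @ y

analytic = (-phi_kl @ (Aij @ q - y[i] * Ip[:, j])
            - W @ ((Jkl.T @ Jij + Jij.T @ Jkl) @ q + Aij @ psi(X0, k, l)))

dX = np.zeros((n, p)); dX[k, l] = h
numeric = (psi(X0 + dX, i, j) - psi(X0 - dX, i, j)) / (2 * h)
print(np.max(np.abs(analytic - numeric)))   # ~0 up to finite-difference error
```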