Gradient of norm of least square solution

148 Views Asked by At

It is easy to show that the solution to a least square problem is

$$\vec{w} = (X^TX)^{-1}X^T \vec{y}$$

In my case, the entries of matrix $X$ are filled from left to write with an added bias, meaning

\begin{bmatrix}x_{1,1}& \dots&x_{1,n}&1\\\vdots & \ddots & \vdots & \vdots\\x_{m,1}&\dots&x_{m,n}&1\end{bmatrix}

I would now like to take the gradient of the norm of $\vec{w}$ with respect to all $x_{i,j}$ going from $x_{1,1},...,x_{1,n},...x_{m,n}$. So

$$\nabla_x |\vec{w}| = \nabla_x |(X^TX)^{-1}X^T \vec{y}|$$

I have difficulties calculating this derivative. Was this done before or does anybody have a few tips how to calculate this?

Thanks in advance.

2

There are 2 best solutions below

2
On

You did not specify the norm, so I give an answer for the arbitrary scalar function $f:\mathbb{C}^{n}\rightarrow \mathbb{R}$:

$$ \frac{\partial f\left(\left(X^TX\right)^{-1}X^{T}y\right)}{\partial x_{ij}}= f^{'}\left(\left(X^TX\right)^{-1}X^{T}y\right)^T\frac{\partial }{\partial x_{ij}} \left(X^TX\right)^{-1}X^{T}y.$$

We are left with two terms

$$\left(\frac{\partial}{\partial x_{ij}} \left(X^TX\right)^{-1}\right)X^{T}y + \left(X^TX\right)^{-1}\left(\frac{\partial}{\partial x_{ij}} X^{T}\right)y.$$

The second one is trivial. The first one is somewhat harder. To find it, you can use that for arbitrary (invertible) matrix $A$ the following is true

$$ \frac{\partial}{\partial a_{ij}} \left(A^{-1} A\right) = 0, $$

hence

$$ \left[\frac{\partial}{\partial a_{ij}} \left(A^{-1}\right)\right]_{kl} = - \left[A^{-1}\right]_{ki}\left[A^{-1}\right]_{jl}. $$

0
On

If you're using the Frobenius norm, then you want the gradient of the scalar function $\phi$ which satisfies $$\eqalign{ \phi^2 &= \|w\|_F^2 = w:w \\ }$$ where the colon is a convenient product notation for the trace, i.e. $\,A:B = {\rm Tr}(A^TB)$

The $w$-vector is defined as $$\eqalign{ Q &= X^TX &\implies dQ = \big(X^TdX + dX^TX\big) \triangleq {\rm Sym}(X^TdX) \\ &&\implies dQ^{-1} = -Q^{-1}\,dQ\,Q^{-1} \\ w &= Q^{-1}X^Ty \quad&\implies dw = (Q^{-1}dX^T + dQ^{-1}X^T)y \\ }$$ Calculate the differential of $\phi,\,$ then its gradient. $$\eqalign{ 2\phi\,d\phi &= 2w:dw \\ d\phi &= \frac{1}{\phi} w:(Q^{-1}dX^T + dQ^{-1}X^T)y \\ &= \frac{1}{\phi} wy^T:\Big(Q^{-1}dX^T - Q^{-1}\,dQ\,Q^{-1}X^T\Big) \\ &= \frac{1}{\phi} Q^{-1}wy^T:dX^T - \frac{1}{\phi} Q^{-1}wy^TXQ^{-1}:dQ \\ &= \frac{1}{\phi} Q^{-1}wy^T:dX^T - \frac{1}{\phi} Q^{-1}ww^T:{\rm Sym}(X^TdX) \\ &= \frac{1}{\phi} yw^TQ^{-1}:dX - \frac{1}{\phi} \,{\rm Sym}(Q^{-1}ww^T):X^TdX \\ &= \frac{1}{\phi} \Big(yw^TQ^{-1} -XQ^{-1}ww^T -Xww^TQ^{-1}\Big):dX \\ \frac{\partial \phi}{\partial X} &= \frac{1}{\phi} \Big(yw^TQ^{-1} -XQ^{-1}ww^T -Xww^TQ^{-1}\Big) \\ }$$ Properties of the trace allow terms in a colon product to be rearranged in lots of different ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ A:BC &= AC^T:B \\ A:BC &= B^TA:C \\ A:{\rm Sym}(B) &= {\rm Sym}(A):B \\ }$$