Gradient and Hessian of Matrix Least Squares with Squared Frobenius Norm Regularization


I want to find the gradient and Hessian of the following function, $F(\mathbf{S}) = \frac{1}{2}\Vert \mathbf{M} - \mathbf{K_2SK_1^T}\Vert _F^2+\frac{1}{2}\Vert\mathbf{S}\Vert_F^2$. My attempt: using the trace formula for the Frobenius norm, I differentiate $F(\mathbf{S})$ and obtain

$Grad(\mathbf{S}) =\frac{\partial}{\partial{S}}f(\mathbf{S})=\mathbf{K_2^T}(\mathbf{K_2SK_1^T}-\mathbf{M})\mathbf{K_1}+\mathbf{S}$
(Matrix Cookbook eqns [70] & [84], updated according to greg's comments)

$Hess(\mathbf{S}) =\frac{\partial}{\partial{S}}Grad(\mathbf{S})=\mathbf{K_1^TK_1K_2^TK_2}+\mathbf{I}$

I got these results from my understanding of the Matrix Cookbook, but they are obviously wrong. For example, with $K_2 \in \mathbb{R}^{5000\times 256}$, $K_1 \in \mathbb{R}^{32\times 256}$, $M \in \mathbb{R}^{5000\times 1}$, and $S \in \mathbb{R}^{256\times 256}$, the dimensions in $Grad(\mathbf{S})$ do not match for multiplication.

I'm not particularly sure whether I can apply the chain rule in the gradient step the way I did. Please help me get out of this confusion.

Best Answer

Define the trace/Frobenius product as
$$A:B={\rm Tr}(A^TB)={\rm Tr}(AB^T)$$
which turns the squared Frobenius norm into a product
$$\|X\|_F^2 = X:X$$
Next, define the matrix
$$A=K_2SK_1^T-M$$

Write the cost function in the new notation and calculate its gradient.
$$\eqalign{
F &= \tfrac{1}{2}S:S + \tfrac{1}{2}A:A \\
dF &= S:dS + A:dA \\
&= S:dS + A:K_2\,dS\,K_1^T \\
&= S:dS + K_2^TAK_1:dS \\
&= (S + K_2^TAK_1):dS \\
\frac{\partial F}{\partial S} &= S + K_2^TAK_1 \;=\; G\quad\big({\rm the\,gradient}\big) \\
}$$

Flatten the various matrices into vectors.
$$\eqalign{
s &= {\rm vec}(S) \\
a &= {\rm vec}(A) = (K_1\otimes K_2)\,s - {\rm vec}(M) \\
}$$

Then vectorize $G$ and calculate its gradient (the Hessian). Since ${\rm vec}(M)$ is constant, it drops out of the differential.
$$\eqalign{
g &= {\rm vec}(S + K_2^TAK_1) \\
&= s + (K_1\otimes K_2)^T a \quad&\big({\rm the\,vectorized\,gradient}\big) \\
dg &= \Big(I + (K_1\otimes K_2)^T(K_1\otimes K_2)\Big)ds \\
\frac{\partial g}{\partial s} &= \Big(I + (K_1\otimes K_2)^T(K_1\otimes K_2)\Big) \;=\; H\quad&\big({\rm the\,Hessian}\big) \\
H&\in {\mathbb R}^{256^2\times 256^2} \\
}$$
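As a numerical sanity check, here is a small NumPy sketch (illustrative sizes, not the 5000/256/32 dimensions from the question) that verifies the gradient formula against a central finite difference and the Kronecker-product Hessian against the matrix form of $dG$:

```python
import numpy as np

# Small illustrative dimensions so the Kronecker Hessian fits in memory.
rng = np.random.default_rng(0)
m, n, p, q = 7, 5, 4, 3           # K2: m x n, S: n x q, K1: p x q, M: m x p
K2 = rng.standard_normal((m, n))
K1 = rng.standard_normal((p, q))
M  = rng.standard_normal((m, p))
S  = rng.standard_normal((n, q))

def F(S):
    A = K2 @ S @ K1.T - M
    return 0.5 * np.sum(A**2) + 0.5 * np.sum(S**2)

# Gradient from the derivation: G = S + K2^T (K2 S K1^T - M) K1
G = S + K2.T @ (K2 @ S @ K1.T - M) @ K1

# Central finite-difference check of one gradient entry
eps = 1e-6
E = np.zeros_like(S); E[1, 2] = 1.0
fd = (F(S + eps * E) - F(S - eps * E)) / (2 * eps)
assert abs(fd - G[1, 2]) < 1e-5

# Flattened Hessian: H = I + (K1 (x) K2)^T (K1 (x) K2)
# (column-major vec, so vec(K2 S K1^T) = (K1 (x) K2) vec(S))
K = np.kron(K1, K2)
H = np.eye(n * q) + K.T @ K

# H maps vec(dS) to vec(dG), matching dG = dS + K2^T K2 dS K1^T K1
dS = rng.standard_normal((n, q))
dG = dS + K2.T @ K2 @ dS @ K1.T @ K1
assert np.allclose(H @ dS.flatten(order='F'), dG.flatten(order='F'))
```

Note that `flatten(order='F')` is used throughout so that `np.kron(K1, K2)` matches the column-stacking ${\rm vec}$ convention used in the derivation.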

Update

You can also calculate the Hessian in its un-flattened tensor form
$$\eqalign{
dG &= dS + K_2^T dA\,K_1 \\
&= dS + K_2^TK_2\,dS\,K_1^TK_1 \\
&= \Big({\Omega} + K_2^TK_2{\Omega}K_1^TK_1\Big):dS \\
\frac{\partial G}{\partial S} &= \Big({\Omega} + K_2^TK_2{\Omega}K_1^TK_1\Big) \;=\; {\cal H} \\
{\cal H}&\in {\mathbb R}^{256\times 256\times 256\times 256} \\
}$$
where ${\Omega}$ is the fourth-order tensor which acts as the identity for the Frobenius product
$$ {\Omega}:S = S:{\Omega} = S $$
and whose components can be expressed in terms of Kronecker delta symbols
$$ {\Omega}_{ijk\ell} = \delta_{ik}\delta_{j\ell} $$
The action of the tensor Hessian on a matrix $B$ is calculated as
$$\eqalign{
{\cal H}:B &= ({\Omega} + K_2^TK_2{\Omega}K_1^TK_1):B \\
&= B + K_2^TK_2BK_1^TK_1 \\
}$$

NB: $\;$ The elements of the tensor ${\cal H}$ are exactly the same as those of the matrix $H$; they are just arranged differently. Neither form solves your memory issue, since both are equally huge.
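The practical payoff of the un-flattened form is that ${\cal H}:B = B + K_2^TK_2BK_1^TK_1$ can be applied to a matrix without ever materializing the huge flattened Hessian. A quick NumPy sketch (small illustrative sizes) confirming that the matrix-free action agrees with the Kronecker form:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 6, 4
K2 = rng.standard_normal((9, n))
K1 = rng.standard_normal((5, q))
B  = rng.standard_normal((n, q))

def hess_apply(B):
    # H:B = B + K2^T K2 B K1^T K1, using only small Gram products
    return B + K2.T @ (K2 @ B) @ (K1.T @ K1)

# Compare against the flattened form (I + (K1 (x) K2)^T (K1 (x) K2)) vec(B)
K = np.kron(K1, K2)
flat = (np.eye(n * q) + K.T @ K) @ B.flatten(order='F')
assert np.allclose(hess_apply(B).flatten(order='F'), flat)
```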

The factored form in the comments does address the storage issue. Each of the three Kronecker factors is a $256\times 256$ matrix.
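Since the comments referenced are not reproduced here, the following is only one possible sketch of how the factored form is used in practice (NumPy, small illustrative sizes; the CG loop and variable names are mine, not from the thread). Setting the gradient to zero gives the stationarity condition $\big(I + K_1^TK_1\otimes K_2^TK_2\big)\,{\rm vec}(S) = {\rm vec}(K_2^TMK_1)$, which conjugate gradient can solve storing only the small Gram factors:

```python
import numpy as np

rng = np.random.default_rng(2)
# K2 is scaled down so the system is well-conditioned for plain CG.
K2 = rng.standard_normal((50, 8)) * 0.1
K1 = rng.standard_normal((12, 8))
M  = rng.standard_normal((50, 12))

G1 = K1.T @ K1            # small Kronecker factor (8 x 8)
G2 = K2.T @ K2            # small Kronecker factor (8 x 8)
B  = K2.T @ M @ K1        # right-hand side in matrix form

def hess(S):
    # (I + G1 (x) G2) applied to vec(S), kept un-flattened
    return S + G2 @ S @ G1

# Conjugate gradient on matrices, with the Frobenius inner product
S = np.zeros_like(B)
R = B - hess(S)
P = R.copy()
rs = np.sum(R * R)
for _ in range(200):
    HP = hess(P)
    alpha = rs / np.sum(P * HP)
    S += alpha * P
    R -= alpha * HP
    rs_new = np.sum(R * R)
    if rs_new < 1e-20:
        break
    P = R + (rs_new / rs) * P
    rs = rs_new

# At the minimizer the gradient of F should vanish
grad = S + K2.T @ (K2 @ S @ K1.T - M) @ K1
assert np.linalg.norm(grad) < 1e-6
```

The Hessian is never formed: each CG iteration costs two small matrix products instead of touching a $256^2\times 256^2$ array.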