I want to find the Gradient and Hessian of the following function, $F(\mathbf{S}) = \frac{1}{2}\Vert \mathbf{M} - \mathbf{K_2SK_1^T}\Vert _F^2+\frac{1}{2}\Vert\mathbf{S}\Vert_F^2$. My try: using the trace formula for the Frobenius norm, $F(\mathbf{S})$ can be written as
$Grad(\mathbf{S}) =\frac{\partial}{\partial{S}}f(\mathbf{S})=\mathbf{K_2^T}(\mathbf{K_2SK_1^T}-\mathbf{M})\mathbf{K_1}+\mathbf{S}$
(cookbook eqns [70] & [84], updated according to greg's comments)
$Hess(\mathbf{S}) =\frac{\partial}{\partial{S}}Grad(\mathbf{S})=\mathbf{K_1^TK_1K_2^TK_2}+\mathbf{I}$
I got these results according to my understanding of the Matrix Cookbook, but obviously something is wrong. E.g. with $K_2\,(5000,256)$, $K_1\,(32,256)$, $M\,(5000,1)$, $S\,(256,256)$, the dimensions in $Grad(\mathbf{S})$ do not match for multiplication.
I'm not particularly sure whether I can apply the chain rule the way I did in the gradient step. Please help me get out of this confusion.
Define the trace/Frobenius product as $$A:B={\rm Tr}(A^TB)={\rm Tr}(AB^T)$$ which makes the Frobenius norm into a product $$\|X\|_F^2 = X:X$$ Next, define the matrix $$A=K_2SK_1^T-M$$ Write the cost function using the new notation and calculate its gradient. $$\eqalign{ F &= \tfrac{1}{2}S:S + \tfrac{1}{2}A:A \\ dF &= S:dS + A:dA \\ &= S:dS + A:K_2\,dS\,K_1^T \\ &= S:dS + K_2^TAK_1:dS \\ &= (S + K_2^TAK_1):dS \\ \frac{\partial F}{\partial S} &= (S + K_2^TAK_1) \;=\; G\quad\big({\rm the\,gradient}\big) \\ }$$ Flatten the various matrices into vectors. $$\eqalign{ s &= {\rm vec}(S) \\ a &= {\rm vec}(A) = (K_1\otimes K_2)s \\ }$$ Then vectorize $G$ and calculate its gradient (the Hessian). $$\eqalign{ g &= {\rm vec}(S + K_2^TAK_1) \\ &= s + (K_1^T\otimes K_2^T) a&\big({\rm the\,vectorized\,gradient}\big) \\ &= \Big(I + (K_1\otimes K_2)^T(K_1\otimes K_2)\Big)s \\ dg &= \Big(I + (K_1\otimes K_2)^T(K_1\otimes K_2)\Big)ds \\ \frac{\partial g}{\partial s} &= \Big(I + (K_1\otimes K_2)^T(K_1\otimes K_2)\Big) \;=\; H\quad&\big({\rm the\,Hessian}\big) \\ H&\in {\mathbb R}^{256^2\times 256^2} \\ }$$
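As a quick sanity check, here is a NumPy sketch (with small hypothetical dimensions in place of $5000$, $256$, $32$, so the Kronecker Hessian stays tiny) that verifies the gradient $G = S + K_2^TAK_1$ against finite differences, and the flattened Hessian $H = I + (K_1\otimes K_2)^T(K_1\otimes K_2)$ against the matrix-form differential. Note that ${\rm vec}$ is column-stacking, hence `order='F'`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, q = 7, 5, 4, 3           # small stand-ins for 5000, 256, 32, 256
K2 = rng.standard_normal((m, n))  # K_2
K1 = rng.standard_normal((p, q))  # K_1
M = rng.standard_normal((m, p))   # M must match K_2 S K_1^T, i.e. m x p
S = rng.standard_normal((n, q))

def F(S):
    A = K2 @ S @ K1.T - M
    return 0.5 * np.sum(A**2) + 0.5 * np.sum(S**2)

# Gradient from the derivation: G = S + K_2^T A K_1
A = K2 @ S @ K1.T - M
G = S + K2.T @ A @ K1

# Central finite-difference check of one arbitrary entry of G
eps = 1e-6
E = np.zeros_like(S)
E[1, 2] = 1.0
fd = (F(S + eps * E) - F(S - eps * E)) / (2 * eps)
assert np.isclose(fd, G[1, 2], atol=1e-5)

# Flattened Hessian: H = I + (K_1 kron K_2)^T (K_1 kron K_2)
K = np.kron(K1, K2)
H = np.eye(n * q) + K.T @ K

# H maps vec(dS) to dG = dS + K_2^T K_2 dS K_1^T K_1 (column-major vec)
dS = rng.standard_normal((n, q))
dG = dS + K2.T @ K2 @ dS @ K1.T @ K1
assert np.allclose(H @ dS.ravel(order='F'), dG.ravel(order='F'))
```

With the small stand-in shapes, $M$ has to be $m\times p$ for $K_2SK_1^T-M$ to be well defined, which is the same dimension issue flagged in the question.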
Update
You can also calculate the Hessian in its un-flattened tensor form $$\eqalign{ dG &= dS + K_2^TdA\,K_1 \\ &= dS + K_2^TK_2\,dS\,K_1^TK_1 \\ &= \Big({\Omega} + K_2^TK_2{\Omega}K_1^TK_1\Big):dS \\ \frac{\partial G}{\partial S} &= \Big({\Omega} + K_2^TK_2{\Omega}K_1^TK_1\Big) \;=\; {\cal H} \\ {\cal H}&\in {\mathbb R}^{256\times 256\times 256\times 256} \\ }$$ where ${\Omega}$ is the fourth-order tensor which acts as the identity for the Frobenius product $$\eqalign{ {\Omega}:S = S:{\Omega} = S \\ }$$ and whose components can be expressed in terms of Kronecker delta symbols $$\eqalign{ {\Omega}_{ijk\ell} = \delta_{ik}\delta_{j\ell} }$$ The action of the tensor Hessian on a matrix $B$ is calculated as $$\eqalign{ {\cal H}:B &= ({\Omega} + K_2^TK_2{\Omega}K_1^TK_1):B \\ &= B + K_2^TK_2BK_1^TK_1 \\ \\ }$$ NB: $\;$ The elements of the tensor ${\cal H}$ are exactly the same as those of the matrix $H$; they're just arranged differently. Neither solves your memory issue, since they're both equally huge. The factored form in the comments does address the storage issue: each of the three Kronecker factors is a $256\times 256$ matrix.
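The action ${\cal H}:B = B + K_2^TK_2BK_1^TK_1$ never requires forming the huge $H$ explicitly. A small NumPy sketch (hypothetical small dimensions again) confirming that this matrix-free action agrees with the flattened $H$ applied to ${\rm vec}(B)$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p, q = 6, 4, 3, 5
K2 = rng.standard_normal((m, n))
K1 = rng.standard_normal((p, q))
B = rng.standard_normal((n, q))

# Matrix-free action of the tensor Hessian: H:B = B + K_2^T K_2 B K_1^T K_1
HB = B + K2.T @ K2 @ B @ K1.T @ K1

# Flattened Hessian acting on vec(B) (column-major vec)
K = np.kron(K1, K2)
H = np.eye(n * q) + K.T @ K
assert np.allclose(H @ B.ravel(order='F'), HB.ravel(order='F'))
```

The matrix-free form only ever stores and multiplies the small Gram factors $K_2^TK_2$ and $K_1^TK_1$, which is exactly the storage advantage of the factored form mentioned above.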