What is the Hessian of the Frobenius norm?


As we know, every norm is convex, and if a function is convex in its input variable, then the corresponding Hessian should be positive semidefinite. But when I try to find the Hessian of the squared Frobenius norm, I can't obtain the expected result. I compute it as follows.

I first compute the first-order derivative: $\frac{\partial \|X\|_F^2}{\partial X} = 2X$. Then I refer to the "Other matrix derivatives" section of the Wikipedia article on matrix calculus, http://en.wikipedia.org/wiki/Matrix_calculus, to compute $\frac{\partial\, 2X}{\partial X}$. However, it seems that if I use that formula, I won't get the desired result.

Hope someone could tell me what is wrong with my derivation. Thanks very much.


Accepted answer:

When working with the Frobenius norm, we can stack the columns of $X$ into a single vector $\widehat X\in \mathbb R^{n^2}$ and take the norm of that. This simplifies the computation: the gradient is $2\widehat X$ and the Hessian is $2I$ where $I$ is the identity matrix acting on $\mathbb R^{n^2}$; i.e., the identity matrix of size $n^2\times n^2$.
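As a numerical sanity check (a sketch added for illustration, not part of the original answer), one can approximate the Hessian of $f(X)=\|X\|_F^2$ in the stacked coordinates by central second differences and compare it with $2I_{n^2}$:

```python
import numpy as np

# Verify numerically that the Hessian of f(X) = ||X||_F^2, viewed as a
# function of the stacked vector v = vec(X) in R^{n^2}, is 2*I.
n = 3
f = lambda v: np.sum(v**2)  # ||X||_F^2 after stacking the columns into v

v0 = np.random.default_rng(0).standard_normal(n * n)
eps = 1e-5
H = np.empty((n * n, n * n))
for i in range(n * n):
    for j in range(n * n):
        # central second difference for d^2 f / (dv_i dv_j)
        e_i = np.eye(n * n)[i] * eps
        e_j = np.eye(n * n)[j] * eps
        H[i, j] = (f(v0 + e_i + e_j) - f(v0 + e_i - e_j)
                   - f(v0 - e_i + e_j) + f(v0 - e_i - e_j)) / (4 * eps**2)

print(np.allclose(H, 2 * np.eye(n * n), atol=1e-4))  # True
```

Since $f$ is quadratic, the second differences are exact up to floating-point error, so the numerical Hessian matches $2I_{n^2}$ closely.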

If you insist on matrix notation for $X$, then notice that the derivative of $X\mapsto 2X$ with respect to $X$ does not naturally fit in a matrix form; matrix-by-matrix derivatives are absent from the table in the wiki article you mentioned.

Notice that we could also talk about the derivative of a vector with respect to a matrix, or any of the other unfilled cells in that table. However, these derivatives are most naturally organized in a tensor of rank higher than 2, so they do not fit neatly into a matrix.

You would need a tensor of order 4 to express the Hessian in this case.

Another answer:

More generally, consider $n\times n$ real matrices $A,X$ and the function $f:X\mapsto \|AX\|_F^2=\operatorname{trace}(AXX^TA^T)$. Its derivative is $D_Xf:H\mapsto 2\operatorname{trace}(X^TA^TAH)$, and its Hessian is $\operatorname{Hess}_X(f):(H,K)\mapsto 2\operatorname{trace}(K^TA^TAH)$, a symmetric bilinear form that does not depend on $X$. Let $(E_{i,j})$ be the canonical basis of $M_n$. Then $\operatorname{Hess}_X(f)(E_{i,j},E_{k,l})=0$ if $j\neq l$, and $\operatorname{Hess}_X(f)(E_{i,j},E_{k,j})=2u_{i,k}$, where $u_{i,k}$ is the $(i,k)$ entry of $A^TA$.

Let $vec(U)$ denote the vectorization of an $n\times n$ matrix $U$, formed by stacking the rows (resp. the columns) of $U$ into a single column vector. Then the $n^2\times n^2$ symmetric matrix associated with $\operatorname{Hess}_X(f)$ is $2A^TA\otimes I_n$ (resp. $2I_n\otimes A^TA$); cf.:

http://en.wikipedia.org/wiki/Kronecker_product

The eigenvalues $(\lambda_i)_i$ of $A^TA$ are non-negative. The eigenvalues of $2A^TA\otimes I_n$ and of $2I_n\otimes A^TA$ are the $(2\lambda_i)_i$, each with multiplicity $n$; consequently, $\operatorname{Hess}_X(f)$ is symmetric positive semidefinite, as expected. In particular, if $A=I_n$, then $\operatorname{Hess}_X(f)$ is associated with the matrix $2I_n\otimes I_n=2I_{n^2}$, as 127 wrote above.
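As a numerical check of the Kronecker-product formula (a sketch added for illustration, not from the original answer), one can approximate the Hessian of $f(X)=\|AX\|_F^2$ by finite differences in the column-stacked coordinates and compare it with $2I_n\otimes A^TA$:

```python
import numpy as np

# Check that, with column-stacking vec(), the Hessian matrix of
# f(X) = ||A X||_F^2 equals 2 * kron(I_n, A^T A).
rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))

def f(v):
    X = v.reshape((n, n), order="F")  # order="F" stacks columns
    return np.sum((A @ X)**2)

# Hessian by central second differences
eps = 1e-4
H = np.empty((n * n, n * n))
I = np.eye(n * n)
v0 = rng.standard_normal(n * n)
for i in range(n * n):
    for j in range(n * n):
        H[i, j] = (f(v0 + eps * (I[i] + I[j])) - f(v0 + eps * (I[i] - I[j]))
                   - f(v0 - eps * (I[i] - I[j])) + f(v0 - eps * (I[i] + I[j]))) / (4 * eps**2)

expected = 2 * np.kron(np.eye(n), A.T @ A)
print(np.allclose(H, expected, atol=1e-3))  # True
print(np.all(np.linalg.eigvalsh(H) >= -1e-6))  # True: positive semidefinite
```

Stacking rows instead of columns (i.e. `order="C"` in the reshape) would produce the other form, $2A^TA\otimes I_n$.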

EDIT: Charles, what you wrote is the gradient $\nabla_X(f)=2A^TAX$. The gradient is defined from the matrix inner product $(U,V)=\operatorname{trace}(U^TV)$ by the formula $(\nabla_X(f),H)=D_Xf(H)$; we conclude by identification.
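A similar finite-difference sketch (again an illustration, not part of the original post) confirms the gradient formula $\nabla_X(f)=2A^TAX$:

```python
import numpy as np

# Compare the closed-form gradient 2 A^T A X of f(X) = ||A X||_F^2
# against central finite differences, entry by entry.
rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))

grad = 2 * A.T @ A @ X  # closed form from the edit above

eps = 1e-6
num = np.empty((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        num[i, j] = (np.sum((A @ (X + E))**2) - np.sum((A @ (X - E))**2)) / (2 * eps)

print(np.allclose(grad, num, atol=1e-6))  # True
```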