Hessian on linear least squares problem


I tried to calculate the Hessian matrix of the linear least-squares problem (L2 norm), in particular:

$$f(X) = \|AX - B\|_2$$ where $f:\mathbb{R}^{11\times 2}\rightarrow \mathbb{R}$

Can someone help me?
Thanks a lot.

There are 5 answers below.


First calculate the gradient vector: use the chain rule to take the partial derivatives of $f(x)$ w.r.t. $x \in \mathbb R^n$. You get a function that takes a vector and produces another "vector" $g(x) \in \mathbb R^n$ (this is a slight abuse of notation and terminology: $g$ produces a vector of functions, not a point in $\mathbb R^n$, so it is really a vector-valued operator).

Then take the partial derivatives of $g(x)$ w.r.t. $x$, again applying the chain rule. For that, view $g(x)$ as a vector of simpler functions $g_i(x) \in \mathbb R$, each of which takes a vector and produces a scalar value.

So for each component of $g(x)$ you have a function $g_i(x) \in \mathbb R$. Taking the partial derivatives of $g(x)$ w.r.t. $x$ therefore amounts to taking the partial derivatives of each $g_i(x)$ w.r.t. $x$ and stacking them together. That is the Hessian matrix.

In the same way that the derivative of $f(x)$ w.r.t. $x$ produces a vector operator, the derivative of each $g_i(x)$ w.r.t. $x$ produces a vector operator, and hence the derivative of $g(x)$ w.r.t. $x$ produces a matrix operator called the Hessian matrix.
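As a sketch of this two-step recipe (assuming the common squared objective $f(x)=\tfrac12\|Ax-b\|^2$ and random example data, since the answer leaves $f$ generic), one can build the Hessian column by column from finite differences of the gradient $g(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
b = rng.standard_normal(6)
x = rng.standard_normal(4)

def g(x):
    # Gradient of f(x) = 0.5*||Ax - b||^2 -- the "vector operator" above.
    return A.T @ (A @ x - b)

# Differentiate each component of g w.r.t. x by central differences and
# stack the resulting columns: that stack is the Hessian.
eps = 1e-6
H = np.column_stack([(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                     for e in np.eye(4)])

print(np.allclose(H, A.T @ A, atol=1e-4))  # True
```

Because $g$ is linear here, the finite differences are essentially exact and the stacked columns reproduce $A^TA$.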


Yep, the squared norm is better.

$$\|AX-B\|_F^2 = \operatorname{tr}\big((AX-B)^T(AX-B)\big) = \Big/\text{ simplify }\Big/ = \operatorname{tr}\big(X^TA^TAX\big) + \text{linear & const terms}$$

Now you should see what the Hessian is. If you still don't, you can check out Hessian matrix - use in optimization.


If the problem is linear, the Hessian can be read directly off the second-order term; if the problem is non-linear and solved by a trust-region approach, it is the matrix in the second-order term of the Taylor expansion used within the trust region.
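To make "read the Hessian off the second-order term" concrete, here is a small check (a sketch with random illustrative data, using the vector case $\|Ax-b\|^2$ for simplicity) that the expansion matches the norm and that its quadratic term carries $A^TA$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

f = lambda x: np.sum((A @ x - b) ** 2)            # ||Ax - b||^2
# Expanded form: x^T A^T A x - 2 b^T A x + b^T b
q = lambda x: x @ A.T @ A @ x - 2 * b @ (A @ x) + b @ b

print(np.isclose(f(x), q(x)))  # True
# The Hessian of the quadratic term x^T (A^T A) x is 2 A^T A,
# which is therefore the Hessian of f.
```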


Let $f:\mathbb R^n \to \mathbb R$ be defined by $$ f(x)=\frac12 \|Ax-b\|^2. $$ Notice that $f(x)=g(h(x))$, where $h(x)=Ax-b$ and $g(y) = \frac12 \|y\|^2$. The derivatives of $g$ and $h$ are given by $$ g'(y)=y^T, \quad h'(x)=A. $$ The chain rule tells us that $$ f'(x)=g'(h(x))h'(x) = (Ax-b)^T A. $$ If we use the convention that the gradient is a column vector, then $$ \nabla f(x)=f'(x)^T=A^T(Ax-b). $$

The Hessian $Hf(x)$ is the derivative of the function $x \mapsto \nabla f(x)$, so: $$ Hf(x)= A^T A. $$
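A quick numerical sanity check of this derivation (a sketch with arbitrary random $A$, $b$, not part of the original answer): the analytic gradient $A^T(Ax-b)$ should agree with central finite differences of $f$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 3))
b = rng.standard_normal(7)
x = rng.standard_normal(3)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)

grad = A.T @ (A @ x - b)   # nabla f(x) from the answer
hess = A.T @ A             # Hf(x) from the answer

# Central-difference check of the gradient.
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(3)])

print(np.allclose(grad, num_grad, atol=1e-4))  # True
```

Since the gradient is affine in $x$, differencing it once more would reproduce the constant Hessian $A^TA$ exactly.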


Let $f : \mathbb R^{m \times n} \to \mathbb R$ be defined by

$$f (\mathrm X) := \frac 12 \| \mathrm A \mathrm X - \mathrm B \|_{\text{F}}^2 = \frac 12 \| (\mathrm I_n \otimes \mathrm A) \, \mbox{vec} (\mathrm X) - \mbox{vec} (\mathrm B) \|_2^2$$

where $\mbox{vec}$ is the vectorization operator and $\otimes$ is the Kronecker product. Thus, the Hessian of $f$ is

$$(\mathrm I_n \otimes \mathrm A)^\top (\mathrm I_n \otimes \mathrm A) = \mathrm I_n \otimes \mathrm A^\top \mathrm A$$
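This identity can be verified numerically (a sketch with random data; the column-major `order='F'` flatten matches the usual $\mbox{vec}$ convention). The gradient of $\tfrac12\|AX-B\|_F^2$ is $A^T(AX-B)$, and differencing its vectorization recovers $I_n \otimes A^TA$:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p = 5, 2, 4                 # A: m x p, X: p x n, B: m x n
A = rng.standard_normal((m, p))
B = rng.standard_normal((m, n))
X0 = rng.standard_normal((p, n))

def grad_vec(X):
    # vec of the gradient A^T (A X - B), column-major to match vec().
    return (A.T @ (A @ X - B)).flatten(order='F')

# Numerical Hessian w.r.t. vec(X), built column by column.
eps = 1e-5
k = p * n
H = np.empty((k, k))
for j in range(k):
    d = np.zeros(k)
    d[j] = eps
    D = d.reshape((p, n), order='F')
    H[:, j] = (grad_vec(X0 + D) - grad_vec(X0 - D)) / (2 * eps)

print(np.allclose(H, np.kron(np.eye(n), A.T @ A), atol=1e-4))  # True
```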


Define a new matrix $P=(AX-B)$ and write the function as $$f=\|P\|_F^2 = P:P$$ where the colon denotes the trace/Frobenius product, i.e. $\,\,A:B={\rm tr}(A^TB)$

Find the differential and gradient of $f$: $$\eqalign{ df &= 2P:dP = 2P:A\,dX = 2A^TP:dX \cr G &= \frac{\partial f}{\partial X} = 2A^TP \cr }$$

Now find the differential and gradient of $G$: $$\eqalign{ dG &= 2A^T\,dP = 2A^TA\,dX = 2A^TA{\mathcal E}:dX \cr {\mathcal H} &= \frac{\partial G}{\partial X} = 2A^TA{\mathcal E} \cr }$$

Note that both $({\mathcal H},{\mathcal E})$ are fourth-order tensors, the latter having components $${\mathcal E}_{ijkl} = \delta_{ik} \delta_{jl}$$

So far everyone has answered a modified form of your question by squaring the function. If you truly need the Hessian of your original function, here it is: $$\eqalign{ f &= \|P\| \cr f^2 &= \|P\|^2 = P:P \cr f\,df &= P:dP = A^TP:dX \cr G &= A^TP\,f^{-1} \cr dG &= A^T\,dP\,f^{-1} - A^TP\,df\,f^{-2} \cr &= f^{-1}A^TA\,dX - f^{-3}(A^TP)(A^TP:dX) \cr &= \Big[f^{-1}A^TA{\mathcal E} - f^{-3}(A^TP)\star(A^TP)\Big]:dX \cr {\mathcal H} &= f^{-1}A^TA{\mathcal E} - f^{-3}(A^TP)\star(A^TP) \cr }$$ where $\star$ is the tensor product, i.e. $${\mathcal M}=B\star C \implies {\mathcal M}_{ijkl} = B_{ij}\,C_{kl}$$
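In vectorized form this fourth-order $\mathcal H$ becomes the $pn \times pn$ matrix $f^{-1}(I_n \otimes A^TA) - f^{-3}\,\mathrm{vec}(A^TP)\,\mathrm{vec}(A^TP)^T$, which can be checked against finite differences of the gradient (a numerical sketch with random illustrative data, not part of the original answer):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p = 5, 2, 3
A = rng.standard_normal((m, p))
B = rng.standard_normal((m, n))
X = rng.standard_normal((p, n))

P = A @ X - B
fX = np.linalg.norm(P, 'fro')              # f = ||P||
v = (A.T @ P).flatten(order='F')           # vec(A^T P)

# Vectorized Hessian of the *unsquared* norm from the answer above.
H = np.kron(np.eye(n), A.T @ A) / fX - np.outer(v, v) / fX**3

def grad_vec(X):
    # vec of G = A^T P / f, column-major to match vec().
    P = A @ X - B
    return (A.T @ P / np.linalg.norm(P, 'fro')).flatten(order='F')

# Central-difference check, column by column.
eps = 1e-6
k = p * n
num_H = np.empty((k, k))
for j in range(k):
    d = np.zeros(k)
    d[j] = eps
    D = d.reshape((p, n), order='F')
    num_H[:, j] = (grad_vec(X + D) - grad_vec(X - D)) / (2 * eps)

print(np.allclose(H, num_H, atol=1e-4))
```

The rank-one correction term is exactly the $\star$ product above after vectorization, which is why `np.outer` appears.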