I am trying to calculate the Hessian matrix of a linear least squares problem (L2 norm), in particular:
$$f(X) = \|AX - B \|_2$$ where $f:{\rm I\!R}^{11\times 2}\rightarrow {\rm I\!R}$
Can someone help me?
Thanks a lot.
Yes, the squared norm is better to work with:
$$\|AX-B\|_F^2 = (AX-B)^T(AX-B) = \Big/\text{ simplify }\Big/ = X^TA^TAX + \text{linear & const terms}$$
Now you should see what the Hessian is. If you still don't you can check out Hessian matrix - use in optimization.
If the problem is linear, the Hessian can be read off directly from the second-order term; if the problem is non-linear and solved by a trust-region approach, the Hessian is the matrix in the second-order term of the Taylor expansion around the trust-region center.
Let $f:\mathbb R^n \to \mathbb R$ be defined by $$ f(x)=\frac12 \|Ax-b\|^2. $$ Notice that $f(x)=g(h(x))$, where $h(x)=Ax-b$ and $g(y) = \frac12 \|y\|^2$. The derivatives of $g$ and $h$ are given by $$ g'(y)=y^T, \quad h'(x)=A. $$ The chain rule tells us that $$ f'(x)=g'(h(x))h'(x) = (Ax-b)^T A. $$ If we use the convention that the gradient is a column vector, then $$ \nabla f(x)=f'(x)^T=A^T(Ax-b). $$
The Hessian $Hf(x)$ is the derivative of the function $x \mapsto \nabla f(x)$, so: $$ Hf(x)= A^T A. $$
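This gradient and Hessian are easy to verify numerically. A minimal NumPy sketch (the sizes, seed, and data are arbitrary choices for illustration, not taken from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 11, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def f(x):
    # f(x) = 0.5 * ||Ax - b||^2
    r = A @ x - b
    return 0.5 * r @ r

x = rng.standard_normal(n)
grad = A.T @ (A @ x - b)   # closed-form gradient from the chain rule
H = A.T @ A                # closed-form Hessian (constant in x)

# central-difference check of the gradient
eps = 1e-6
fd_grad = np.array([
    (f(x + eps * np.eye(n)[i]) - f(x - eps * np.eye(n)[i])) / (2 * eps)
    for i in range(n)
])
assert np.allclose(grad, fd_grad, atol=1e-5)
```

Note that the Hessian does not depend on $x$ at all, which is exactly what makes the least-squares objective quadratic.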
Let $f : \mathbb R^{m \times n} \to \mathbb R$ be defined by
$$f (\mathrm X) := \frac 12 \| \mathrm A \mathrm X - \mathrm B \|_{\text{F}}^2 = \frac 12 \| (\mathrm I_n \otimes \mathrm A) \, \mbox{vec} (\mathrm X) - \mbox{vec} (\mathrm B) \|_2^2$$
where $\mbox{vec}$ is the vectorization operator and $\otimes$ is the Kronecker product. Thus, the Hessian of $f$ is
$$(\mathrm I_n \otimes \mathrm A)^\top (\mathrm I_n \otimes \mathrm A) = \mathrm I_n \otimes \mathrm A^\top \mathrm A$$
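The Kronecker identity in the last line can be checked numerically; a small sketch (with arbitrary sizes, where $\mathrm X$ is $p \times n$ and $\mathrm A$ is $m \times p$):

```python
import numpy as np

rng = np.random.default_rng(1)
m, p, n = 5, 3, 2                      # A is m×p, X is p×n (arbitrary sizes)
A = rng.standard_normal((m, p))

# Hessian of 0.5*||AX - B||_F^2 with respect to vec(X), two ways:
K = np.kron(np.eye(n), A)              # (I_n ⊗ A), so that (I_n ⊗ A) vec(X) = vec(AX)
H_direct = K.T @ K                     # (I_n ⊗ A)^T (I_n ⊗ A)
H_kron = np.kron(np.eye(n), A.T @ A)   # I_n ⊗ (A^T A)
assert np.allclose(H_direct, H_kron)
```

The block-diagonal structure $\mathrm I_n \otimes \mathrm A^\top \mathrm A$ reflects the fact that each column of $\mathrm X$ is an independent least-squares problem with the same matrix $\mathrm A$.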
Define a new matrix $P=(AX-B)$ and write the function as $$f=\|P\|_F^2 = P:P$$ where the colon denotes the trace/Frobenius product, i.e. $\,\,A:B={\rm tr}(A^TB)$
Find the differential and gradient of $f$
$$\eqalign{
df &= 2P:dP = 2P:A\,dX = 2A^TP:dX \cr
G &= \frac{\partial f}{\partial X} = 2A^TP \cr
}$$
Now find the differential and gradient of $G$
$$\eqalign{
dG &= 2A^T\,dP = 2A^TA\,dX = 2A^TA{\mathcal E}:dX \cr
{\mathcal H} &= \frac{\partial G}{\partial X} = 2A^TA{\mathcal E} \cr
}$$
Note that both $({\mathcal H},{\mathcal E})$ are fourth-order tensors, the latter having components
$${\mathcal E}_{ijkl} = \delta_{ik} \delta_{jl}$$
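The gradient $G = 2A^TP$ from this derivation can be checked component by component with central differences; a sketch with arbitrary sizes and data:

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, n = 6, 4, 3                      # A is m×p, X is p×n, B is m×n (arbitrary)
A = rng.standard_normal((m, p))
B = rng.standard_normal((m, n))
X = rng.standard_normal((p, n))

def f(X):
    # f(X) = ||AX - B||_F^2 = P:P with P = AX - B
    return np.sum((A @ X - B) ** 2)

G = 2 * A.T @ (A @ X - B)              # gradient from the derivation above

# central-difference check, one matrix entry at a time
eps = 1e-6
fd = np.zeros_like(X)
for i in range(p):
    for j in range(n):
        E = np.zeros_like(X)
        E[i, j] = eps
        fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
assert np.allclose(G, fd, atol=1e-5)
```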
So far everyone has answered a modified form of your question by squaring the function.
If you truly need the Hessian of your original (unsquared) function, here it is:
$$\eqalign{
f &= \|P\| \cr
f^2 &= \|P\|^2 = P:P \cr
f\,df &= P:dP = A^TP:dX \cr
G &= A^TP\,f^{-1} \cr
dG &= A^T\,dP\,f^{-1} - A^TP\,df\,f^{-2} \cr
&= f^{-1}A^TA\,dX - f^{-3}(A^TP)(A^TP:dX) \cr
&= \Big[f^{-1}A^TA{\mathcal E} - f^{-3}(A^TP)\star(A^TP)\Big]:dX \cr
{\mathcal H} &= f^{-1}A^TA{\mathcal E} - f^{-3}(A^TP)\star(A^TP) \cr
}$$
where $\star$ is the tensor product, i.e.
$${\mathcal M}=B\star C \implies {\mathcal M}_{ijkl} = B_{ij}\,C_{kl}$$
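This fourth-order Hessian can be assembled explicitly and compared against second-order finite differences of the unsquared norm; a sketch with arbitrary sizes and data:

```python
import numpy as np

rng = np.random.default_rng(3)
m, p, n = 6, 4, 3                      # A is m×p, X is p×n, B is m×n (arbitrary)
A = rng.standard_normal((m, p))
B = rng.standard_normal((m, n))
X = rng.standard_normal((p, n))

def f(X):
    # f(X) = ||AX - B||_F (no square)
    return np.linalg.norm(A @ X - B, 'fro')

fval = f(X)
AtP = A.T @ (A @ X - B)                # A^T P with P = AX - B

# H_ijkl = f^{-1} (A^T A)_ik δ_jl - f^{-3} (A^T P)_ij (A^T P)_kl
H = (np.einsum('ik,jl->ijkl', A.T @ A, np.eye(n)) / fval
     - np.einsum('ij,kl->ijkl', AtP, AtP) / fval ** 3)

# second-order central-difference check of every component
eps = 1e-4
fd = np.zeros((p, n, p, n))
for i in range(p):
    for j in range(n):
        for k in range(p):
            for l in range(n):
                E1 = np.zeros((p, n)); E1[i, j] = eps
                E2 = np.zeros((p, n)); E2[k, l] = eps
                fd[i, j, k, l] = (f(X + E1 + E2) - f(X + E1 - E2)
                                  - f(X - E1 + E2) + f(X - E1 - E2)) / (4 * eps ** 2)
assert np.allclose(H, fd, atol=1e-3)
```

The rank-one correction term $f^{-3}(A^TP)\star(A^TP)$ is what distinguishes this Hessian from the (scaled) squared-norm case, and it is why the unsquared norm is not twice differentiable at points where $P = 0$.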
First calculate the gradient vector: use the chain rule and compute the partial derivatives of $f(x)$ with respect to $x \in R^n$. You will get a function that takes a vector and produces another "vector" $g(x) \in R^n$ (this is a slight abuse of notation and terminology: $g(x)$ produces a vector of functions, not a vector in $R^n$, so it is really a "vector operator").
Then take the partial derivatives of $g(x)$ with respect to $x$, again applying the chain rule. For that you can view $g(x)$ as a vector of simpler functions $g_i(x) \in R$, each of which takes a vector and produces a scalar value.
So for each component of $g(x)$ you have a function $g_i(x) \in R$. Taking the partial derivatives of $g(x)$ with respect to $x$ therefore amounts to taking the partial derivatives of each $g_i(x)$ with respect to $x$ and putting them together. That is the Hessian matrix.
In the same way that we view the derivative of $f(x)$ with respect to $x$ as producing a vector operator, we can view the derivative of each $g_i(x)$ with respect to $x$ as producing a vector operator, and hence the derivative of $g(x)$ with respect to $x$ produces a matrix operator called the Hessian matrix.
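This "differentiate the gradient component by component" recipe translates directly into a numerical scheme. A minimal sketch (the helper names and the test problem are illustrative choices, not from the answer):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def numerical_hessian(f, x, eps=1e-4):
    """Row i of the Hessian is the derivative of the gradient in the
    direction e_i, exactly the recipe described above."""
    H = np.zeros((x.size, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        H[i] = (numerical_gradient(f, x + e) - numerical_gradient(f, x - e)) / (2 * eps)
    return H

# example: f(x) = 0.5*||Ax - b||^2, whose Hessian should be A^T A
rng = np.random.default_rng(4)
A = rng.standard_normal((7, 3))
b = rng.standard_normal(7)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
x0 = rng.standard_normal(3)
assert np.allclose(numerical_hessian(f, x0), A.T @ A, atol=1e-4)
```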