Deriving the Jacobian and Hessian of the nonlinear least-squares function


I'm working on a problem that involves deriving the Jacobian and Hessian of the following nonlinear least-squares function. I've been thinking of expanding the function with a Taylor series and then trying to match terms to identify the Jacobian and Hessian, but I'm stuck right now.

$f(x) = \sum_{i=1}^m [y_i - (a_i^Tx)^2]^2$

where $x \in \Bbb R^{n}$ and each $a_i \in \Bbb R^{n}$ is a given vector, so $a_i^T$ is a row vector and $a_i^T x$ is a scalar.

Multiple sources give the derivation in a general form that doesn't let me express the gradient and Hessian in a clean vector-matrix form. Any help/hint would be incredibly helpful.

2 Answers

Best answer:

Here's my attempt to do it in a fairly clean way.

Your function $f$ can be written as $$ f(x) = \sum_{i=1}^m (y_i - h_i(x))^2 $$ where $h_i(x) = (a_i^T x)^2$. By the chain rule, \begin{align} f'(x) &= \sum_{i=1}^m 2(y_i - h_i(x)) (-h_i'(x)) \\ &= \sum_{i=1}^m 2(y_i - (a_i^T x)^2) (-2 (a_i^T x) a_i^T) \\ &= -4 \sum_{i=1}^m (y_i - (a_i^T x)^2) x^T a_i a_i^T. \end{align} Note that $f'(x)$ is a row vector. The gradient of $f$ is the column vector $$ \nabla f(x) = f'(x)^T = -4 \sum_{i=1}^m a_i a_i^T x (y_i - (a_i^T x)^2). $$ The Hessian of $f$ is the derivative of the function $G(x) = \nabla f(x)$. One way to compute the Hessian of $f$ is to notice that

\begin{align} G(x) = \nabla f(x) &= -4 \sum_{i=1}^m a_i g_i(a_i^T x), \end{align} where $g_i(u) = u(y_i - u^2) = u y_i - u^3$. The derivative of $g_i$ is $g_i'(u) = y_i - 3u^2$. So we have \begin{align} Hf(x) = G'(x) &= -4 \sum_{i=1}^m a_i g_i'(a_i^T x) a_i^T \\ &= -4 \sum_{i=1}^m a_i (y_i - 3 (a_i^T x)^2) a_i^T. \end{align}
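As a quick sanity check (not part of the original answer), here is a NumPy sketch that verifies both formulas above, $\nabla f(x) = -4\sum_i a_i (a_i^Tx)(y_i - (a_i^Tx)^2)$ and $Hf(x) = -4\sum_i a_i (y_i - 3(a_i^Tx)^2) a_i^T$, against central finite differences. The data are random toy values; the sizes `m`, `n` and the seed are arbitrary choices.

```python
import numpy as np

# Random toy problem: rows of A are the vectors a_i^T (sizes are arbitrary)
rng = np.random.default_rng(0)
m, n = 6, 4
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
x = rng.standard_normal(n)

def f(x):
    return np.sum((y - (A @ x) ** 2) ** 2)

def grad(x):
    # -4 * sum_i a_i (a_i^T x) (y_i - (a_i^T x)^2)
    w = A @ x
    return -4 * A.T @ (w * (y - w ** 2))

def hess(x):
    # -4 * sum_i a_i (y_i - 3 (a_i^T x)^2) a_i^T
    w = A @ x
    return -4 * (A.T * (y - 3 * w ** 2)) @ A

# Central finite differences of f (for the gradient) and of grad (for the Hessian)
eps = 1e-6
E = np.eye(n)
g_num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in E])
H_num = np.column_stack([(grad(x + eps * e) - grad(x - eps * e)) / (2 * eps)
                         for e in E])
print(np.allclose(g_num, grad(x), atol=1e-4),
      np.allclose(H_num, hess(x), atol=1e-4))
```

The broadcast `A.T * (y - 3 * w ** 2)` scales column $i$ of $A^T$ (which is $a_i$) by the scalar $y_i - 3(a_i^Tx)^2$, so the product with `A` reproduces the sum of rank-one terms.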

Another answer:

For typing convenience, define the variables $$\eqalign{ A &= \big[\matrix{a_1&a_2&\ldots&a_m}\big]^T \\ w &= Ax,\qquad\qquad W = {\rm Diag}(w),\quad &dw = A\,dx \\ p &= w\odot w-y,\quad P = {\rm Diag}(p),\quad &dp = 2w\odot dw = 2WA\,dx \\ }$$ where $(\odot)$ represents the elementwise/Hadamard product.

Write the function in terms of these new vectors; note that $f = p:p$ matches the original sum, since squaring removes the sign flip in $p = w\odot w - y$.
Then calculate the differential and gradient. $$\eqalign{ f &= p:p \\ df &= 2p:dp \\ &= 2p:(2WA\,dx) \\ &= 4A^TWp:dx \\ \frac{\partial f}{\partial x} &= 4A^TWp \,=\, 4A^TPw \;\triangleq\, g \\ }$$ Now calculate the gradient of the gradient (aka the Hessian) $$\eqalign{ dg &= 4A^TW\,dp + 4A^TP\,dw \\ &= 8A^TW^2A\,dx + 4A^TPA\,dx \\ \frac{\partial g}{\partial x} &= 8A^TW^2A + 4A^TPA \;\triangleq\, H \\ \\ }$$


In several steps above, $(:)$ has been used to denote the trace/Frobenius product, i.e. $$A:B = {\rm Tr}(A^TB)$$ The cyclic property of the trace allows the terms to be rearranged in a number of ways, e.g. $$\eqalign{ A:B &= B^T:A^T &= B:A \\ A:BC &= B^TA:C &= AC^T:B \\ }$$
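The rearrangement rules for the Frobenius product can likewise be confirmed on random matrices; this small sketch (my addition, with arbitrary shapes) checks $A:B = {\rm Tr}(A^TB)$, $A:BC = B^TA:C$, and $A:BC = AC^T:B$.

```python
import numpy as np

# Random matrices with compatible shapes: A is 4x5, B C is (4x3)(3x5) = 4x5
rng = np.random.default_rng(3)
A = rng.standard_normal((4, 5))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((3, 5))

def frob(X, Y):
    # X : Y = Tr(X^T Y), the trace/Frobenius product
    return np.trace(X.T @ Y)

# Cyclic rearrangements:  A : BC = B^T A : C = A C^T : B
lhs = frob(A, B @ C)
print(np.isclose(lhs, frob(B.T @ A, C)), np.isclose(lhs, frob(A @ C.T, B)))
```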