The connection between the Jacobian, Hessian and the gradient?

In this Wikipedia article they have this to say about the gradient:

If $m = 1$, $\mathbf{f}$ is a scalar field and the Jacobian matrix is reduced to a row vector of partial derivatives of $\mathbf{f}$—i.e. the gradient of $\mathbf{f}$.

As well as

The Jacobian of the gradient of a scalar function of several variables has a special name: the Hessian matrix, which in a sense is the "second derivative" of the function in question.

So I tried doing the calculations, and was stumped.

If we let $f: \mathbb{R}^n \to \mathbb{R}$, then $$Df = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \dots & \frac{\partial f}{\partial x_n} \end{bmatrix} = \nabla f$$ So far so good, but when I try to calculate the Jacobian matrix of the gradient I get $$D^2f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_2 \partial x_1} & \dots & \frac{\partial^2 f}{\partial x_n \partial x_1} \\ \frac{\partial^2 f}{\partial x_1 \partial x_2} & \frac{\partial^2 f}{\partial x_2^2} & \dots & \frac{\partial^2 f}{\partial x_n \partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_1 \partial x_n} & \frac{\partial^2 f}{\partial x_2 \partial x_n} & \dots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$ which, according to this article, is not the Hessian matrix but rather its transpose, and from what I can gather the Hessian is not symmetric in general.

So I have two questions: is the gradient generally thought of as a row vector? And did I do something wrong when I calculated the Jacobian of the gradient of $f$, or is the Wikipedia article incorrect?

3 Answers

BEST ANSWER

You did not do anything wrong in your calculation. If you directly compute the Jacobian of the gradient of $f$ with the conventions you used, you will end up with the transpose of the Hessian. This is noted more clearly in the introduction to the Hessian on Wikipedia (https://en.wikipedia.org/wiki/Hessian_matrix) where it says

The Hessian matrix can be considered related to the Jacobian matrix by $\mathbf{H}(f(\mathbf{x})) = \mathbf{J}(\nabla f(\mathbf{x}))^T$.

The other Wikipedia article should probably be updated to use matching language.

As for the gradient of $f$ being defined as a row vector, that is the way I have seen it more often, but it is noted at https://en.wikipedia.org/wiki/Matrix_calculus#Layout_conventions that there are competing conventions for general matrix derivatives. However, I don't think that should change your answer for the Hessian: with the conventions you are using, you are correct that it should be transposed.
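As a concrete check, here is a minimal SymPy sketch; the function $f$ and its three variables are illustrative choices, not anything from the question. It computes the Jacobian of the gradient and confirms that its transpose equals SymPy's Hessian. Since this $f$ is smooth, the Hessian is symmetric, so the transpose makes no numerical difference, only a notational one.

```python
# Minimal SymPy sketch: Jacobian of the gradient vs. the Hessian.
# f and the variables x1, x2, x3 are illustrative assumptions.
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
X = sp.Matrix([x1, x2, x3])
f = x1**2 * x2 + sp.exp(x1 * x3)      # any smooth scalar field

Df = sp.Matrix([f]).jacobian(X)       # 1 x 3 row vector: the derivative Df
grad = Df.T                           # 3 x 1 column vector: the gradient
J_grad = grad.jacobian(X)             # Jacobian of the gradient, 3 x 3
H = sp.hessian(f, (x1, x2, x3))       # SymPy's Hessian

# H = J(grad f)^T, as the Wikipedia Hessian article states ...
assert sp.simplify(J_grad.T - H) == sp.zeros(3, 3)
# ... and since f is C^2, the Hessian is symmetric, so J(grad f) = H too.
assert sp.simplify(J_grad - H) == sp.zeros(3, 3)
```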

ANSWER

In A.4.1 of Boyd & Vandenberghe's (B&V's) Convex Optimization book, given a scalar function $f(x)$ with $x\in \mathbf{R}^n$, the transpose of the derivative (or Jacobian) of $f$ is called the gradient of the function:

$$ \nabla f(x)=Df(x)^T $$

where $\nabla f(x)$ is a column vector and $Df(x)$ is a row vector. In the machine learning (ML) community, one often writes $$H(f(x))=\nabla^2 f(x)=\nabla\nabla^T f(x)=\nabla Df(x)=(D\nabla f(x))^T,$$ where the third equality is meant notationally, via the symbol $\nabla\nabla^T f$: it is not rigorous, but it makes sense when combined with the last equality, $(D\nabla f(x))^T$ (exactly the relation in Scott Staniewicz's answer). Numerically, taking the gradient of the derivative and transposing the derivative of the gradient both yield the Hessian matrix.

Note that the notation $\nabla\nabla^T f$ is correct and standard. Strictly speaking, $\nabla^2 f$ is not standard notation in the mathematical sense (it is often reserved for the Laplacian), although it is commonly used in the ML community.

Please do not take the third and fourth equalities too seriously; they are just other ways to connect the Jacobian, the Hessian, and the gradient.
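As a purely numerical sanity check of this chain, here is a small sketch using central finite differences in NumPy. The test function f, the evaluation point x0, and the step size h are illustrative assumptions, not anything from the book or this answer; it verifies that the matrix $(D\nabla f(x))^T$ comes out (approximately) symmetric, as the Hessian of a smooth function should.

```python
# Numerical sketch of H = (D grad f)^T with central finite differences.
# f, the point x0, and the step h are illustrative assumptions.
import numpy as np

def f(x):
    return x[0]**2 * x[1] + np.sin(x[0] * x[2])

def grad(f, x, h=1e-5):
    """Central-difference approximation of the gradient (first partials)."""
    g = np.zeros(len(x))
    for i in range(len(x)):
        e = np.zeros(len(x))
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def jac_of_grad(f, x, h=1e-5):
    """D(grad f): column j holds the partials of grad f with respect to x_j."""
    n = len(x)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (grad(f, x + e) - grad(f, x - e)) / (2 * h)
    return J

x0 = np.array([0.3, -1.2, 0.7])
H = jac_of_grad(f, x0).T              # the Hessian, via (D grad f)^T
print(np.allclose(H, H.T, atol=1e-4)) # approximately symmetric: True
```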

ANSWER

Let us take these one by one. Following the numerator layout convention, the gradient of $f(x): \mathbf{R}^n \rightarrow \mathbf{R}$ with respect to $x$ is a column vector, as follows: $$ \nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}\\ \frac{\partial f}{\partial x_2}\\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbf{R}^n $$

The Hessian is the second-order derivative with respect to $x$; it is a square matrix whose entries are $[H_f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$, where $i$ is the row and $j$ is the column. The Hessian matrix is $$ H_f(x) = \nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2_1} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}\\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x^2_2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n}\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x^2_n} \end{bmatrix} \in \mathbf{R}^{n \times n} $$
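For a concrete illustration, here is a short SymPy sketch that builds the Hessian entry by entry from this definition and compares it with sympy.hessian; the function $f$ here is an arbitrary smooth choice, not from the answer.

```python
# Building the Hessian entry by entry from H_ij = d^2 f / (dx_i dx_j).
# f is an illustrative smooth function.
import sympy as sp

x = sp.symbols('x1 x2 x3')
f = x[0]**2 * x[1] + sp.cos(x[1] * x[2])

n = len(x)
# Entry (i, j): differentiate f by x_i, then by x_j (i = row, j = column).
H = sp.Matrix(n, n, lambda i, j: sp.diff(f, x[i], x[j]))

assert H == sp.hessian(f, x)          # matches SymPy's built-in Hessian
assert H == H.T                       # symmetric, since f is smooth
```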

I would suggest having a look at Appendix D of Dattorro's book Convex Optimisation.

Now, regarding the relation between the gradient, the Jacobian, and the Hessian, here is a summary based on the same numerator layout convention.

  • The gradient is the transpose of the Jacobian, i.e. $\nabla f = (J f)^T$.
  • The Hessian is the derivative of the gradient, i.e. $H f = J(\nabla f)$.

Let's apply $J(\nabla f)$ to the first entry of the gradient, $\frac{\partial f}{\partial x_1}$. Here the Jacobian is just the row of partial-derivative operators $\frac{\partial}{\partial x}$, so it produces a row vector:

$$ \frac{\partial}{\partial x}\left ( \frac{\partial f}{\partial x_1} \right ) = \begin{bmatrix} \frac{\partial}{\partial x_1}\left ( \frac{\partial f}{\partial x_1} \right ) & \frac{\partial}{\partial x_2}\left ( \frac{\partial f}{\partial x_1} \right ) & \cdots & \frac{\partial}{\partial x_n}\left ( \frac{\partial f}{\partial x_1} \right ) \end{bmatrix} \in \mathbf{R}^{1 \times n}, $$ which matches the first row of the Hessian matrix above.

Just remember that $\frac{\partial^2 f}{\partial x_1 \partial x_2} = \frac{\partial \left ( \frac{\partial f}{\partial x_1} \right )}{\partial x_2} = \frac{\partial \left ( \frac{\partial f}{\partial x_2} \right )}{\partial x_1} = \frac{\partial^2 f}{\partial x_2 \partial x_1}$; this equality of mixed partials is Clairaut's theorem, and it holds whenever the second partials are continuous.
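As a quick check of this equality of mixed partials, here is a small SymPy sketch on an illustrative smooth function (the choice of $f$ is an assumption for demonstration only).

```python
# Checking d^2 f/(dx1 dx2) = d^2 f/(dx2 dx1) for an illustrative smooth f.
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = x1**3 * sp.exp(x2) + x1 * x2**2

lhs = sp.diff(sp.diff(f, x1), x2)     # differentiate by x1, then x2
rhs = sp.diff(sp.diff(f, x2), x1)     # differentiate by x2, then x1
assert sp.simplify(lhs - rhs) == 0    # equal, by Clairaut's theorem
```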

A proof of the Hessian relation can be seen in Section A.4.3 of the B&V Convex Optimisation book, where the authors state: "the gradient mapping is the function $\nabla f: \mathbf{R}^n \rightarrow \mathbf{R}^n$, with $\mathbf{dom}\, \nabla f = \mathbf{dom}\, f$, with value $\nabla f(x)$ at $x$. The derivative of this mapping is $D \nabla f(x) = \nabla^2 f(x)$."

So, in the authors' words, the Hessian is the Jacobian of the gradient of $f(x)$ under the book's convention, which I believe is the numerator layout convention.