Stuck on proving the Hessian chain rule when $h(x) = g(f(x))$ is a real valued function.


[Screenshot of the identity to prove: $\nabla^2 h(x) = g^{\prime\prime}(f(x))\,\nabla f(x)\nabla f(x)^T + g^\prime(f(x))\,\nabla^2 f(x)$, where $f:\mathbb R^n\to\mathbb R$ and $g:\mathbb R\to\mathbb R$.]


I wish to prove the chain-rule identity shown above. However, I am not able to produce the outer product.

Let $\nabla$ be the gradient operator. Then I know for $h(x) = g(f(x))$ as shown above,

$$\nabla h(x) = \nabla(g(f(x))) = g^\prime(f(x)) \nabla f(x).$$

Now I take the derivative on both sides and use the property that $\nabla^2 h(x) = D(\nabla h(x))$.

$$D(\nabla h(x)) = D(g^\prime(f(x)) \nabla f(x))$$

By the chain rule, I have,

$$D(\nabla h(x)) = D(g^\prime(f(x))) \nabla f(x) + g^\prime(f(x)) D(\nabla f(x))$$

Using $D(\nabla f(x)) = \nabla^2 f(x)$, $D(g^\prime(f(x))) = g^{\prime\prime}(f(x))\,D(f(x))$, and $D(f(x)) = \nabla f(x)^T$, I have,

$$D(\nabla h(x)) = \nabla^2 h(x) = g^{\prime\prime}(f(x))\nabla f(x)^T \nabla f(x) + g^\prime(f(x)) \nabla^2 f(x)$$

I have a dimensionality problem in the term $g^{\prime\prime}(f(x))\nabla f(x)^T \nabla f(x)$. Does anyone know how to fix this?
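To see the mismatch concretely, here is a small NumPy sketch (my own illustration, not part of the original post): with a column-vector gradient, $\nabla f(x)^T \nabla f(x)$ is a $1\times 1$ inner product, whereas the Hessian must be $n \times n$.

```python
import numpy as np

n = 3
grad_f = np.arange(1.0, n + 1).reshape(n, 1)  # column vector, shape (3, 1)

inner = grad_f.T @ grad_f  # (1, 3) @ (3, 1) -> shape (1, 1): a scalar
outer = grad_f @ grad_f.T  # (3, 1) @ (1, 3) -> shape (3, 3): matches a Hessian

print(inner.shape)  # (1, 1)
print(outer.shape)  # (3, 3)
```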


2 Answers

Answer 1 (2 votes):

The mistake is in the last step, where you applied an invalid product rule. On the left-hand side, the function $x \mapsto \nabla h(x)$ is vector-valued, and vector-valued functions do not have gradients but Jacobians. Since you wrote the Hessian down correctly, you can remember it this way: $\nabla^2 f(x)$ is the Hessian of the scalar-valued function $f$, and it is also the Jacobian of the vector-valued function $x \mapsto \nabla f(x)$.

When it comes to proving formulas in multivariate calculus, I find it convenient to simply write everything out coordinate-wise. First, the gradient is $$ \partial_{x_i} h(x) = g'(f(x)) (\nabla f(x))_i $$ and then $$ \partial_{x_j} \partial_{x_i} h(x) = \partial_{x_j} \Big(g'(f(x))(\nabla f(x))_i\Big) = g''(f(x))(\nabla f(x))_i(\nabla f(x))_j + g'(f(x)) (\nabla^2 f(x))_{ij}, $$ which shows that you indeed have an outer product of the gradient vector $\nabla f(x)$ with itself.
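The coordinate-wise formula can be sanity-checked numerically. Below is a sketch (my own choice of example functions, not from the original answer) comparing the outer-product formula against a central finite-difference Hessian of $h = g \circ f$:

```python
import numpy as np

# Hypothetical example functions: f(x) = x0^2 + 2*x1*x2, g(t) = sin(t).
f = lambda x: x[0]**2 + 2*x[1]*x[2]
grad_f = lambda x: np.array([2*x[0], 2*x[2], 2*x[1]])
hess_f = lambda x: np.array([[2., 0., 0.], [0., 0., 2.], [0., 2., 0.]])
g, dg, d2g = np.sin, np.cos, lambda t: -np.sin(t)

x = np.array([0.3, -0.7, 1.1])

# Chain-rule formula: g''(f) * (grad f)(grad f)^T + g'(f) * hess f
H_formula = d2g(f(x)) * np.outer(grad_f(x), grad_f(x)) + dg(f(x)) * hess_f(x)

# Central finite differences of h(x) = g(f(x))
h = lambda x: g(f(x))
eps = 1e-4
H_fd = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        e_i, e_j = np.eye(3)[i], np.eye(3)[j]
        H_fd[i, j] = (h(x + eps*e_i + eps*e_j) - h(x + eps*e_i - eps*e_j)
                      - h(x - eps*e_i + eps*e_j) + h(x - eps*e_i - eps*e_j)) / (4*eps**2)

# The two Hessians agree up to finite-difference error.
print(np.max(np.abs(H_formula - H_fd)))
```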

Answer 2 (0 votes):

The problem is in your product rule; the correct formula is:

If $\alpha : \mathbb R^{n}\to \mathbb R$ and $\Gamma : \mathbb R^n \to \mathbb R^{m}$, then $$D\left(\alpha\Gamma\right)(x) = \Gamma(x)\, D(\alpha)(x) + \alpha(x)\, D(\Gamma)(x) = \Gamma(x)\nabla \alpha(x)^\intercal + \alpha(x) D\left(\Gamma\right)(x)$$

Indeed,

\begin{align} (\alpha \Gamma)(x+h) - (\alpha\Gamma)(x) &= \left(\alpha(x+h) - \alpha(x)\right) \Gamma(x+h) + \alpha(x) \left(\Gamma(x+h) - \Gamma(x)\right)\\ &= \left(\nabla\alpha(x)^\intercal h\right)\Gamma(x) + \alpha(x) D(\Gamma)(x) h + o(\left\|h\right\|)\\ &= \left(\Gamma(x)\nabla \alpha(x)^\intercal + \alpha(x)D(\Gamma)(x)\right) h + o(\left\|h\right\|) \end{align}

If you want to understand why the first term must be written $\Gamma(x)\nabla\alpha(x)^\intercal$ and not in the other order, it is because $\nabla\alpha(x)^\intercal\,\Gamma(x)$ is not a valid matrix product (a $1\times n$ row times an $m\times 1$ column), while $\Gamma(x)\nabla\alpha(x)^\intercal$ is a valid $m\times n$ outer product.
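This product rule can also be checked numerically. Here is a sketch with my own hypothetical choices of $\alpha$ and $\Gamma$ (with $n=2$, $m=3$), comparing the formula $D(\alpha\Gamma) = \Gamma\,\nabla\alpha^\intercal + \alpha\, D\Gamma$ against a finite-difference Jacobian:

```python
import numpy as np

# Hypothetical examples: alpha(x) = x0*x1, Gamma(x) = (x0^2, x1, x0 + x1).
alpha = lambda x: x[0]*x[1]
grad_alpha = lambda x: np.array([x[1], x[0]])            # gradient of alpha, shape (2,)
Gamma = lambda x: np.array([x[0]**2, x[1], x[0] + x[1]])
J_Gamma = lambda x: np.array([[2*x[0], 0.], [0., 1.], [1., 1.]])  # 3x2 Jacobian

x = np.array([0.8, -0.5])

# Product-rule formula: D(alpha*Gamma) = Gamma grad_alpha^T + alpha * D(Gamma), a 3x2 matrix
J_formula = np.outer(Gamma(x), grad_alpha(x)) + alpha(x) * J_Gamma(x)

# Central finite differences of x -> alpha(x) * Gamma(x), column by column
F = lambda x: alpha(x) * Gamma(x)
eps = 1e-6
J_fd = np.column_stack([(F(x + eps*np.eye(2)[j]) - F(x - eps*np.eye(2)[j])) / (2*eps)
                        for j in range(2)])

# The two Jacobians agree up to finite-difference error.
print(np.max(np.abs(J_formula - J_fd)))
```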