On the Hessian of the Log-Determinant and the solution provided in Stephen Boyd's textbook


This is a follow-up to another question on the second-order approximation to the log-determinant in Boyd's textbook; the relevant excerpt:

[image: excerpt from Boyd's textbook]

Here, $f$ is the log-determinant, $f(Z) = \log(\det(Z))$, where $Z$ is symmetric positive definite.

The final equation,

$f(Z) \approx f(X) + \text{tr}(X^{-1} (Z-X)) - (1/2)\, \text{tr}(X^{-1} (Z-X) X^{-1} (Z-X))$ admittedly looks nothing like the second-order approximation as Boyd defines it earlier.

[image: Boyd's definition of the second-order approximation]

I would have expected the second-order/Hessian term to be written as $f(Z) \approx f(X) + \text{tr}(X^{-1} (Z-X)) - (1/2)\, \text{tr}((Z-X)^T X^{-2} (Z-X))$ instead, which at least somewhat resembles the usual quadratic form.

Why is the original expression the correct second-order term?

And if it is correct, then what is the expression for the Hessian $\nabla^2 f(X)$ of the log-determinant? It is not apparent to me how this term can be extracted from the second-order approximation.
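To make the comparison concrete, here is a quick numerical sketch (my own, not from Boyd) evaluating both candidate second-order terms at a random symmetric positive definite $X$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
X = A @ A.T + 4 * np.eye(4)       # random symmetric positive definite X
D = rng.standard_normal((4, 4))
D = (D + D.T) / 2                 # symmetric perturbation direction

def f(M):
    return np.linalg.slogdet(M)[1]  # log det, computed stably

iX = np.linalg.inv(X)
errs_boyd, errs_mine = [], []
for t in [1e-1, 1e-2, 1e-3]:
    dX = t * D
    lin = f(X) + np.trace(iX @ dX)
    boyd = lin - 0.5 * np.trace(iX @ dX @ iX @ dX)    # Boyd's expansion
    mine = lin - 0.5 * np.trace(dX.T @ iX @ iX @ dX)  # proposed X^{-2} variant
    errs_boyd.append(abs(f(X + dX) - boyd))
    errs_mine.append(abs(f(X + dX) - mine))

print(errs_boyd)  # shrinks like t^3: a genuine second-order expansion
print(errs_mine)  # generically stalls at t^2: the X^{-2} term is not the Hessian
```

The error of a correct second-order expansion must vanish cubically in the step size; only Boyd's trace expression does.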

There are 3 answers below.

Answer 1

$$(\log \det X)'(H)=\mathrm{trace}(X^{-1}H^T),\\ (\log \det X)''(H,K)=-\mathrm{trace}((X^{-1})^TKX^{-1}H^T)$$
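These directional-derivative formulas can be sanity-checked by central finite differences (a sketch of mine; with symmetric $X$, $H$, $K$ the transposes are immaterial):

```python
import numpy as np

rng = np.random.default_rng(1)
sym = lambda M: (M + M.T) / 2
A = rng.standard_normal((5, 5))
X = A @ A.T + 5 * np.eye(5)               # symmetric positive definite
H = sym(rng.standard_normal((5, 5)))
K = sym(rng.standard_normal((5, 5)))

f = lambda M: np.linalg.slogdet(M)[1]
iX = np.linalg.inv(X)
t = 1e-5

# first derivative: (log det X)'(H) = trace(X^{-1} H^T)
fd1 = (f(X + t * H) - f(X - t * H)) / (2 * t)
err1 = abs(fd1 - np.trace(iX @ H.T))

# second derivative: (log det X)''(H, K) = -trace(X^{-T} K X^{-1} H^T)
d1 = lambda M: np.trace(np.linalg.inv(M) @ H.T)
fd2 = (d1(X + t * K) - d1(X - t * K)) / (2 * t)
err2 = abs(fd2 + np.trace(iX.T @ K @ iX @ H.T))

print(err1, err2)  # both tiny (finite-difference accuracy)
```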

Answer 2

$ \def\h{\tfrac12} \def\E{{\cal E}} \def\H{{\cal H}} \def\bR#1{\big(#1\big)} \def\BR#1{\Big[#1\Big]} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\frob#1{\left\| #1 \right\|_F} \def\trace#1{\op{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\hess#1#2#3{\grad{^2 #1}{#2\,\p #3}} \def\Hess#1#2{\grad{^2 #1}{#2^2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} \def\hessLR#1#2#3{\LR{\hess{#1}{#2}{#3}}} \def\HessLR#1#2{\LR{\Hess{#1}{#2}}} \def\X{X^{-1}} \def\XX{X^{-2}} $For a symmetric matrix, you are given a function $f$ and its gradient $G$ $$\eqalign{ f &= \log\det(X), \qquad G &= \grad{f}{X} = X^{-1} \\ }$$ and asked to calculate the Hessian $\H$.

Start with the identity $\X X=I$ and differentiate $$\eqalign{ &d\X X + \X dX = 0 &\qiq d\X = -\X\:dX\:\X \\ &dG = -\X\:dX\:\X &\qiq \H=\grad GX = -\X\:\E\:\X \\ }$$ Note that $\,\c{d\X\ne-\XX\,dX}\:$ and this invalidates your proposed Taylor expansion.
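The inequality $dX^{-1}\ne -X^{-2}\,dX$ is easy to confirm numerically; a small sketch (mine) comparing both candidate differentials of $X^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
X = A @ A.T + 4 * np.eye(4)                  # symmetric positive definite
dX = rng.standard_normal((4, 4))
dX = 1e-6 * (dX + dX.T)                      # small symmetric step

iX = np.linalg.inv(X)
exact = np.linalg.inv(X + dX) - iX           # true change in X^{-1}
good = -iX @ dX @ iX                         # dX^{-1} = -X^{-1} dX X^{-1}
bad = -iX @ iX @ dX                          # the (wrong) -X^{-2} dX

err_good = np.linalg.norm(exact - good)      # O(|dX|^2): correct differential
err_bad = np.linalg.norm(exact - bad)        # O(|dX|): not the differential
print(err_good, err_bad)
```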

To derive Boyd's expansion, read on...

Both $\H\;{\rm and}\;\E$ are fourth-order tensors.
The most important property of $\H$ is the following $$\eqalign{ dG = \H:dX = -\X\,dX\,\X \qiq \H:Y = -\X\,Y\,\X \\ }$$ Using $G\;{\rm and}\;\H$ in a second-order Taylor series yields $$\eqalign{ \def\X{X^{-1}} f(Z) &= f(X)\;+\;\c{G}:\LR{Z-X}\;+\;\h\LR{Z-X}:\c{\H:\LR{Z-X}} \\ &= f(X)\;+\;\c{\X}:\LR{Z-X}\;-\;\h\LR{Z-X}:\CLR{\X\LR{Z-X}\X} \\ &= f(X)\;+\;\trace{\X\LR{Z-X}}\;-\;\h\trace{\LR{Z-X}\X\LR{Z-X}\X} \\ }$$


In the above, a colon $(:)$ denotes the matrix inner product $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ }$$ Since $X$ is symmetric, I ignored transpose operations in my derivation.
When $X$ is not symmetric, one must be more careful.
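For readers unfamiliar with the colon notation, a two-line check (my own) that it coincides with the trace formula:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

colon = np.sum(A * B)             # A:B — elementwise multiply, then sum
via_trace = np.trace(A.T @ B)     # tr(A^T B)
print(abs(colon - via_trace))     # zero up to rounding
```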

When $(x,z)$ are vectors, then the gradient $g$ is a vector and the Hessian $H$ is a matrix.
Upon replacing the matrix inner product with the vector inner product (aka the dot product),
one recovers Boyd's previous formula $$\eqalign{ f(z) &=f(x)\;+\;\c{g}\cdot\LR{z-x}\;+\;\h\LR{z-x}\cdot\c{H\cdot\LR{z-x}}\\ &=f(x)\;+\;g^T\LR{z-x}\;+\;\h\LR{z-x}^TH\LR{z-x} \\ }$$ The last line could be written using the trace function, but it is unnecessary since the terms are scalars.

In fact, I would argue that using the trace function in the matrix case is misleading, since it cannot be generalized. It's unnecessary when the independent variable is a scalar or a vector, and it doesn't work at all when it's a higher-order tensor. It's a one-trick pony that is best forgotten.

Answer 3

The second order formula

$$ \widehat{f}(z) = f(x) + ∇f(x)^T(z-x) + \tfrac{1}{2}(z-x)^T∇^2f(x)(z-x) $$

only really works when $f:ℝ^n → ℝ$. The general Taylor expansion for $f:U→V$ looks like this:

$$\begin{aligned} f(u+∆u) &≈ f(u) + ∂f[u](∆u) + \tfrac{1}{2}∂^2f[u]\big(∆u^{⊗2}\big) \\ \end{aligned}$$

The first derivative at $u$ is, by definition, a linear map $U→V$, i.e. $∂f[u]∈\text{Lin}(U, V)$. The second derivative at $u$ is then, again by definition, a linear map

$$∂^2f[u]∈\text{Lin}(U, \text{Lin}(U, V)) ≅ \text{Lin}(U⊗U,V) ≅ V⊗(U⊗U)^*$$

Now, the evaluation $∂^2f[u]\big((z-u)^{⊗2}\big)$ can be represented as a tensor contraction $H_f(u)⋅(z-u)^{⊗2}$, where the Hessian tensor $H_f(u)∈V⊗(U^{⊗2})^*$ is of type (1,2).

If $V=ℝ$ is scalar, we can express the Taylor expansion in terms of inner products:

$$\begin{aligned} f(u+∆u) &= f(u) + ⟨∇f(u)∣∆u⟩_U + \tfrac{1}{2}⟨∇^2f(u)∣∆u^{⊗2}⟩_{U^{⊗2}} + … \\&= ∑_{k=0}^∞\tfrac{1}{k!} ⟨∇^kf(u)∣∆u^{⊗k}⟩_{U^{⊗k}} \end{aligned}$$

We shall call this the "standard form" of the Taylor expansion.

Thus, your question can be rephrased as:

Why is $⟨∇^2f(X)∣∆X^{⊗2}⟩ = -\operatorname{tr}(X^{-1}{∆X}X^{-1}{∆X})$ for $f(X)=\log\det X$?


For the special case of the log-determinant, we have $V=ℝ$ and $U=ℝ^{n×n}$, which is a Hilbert space equipped with the Frobenius inner product $⟨A∣B⟩= \operatorname{tr}(AᵀB)$. As others already pointed out, $∇f(X) = X^{-1}$, or, equivalently, $∂f[X](∆X)= ⟨X^{-T}∣∆X⟩ = \operatorname{tr}(X^{-1}{∆X})$.

The second order term is thus

$$\begin{aligned} ⟨∇_X⟨∇_Xf(X)∣{∆X}⟩∣{∆X}⟩ &= ⟨∇_X⟨X^{-1}∣{∆X}⟩∣{∆X}⟩_{ℝ^{n×n}} \\&= ⟨-X^{-1}{∆X}X^{-1}∣{∆X}⟩_{ℝ^{n×n}} \\&= -\operatorname{tr}(X^{-1}{∆X}X^{-1}{∆X}) \qquad \text{note: symmetric matrices!} \end{aligned}$$

On the other hand, $h(∆X) = -X^{-1}{∆X}X^{-1}$ is a linear map in ${∆X}$, and thus can be expressed as $H_f(X) ⋅ {∆X}$. Specifically,

$$\begin{aligned} h(∆X)_{mn} &= (-X^{-1}{∆X}X^{-1})_{mn} \\&= ∑_{kl}\Big(-X_{mk}^{-1}X_{ln}^{-1}\Big){∆X}_{kl} \\&= \big(H_f(X) ⋅ {∆X}\big)_{mn} \qquad // \text{ '⋅' is a 2d contraction} \end{aligned}$$

where $\big(H_f(X)\big)_{mnkl} = -X_{mk}^{-1}X_{ln}^{-1}$.

Plugging this back in, we get:

$$\begin{aligned} ⟨∇_X⟨∇_Xf(X)∣{∆X}⟩∣{∆X}⟩ &= ⟨H_f(X) ⋅ {∆X}∣{∆X}⟩_{ℝ^{n×n}} \\&=⟨H_f(X) ∣{∆X}⊗{∆X}⟩_{ℝ^{n×n}⊗ ℝ^{n×n}} && \text{universal property of the tensor product} \end{aligned}$$
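The index formula for the Hessian tensor can be verified directly; a small sketch (mine; the names `Hf` and `dX` are my own) building it with `einsum` and checking the contraction against $-X^{-1}{∆X}X^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3))
X = A @ A.T + 3 * np.eye(3)              # symmetric positive definite
dX = rng.standard_normal((3, 3))
dX = (dX + dX.T) / 2                     # symmetric perturbation

iX = np.linalg.inv(X)
# Hessian tensor with components Hf[m, n, k, l] = -X^{-1}[m, k] X^{-1}[l, n]
Hf = -np.einsum("mk,ln->mnkl", iX, iX)
contracted = np.einsum("mnkl,kl->mn", Hf, dX)   # contract the last two indices

err = np.linalg.norm(contracted - (-iX @ dX @ iX))
print(err)   # zero up to rounding
```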

NOTE: In practice, when evaluating the Taylor expansion one should avoid materializing the derivative tensors, since $H_f$ alone takes $O(n^4)$ memory and operations; backward-mode autodiff is preferable.
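To illustrate that remark: a Hessian-vector product can be formed without ever materializing the $O(n^4)$ tensor, by differentiating through the gradient (a sketch assuming JAX, as in the validation code below):

```python
import jax
import jax.numpy as jnp
from jax import config

config.update("jax_enable_x64", True)

def f(x):
    return jnp.linalg.slogdet(x)[1]

def hvp(x, v):
    # forward-over-reverse: O(n^2) memory, never builds the n^4 Hessian tensor
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

n = 4
a = jax.random.normal(jax.random.PRNGKey(0), (n, n))
x = a @ a.T + n * jnp.eye(n)            # symmetric positive definite
v = jnp.eye(n)                          # symmetric direction

ix = jnp.linalg.inv(x)
closed_form = -ix @ v @ ix              # H : V = -X^{-1} V X^{-1}
err = jnp.linalg.norm(hvp(x, v) - closed_form)
print(err)   # zero up to rounding
```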


Here's some python code that validates the results:


# %%
from functools import partial

import jax
import numpy as np
from jax import config
from jax import numpy as jnp
from scipy.stats import wishart

config.update("jax_enable_x64", True)


# %%
n = 3
X = jnp.array(wishart.rvs(3, scale=jnp.eye(3)))
V = jnp.array(wishart.rvs(3, scale=jnp.eye(3)))
V /= jnp.linalg.norm(V, ord="fro")  # normalize

jnp.linalg.slogdet(X), jnp.linalg.eigvalsh(X)


# %%
def f(x):
    return jnp.linalg.slogdet(x)[-1]


g = jax.grad(f)


def d1(x, dx):
    return jnp.tensordot(g(x), dx)


h = jax.grad(d1)


def d2(x, dx):
    return jnp.tensordot(h(x, dx), dx)


def taylor(x, dx):
    return f(x) + d1(x, dx) + 0.5 * d2(x, dx)


def boyd(x, dx):
    G = jnp.linalg.solve(x, dx)  # X⁻¹⋅∆X
    return f(x) + jnp.trace(G) - 0.5 * jnp.trace(G @ G)


def ours(x, dx):
    iX = jnp.linalg.inv(x)
    g = jnp.tensordot(iX, dx)
    H = -jnp.einsum("ij, kl -> ikjl", iX, iX)
    dx2 = jnp.einsum("ij, kl -> ijkl", dx, dx)
    Hdx2 = jnp.einsum("ijkl, ijkl -> ", H, dx2)
    return f(x) + g + 0.5 * Hdx2


boyd_meth = jax.vmap(partial(boyd, X))
ours_meth = jax.vmap(partial(ours, X))
base_meth = jax.vmap(partial(taylor, X))


# %%
eps = jnp.logspace(-6, -1, 20)
DX = jnp.einsum("..., ij -> ...ij", eps, V)

true_vals = jax.vmap(f)(X + DX)
base_vals = base_meth(DX)
boyd_vals = boyd_meth(DX)
ours_vals = ours_meth(DX)

err_boyd = jnp.abs(true_vals - boyd_vals) / eps**2
err_ours = jnp.abs(true_vals - ours_vals) / eps**2
err_base = jnp.abs(true_vals - base_vals) / eps**2


# %%
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 3), constrained_layout=True)
ax.plot(eps, err_base, "-r", label="taylor", lw=3)
ax.plot(eps, err_boyd, ":k", label="boyd")
ax.plot(eps, err_ours, "*k", label="ours")

ax.legend()
ax.set_xscale("log")
ax.set_yscale("log")
fig.suptitle(r"Relative error $\frac{|f(X+ε∆X) - \hat{f}(ε∆X)|}{‖ε∆X‖_F^2}$ w.r.t. $ε$")
fig.savefig("4852678-logdet-error.png", dpi=300)