Computing Hessian using matrix notation efficiently


I answered this question, but I'd like to understand more details about the matrix notation behind it (hence this separate post). We have $f:\Bbb R^n\to \Bbb R$ given by $$f(\theta) \doteq \alpha e^{-\beta \theta^\top\theta}.$$ We want to compute the bilinear map ${\rm Hess}\, f(\theta)$. Since I recognize $g(\theta)\doteq\theta^\top \theta$ as $\langle \theta,\theta\rangle$ (where $\langle \cdot,\cdot\rangle$ denotes the usual scalar product), I see that $$Dg(\theta) = 2\langle \theta, \cdot \rangle = 2\theta^\top,$$ and hence $\nabla g(\theta) = 2\theta$. The chain rule then gives $$\nabla f(\theta) = -2\alpha \beta e^{-\beta \theta^\top \theta}\,\theta,$$ as the OP of the linked question states, so far so good.

I'm having trouble doing something similar to check that $${\rm Hess}f (\theta)=2\alpha \beta e^{-\beta \theta^\top\theta}(2\beta \color{blue}{\theta\theta^\top}-{\rm Id}_n).$$I do not want to use components as I did there.

A simple attempt is to use the product rule together with ${\rm d}\theta ={\rm Id}_n $. Differentiating the expression for $\nabla f (\theta) $ we get $$-2\alpha\beta (e^{-\beta\theta^\top\theta}(-2\beta \theta^\top)\theta +e^{-\beta \theta^\top\theta}{\rm Id}_n) = 2\alpha \beta e^{-\beta \theta^\top\theta}(2\beta\color{red}{\theta^\top\theta}-{\rm Id}_n), $$but this doesn't compile: the red term $\color{red}{\theta^\top\theta}$ is a scalar where the blue $\color{blue}{\theta\theta^\top}$ above is an $n\times n$ matrix, and I can't see why the order comes out wrong.

So I'd like to know exactly what identification am I missing here. I also recognize $\theta\theta^\top$ as the matrix of the bilinear map $\theta \otimes \theta$, and I'm comfortable with tensor products, so you can come in with guns blazing, if needed.
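(Not part of the original question, but the claimed closed form is easy to sanity-check numerically. The sketch below compares the blue formula against second-order central finite differences of $f$; the names `alpha`, `beta`, `theta` mirror the post, the numeric values are arbitrary.)

```python
import numpy as np

# Numerical check of the claimed Hessian
#   Hess f(theta) = 2*alpha*beta*exp(-beta*theta^T theta) * (2*beta*theta theta^T - I)
# against central finite differences of f. Values are arbitrary test data.

rng = np.random.default_rng(0)
n, alpha, beta = 4, 1.3, 0.7
theta = rng.normal(size=n)

f = lambda t: alpha * np.exp(-beta * (t @ t))

# Claimed closed form (the "blue" theta theta^T version)
H_claimed = 2 * alpha * beta * np.exp(-beta * (theta @ theta)) * (
    2 * beta * np.outer(theta, theta) - np.eye(n)
)

# Second-order central differences, entry by entry
eps = 1e-5
H_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        e_i, e_j = np.eye(n)[i], np.eye(n)[j]
        H_fd[i, j] = (
            f(theta + eps * e_i + eps * e_j)
            - f(theta + eps * e_i - eps * e_j)
            - f(theta - eps * e_i + eps * e_j)
            + f(theta - eps * e_i - eps * e_j)
        ) / (4 * eps**2)

print(np.max(np.abs(H_fd - H_claimed)))  # should be small (finite-difference error)
```

This confirms the outer-product version numerically; the scalar $\theta^\top\theta$ version would not even have the right shape once multiplied out.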

Thanks.

There are 2 answers below.

BEST ANSWER

Although you've already used $g$, I'd like to reuse it here to denote the gradient, i.e. $g=\nabla f$.

Find the differential of the gradient, then the hessian $$\eqalign{ g &= -2\beta f\theta \cr dg &= -2\beta(\theta\,df+f\,d\theta) \cr &= -2\beta(\theta g^Td\theta+fI\,d\theta) \cr H=\frac{\partial g}{\partial\theta} &= -2\beta\,(\theta g^T+fI) \cr &= 2\beta\,\,\big(\theta(2\beta f\theta)^T-fI\big) \cr &= 2f\beta\,\,(2\beta\,\theta\theta^T-I) \cr }$$ As expected, this is your result but with the change $$\theta^T\theta \implies \theta\theta^T$$
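(A throwaway check, not part of the original answer: the derivation above can be verified by differencing the gradient $g=-2\beta f\theta$ directly. Names follow the answer; test values are arbitrary.)

```python
import numpy as np

# Check H = 2*f*beta*(2*beta*theta theta^T - I) against central
# finite differences of the gradient g = -2*beta*f*theta.

rng = np.random.default_rng(1)
n, alpha, beta = 5, 0.9, 1.1
theta = rng.normal(size=n)

f = lambda t: alpha * np.exp(-beta * (t @ t))
g = lambda t: -2 * beta * f(t) * t          # gradient from the question

H = 2 * f(theta) * beta * (2 * beta * np.outer(theta, theta) - np.eye(n))

# Jacobian of g by central differences: column j is dg/dtheta_j
eps = 1e-6
J = np.column_stack([
    (g(theta + eps * np.eye(n)[j]) - g(theta - eps * np.eye(n)[j])) / (2 * eps)
    for j in range(n)
])
print(np.max(np.abs(J - H)))  # should be small
```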

ANSWER

I don't know exactly what you mean by "using matrix notation efficiently", since there is no matrix in the post (:P), but I think it is something along the following lines.

As you have concluded, $g(x):=\nabla f(x)=-2\alpha \beta e^{-\beta\langle x,x \rangle}x$. By the product rule, $$(\mathrm{Hess}_x f)(h)=g'_x(h)=-2\alpha \beta e^{-\beta\langle x,x\rangle}h+2\alpha \beta e^{-\beta \langle x,x\rangle}2\beta\langle x,h \rangle x$$ $$=2\alpha \beta e^{-\beta \langle x, x\rangle}(2\beta\langle x,h\rangle x-h). $$ So we have to understand the linear maps $$h \mapsto 2\beta \langle x, h \rangle x$$ and $$h \mapsto h.$$ The latter is obvious: it is $\mathrm{Id}$. The former is precisely $2\beta\, x^* \otimes x$, where $x^*=\langle x, \cdot\rangle$, under the canonical identification $\mathrm{Hom}(V,W) \leftrightarrow V^*\otimes W.$ In matrix terms, note that $(x^* \otimes x)(e_i)=x^*(e_i)\,x=x_ix,$ which says precisely that $x^* \otimes x=x x^\top$ (of course, you could do this computation without using the fact that the map can be represented by $2\beta\, x^* \otimes x$, but I think this makes things clearer).
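(A small addendum, not in the original answer: the identification $x^*\otimes x = xx^\top$ can be checked by building the matrix of $h\mapsto 2\beta\langle x,h\rangle x$ column by column on the standard basis. Names are hypothetical test values.)

```python
import numpy as np

# The identification (x* ⊗ x)(e_i) = x_i x, i.e. matrix x x^T:
# assemble the matrix of h -> 2*beta*<x,h>*x from its action on e_1,...,e_n.

rng = np.random.default_rng(2)
n, beta = 4, 0.5
x = rng.normal(size=n)

L = lambda h: 2 * beta * (x @ h) * x        # the linear map from the answer

# Column i of the matrix is L(e_i)
M = np.column_stack([L(np.eye(n)[i]) for i in range(n)])
print(np.allclose(M, 2 * beta * np.outer(x, x)))  # True
```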