Let $L$ be the loss of a linear neural network:
$$L = \sum_j \lVert y_j - W_N\dots W_1 W_0x_j\rVert^2$$
What is $\nabla^2_{W_i}L$, the second derivative of $L$ with respect to $W_i$, for all $i$?
I have no real problem getting the first derivative; the problem arises when I need to take the second derivative, since that means differentiating a matrix with respect to a matrix. I've been reading *Matrix Differential Calculus with Applications in Statistics and Econometrics*, which suggests instead solving for $\nabla^2_{{\rm vec}\,W_i}L$, but I got lost in the weeds of the proofs, so I'm trying to get a handle on it by working through the derivation for the linear-network case.
For convenience, let's introduce the partial product matrices
$$\eqalign{ P_{(i...j)} &= W_iW_{i-1}\cdots W_j \cr A &= P_{(N...0)} \cr B_k &= P_{(N...k+1)} \cr C_k &= P_{(k-1...0)} &\implies A = B_kW_kC_k \cr }$$
Let's also use the colon notation $(:)$ for the trace inner product, e.g.
$$B:C={\rm tr}(B^TC)$$
Now we can write the loss function (ignoring the $j$ subscripts) as
$$L=(Ax-y):(Ax-y)$$
Since only $W_k$ is being varied, $dA = B_k\,dW_k\,C_k$, and the differential and gradient are
$$\eqalign{ dL &= 2(Ax-y):dA\,x \cr &= 2(Ax-y)x^T:B_k\,dW_k\,C_k \cr &= 2B_k^T(Ax-y)x^TC_k^T:dW_k \cr G_k = \frac{\partial L}{\partial W_k} &= 2B_k^T(Ax-y)x^TC_k^T \cr }$$
The Hessian is the gradient of $G_k$, so let's start with its differential, noting that at this stage the only factor containing $W_k$ is $A$:
$$\eqalign{ dG_k &= 2B_k^T\,dA\,xx^TC_k^T \cr &= 2B_k^TB_k\,dW_k\,C_kxx^TC_k^T \cr }$$
But how do you pull the $dW_k$ term out of the middle of a matrix product? That is where vectorization comes in: using ${\rm vec}(AXB)=(B^T\otimes A)\,{\rm vec}(X)$,
$$\eqalign{ {\rm vec}(dG_k) &= 2(C_kxx^TC_k^T\otimes B_k^TB_k)\,{\rm vec}(dW_k) \cr H_k = \frac{\partial{\rm vec}(G_k)}{\partial{\rm vec}(W_k)} &= 2\,C_kxx^TC_k^T\otimes B_k^TB_k \cr }$$
If you're willing to allow a tensor solution, then consider the 4th-order isotropic tensor ${\mathcal E}$ with components
$${\mathcal E}_{ijkm} = \delta_{ik}\,\delta_{jm}$$
This tensor allows matrix products to be rearranged in exactly the way we require:
$$AXB = A\,{\mathcal E}\,B^T:X$$
Applying this to the differential,
$$\eqalign{ dG_k &= 2B_k^TB_k\,dW_k\,C_kxx^TC_k^T \cr &= 2B_k^TB_k\,{\mathcal E}\,C_kxx^TC_k^T:dW_k \cr {\mathcal H}_k = \frac{\partial G_k}{\partial W_k} &= 2B_k^TB_k\,{\mathcal E}\,C_kxx^TC_k^T \cr }$$
So the Hessian is itself a 4th-order tensor.
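The Kronecker-product Hessian is easy to sanity-check numerically. Here is a small sketch (the shapes, the choice $N=2$, and $k=1$ are all made up for illustration) comparing $H_k$ against central differences of $G_k$, using column-major (Fortran-order) `vec` to match the ${\rm vec}(AXB)=(B^T\otimes A)\,{\rm vec}(X)$ convention:

```python
import numpy as np

# Check H_k = 2 (C_k x x^T C_k^T) kron (B_k^T B_k) against central differences
# of G_k.  Shapes and the layer choice k = 1 are illustrative only.
rng = np.random.default_rng(0)
W0, W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((5, 4)), rng.standard_normal((2, 5))
x, y = rng.standard_normal((3, 1)), rng.standard_normal((2, 1))
Bk, Ck = W2, W0                        # for k = 1: B_1 = W_2, C_1 = W_0

def grad(Wk):                          # G_k = 2 B_k^T (A x - y) x^T C_k^T
    r = Bk @ Wk @ Ck @ x - y
    return 2 * Bk.T @ r @ x.T @ Ck.T

Hk = 2 * np.kron(Ck @ x @ x.T @ Ck.T, Bk.T @ Bk)

# Perturb one entry of W_1 and compare the gradient change with H_k acting on vec(dW)
eps = 1e-6
E = np.zeros_like(W1); E[2, 1] = 1.0
fd = (grad(W1 + eps * E) - grad(W1 - eps * E)) / (2 * eps)
hv = (Hk @ E.flatten(order='F')).reshape(W1.shape, order='F')
print(np.max(np.abs(fd - hv)))         # tiny: G_k is linear in W_k, so the
                                       # central difference is exact up to roundoff
```

Since $L$ is quadratic in $W_k$ (holding the other layers fixed), the finite-difference check agrees to machine precision rather than merely to $O(\epsilon^2)$.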
Note the similarity to the vectorized solution: ${\mathcal E}$ has replaced the $\otimes$ symbol and the factors have swapped sides, but otherwise it's the same solution.
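The rearrangement identity $AXB = A\,{\mathcal E}\,B^T:X$ can also be checked numerically once the contractions are written out explicitly. Which index of each factor pairs with which index of ${\mathcal E}$ is a convention choice; the pairing below is one that reproduces the identity (all shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# E_ijkm = delta_ik delta_jm, built on the index space of X
E = np.einsum('ik,jm->ijkm', np.eye(3), np.eye(4))

# (A E B^T) : X -- contract A and B^T into E's first index pair,
# then double-contract the remaining pair with X
lhs = np.einsum('ai,ijkm,bj,km->ab', A, E, B.T, X)
print(np.allclose(lhs, A @ X @ B))     # expect True
```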
Finally, you'll need to slap a $j$-subscript onto $({\mathcal H},G,L,x,y)$ and then do a sum over $j$, in order to solve the original problem. For example, the gradient and Hessian wrt $W_k$ are the sums
$$\eqalign{ G_k &= \sum_j 2B_k^T(Ax_j-y_j)x_j^TC_k^T \cr {\mathcal H}_k &= \sum_j 2B_k^TB_k\,{\mathcal E}\,C_k\,x_jx_j^T\,C_k^T \cr }$$

**Update**
It occurs to me that what I have calculated is $${\mathcal H}_{kk}=\frac{\partial^2L}{\partial W_k\,\partial W_k}$$ which is just the "diagonal" part of the full Hessian $${\mathcal H}_{ki}=\frac{\partial G_k}{\partial W_i}=\frac{\partial^2L}{\partial W_k\,\partial W_i}$$ To calculate the missing components you must also account for the dependence of the $(B_k, C_k)$ factors in $dG_k$ on $W_i$, i.e. terms like $$\frac{\partial B_k}{\partial W_i},\quad\frac{\partial C_k}{\partial W_i}$$ It's doable, but tedious.
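If you only need the off-diagonal blocks numerically (e.g. to test a closed form), central differences of the gradients avoid the tedium. A sketch, with made-up shapes and the block choice $k=1$, $i=0$; symmetry of the full Hessian means ${\mathcal H}_{10}$ must equal ${\mathcal H}_{01}^T$ in vectorized form, which makes a good self-check:

```python
import numpy as np

rng = np.random.default_rng(2)
W0, W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((5, 4)), rng.standard_normal((2, 5))
x, y = rng.standard_normal((3, 1)), rng.standard_normal((2, 1))

def G1(W0, W1):                            # gradient wrt W_1: B_1 = W_2, C_1 = W_0
    r = W2 @ W1 @ W0 @ x - y
    return 2 * W2.T @ r @ x.T @ W0.T

def G0(W0, W1):                            # gradient wrt W_0: B_0 = W_2 W_1, C_0 = I
    r = W2 @ W1 @ W0 @ x - y
    return 2 * (W2 @ W1).T @ r @ x.T

def cross_block(G, W, rebuild):            # dvec(G)/dvec(W) by central differences
    eps, cols = 1e-6, []
    for c in range(W.size):
        dW = np.zeros(W.size); dW[c] = 1.0
        dW = dW.reshape(W.shape, order='F')
        dG = (G(*rebuild(W + eps*dW)) - G(*rebuild(W - eps*dW))) / (2*eps)
        cols.append(dG.flatten(order='F'))
    return np.stack(cols, axis=1)

H10 = cross_block(G1, W0, lambda V: (V, W1))   # dvec(G_1)/dvec(W_0)
H01 = cross_block(G0, W1, lambda V: (W0, V))   # dvec(G_0)/dvec(W_1)
print(np.allclose(H10, H01.T, atol=1e-5))      # expect True: Hessian blocks are transposes
```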