Let $L$ be the loss of a linear neural network:
$$L = \sum_j \lVert y_j - W_N\dots W_1 W_0x_j\rVert^2$$
What is $\nabla^2_{W_i}L$, the second derivative of $L$ with respect to $W_i$, for all $i$?
I have no real problem getting the first derivative; the problem arises when I need to take the second derivative, since that means differentiating a matrix with respect to a matrix. I've been reading *Matrix Differential Calculus with Applications in Statistics and Econometrics*, which suggests instead solving for $\nabla^2_{{\rm vec}\,W_i}L$, but I got lost in the weeds of the proofs, so I'm trying to get a handle on it by working through the derivation for the linear-network case.
For convenience, let's introduce the partial product matrices
$$\eqalign{ P_{(i...j)} &= W_iW_{i-1}\cdots W_j \cr A &= P_{(N...0)} \cr B_k &= P_{(N...k+1)} \cr C_k &= P_{(k-1...0)} &\implies A = B_kW_kC_k \cr }$$
Let's also use the colon notation $(:)$ for the trace inner product, e.g.
$$B:C={\rm tr}(B^TC)$$
Now we can write the loss function (ignoring the $j$ subscripts) as
$$L=(Ax-y):(Ax-y)$$
Since only $W_k$ is being varied, $dA = B_k\,dW_k\,C_k$, and the differential and gradient are
$$\eqalign{ dL &= 2(Ax-y):dA\,x \cr &= 2(Ax-y)x^T:B_k\,dW_k\,C_k \cr &= 2B_k^T(Ax-y)x^TC_k^T:dW_k \cr G_k = \frac{\partial L}{\partial W_k} &= 2B_k^T(Ax-y)x^TC_k^T \cr }$$
The Hessian is the gradient of $G_k$, so let's start with its differential, noting that at this stage the only factor containing $W_k$ is $A$:
$$\eqalign{ dG_k &= 2B_k^T\,dA\,xx^TC_k^T \cr &= 2B_k^TB_k\,dW_k\,C_kxx^TC_k^T \cr }$$
But how do you pull the $dW_k$ term out of the middle of a matrix product? That is where vectorization comes in: using ${\rm vec}(AXB)=(B^T\otimes A)\,{\rm vec}(X)$,
$$\eqalign{ {\rm vec}(dG_k) &= 2(C_kxx^TC_k^T\otimes B_k^TB_k)\,{\rm vec}(dW_k) \cr H_k = \frac{\partial{\rm vec}(G_k)}{\partial{\rm vec}(W_k)} &= 2\,C_kxx^TC_k^T\otimes B_k^TB_k \cr }$$
If you're willing to allow a tensor solution, then consider the 4th-order isotropic tensor ${\mathcal E}$ with components
$${\mathcal E}_{ijkm} = \delta_{ik}\,\delta_{jm}$$
This tensor allows matrix products to be rearranged in exactly the way we require:
$$AXB = A\,{\mathcal E}\,B^T:X$$
Applying this to the differential,
$$\eqalign{ dG_k &= 2B_k^TB_k\,dW_k\,C_kxx^TC_k^T \cr &= 2B_k^TB_k\,{\mathcal E}\,C_kxx^TC_k^T:dW_k \cr {\mathcal H}_k = \frac{\partial G_k}{\partial W_k} &= 2B_k^TB_k\,{\mathcal E}\,C_kxx^TC_k^T \cr }$$
So the Hessian is itself a 4th-order tensor.
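The Kronecker-product Hessian is easy to sanity-check numerically. Here is a small sketch (the shapes, the choice $N=2$, and $k=1$ are all made up for illustration) comparing $H_k$ against central differences of $G_k$, using column-major (Fortran-order) `vec` to match the ${\rm vec}(AXB)=(B^T\otimes A)\,{\rm vec}(X)$ convention:

```python
import numpy as np

# Check H_k = 2 (C_k x x^T C_k^T) kron (B_k^T B_k) against central differences
# of G_k.  Shapes and the layer choice k = 1 are illustrative only.
rng = np.random.default_rng(0)
W0, W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((5, 4)), rng.standard_normal((2, 5))
x, y = rng.standard_normal((3, 1)), rng.standard_normal((2, 1))
Bk, Ck = W2, W0                        # for k = 1: B_1 = W_2, C_1 = W_0

def grad(Wk):                          # G_k = 2 B_k^T (A x - y) x^T C_k^T
    r = Bk @ Wk @ Ck @ x - y
    return 2 * Bk.T @ r @ x.T @ Ck.T

Hk = 2 * np.kron(Ck @ x @ x.T @ Ck.T, Bk.T @ Bk)

# Perturb one entry of W_1 and compare the gradient change with H_k acting on vec(dW)
eps = 1e-6
E = np.zeros_like(W1); E[2, 1] = 1.0
fd = (grad(W1 + eps * E) - grad(W1 - eps * E)) / (2 * eps)
hv = (Hk @ E.flatten(order='F')).reshape(W1.shape, order='F')
print(np.max(np.abs(fd - hv)))         # tiny: G_k is linear in W_k, so the
                                       # central difference is exact up to roundoff
```

Since $L$ is quadratic in $W_k$ (holding the other layers fixed), the finite-difference check agrees to machine precision rather than merely to $O(\epsilon^2)$.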
Note the similarity to the vectorized solution: ${\mathcal E}$ has replaced the $\otimes$ symbol and the factors have swapped sides, but otherwise it's the same solution.
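The rearrangement identity $AXB = A\,{\mathcal E}\,B^T:X$ can also be checked numerically once the contractions are written out explicitly. Which index of each factor pairs with which index of ${\mathcal E}$ is a convention choice; the pairing below is one that reproduces the identity (all shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# E_ijkm = delta_ik delta_jm, built on the index space of X
E = np.einsum('ik,jm->ijkm', np.eye(3), np.eye(4))

# (A E B^T) : X -- contract A and B^T into E's first index pair,
# then double-contract the remaining pair with X
lhs = np.einsum('ai,ijkm,bj,km->ab', A, E, B.T, X)
print(np.allclose(lhs, A @ X @ B))     # expect True
```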
Finally, you'll need to slap a $j$-subscript onto $({\mathcal H},G,L,x,y)$ and then do a sum over $j$, in order to solve the original problem. For example, the gradient and Hessian wrt $W_k$ are the sums
$$\eqalign{ G_k &= \sum_j 2B_k^T(Ax_j-y_j)x_j^TC_k^T \cr {\mathcal H}_k &= \sum_j 2B_k^TB_k\,{\mathcal E}\,C_k\,x_jx_j^T\,C_k^T \cr }$$

**Update**
It occurs to me that what I have calculated is $${\mathcal H}_{kk}=\frac{\partial^2L}{\partial W_k\,\partial W_k}$$ which is just the "diagonal" part of the full Hessian $${\mathcal H}_{ki}=\frac{\partial G_k}{\partial W_i}=\frac{\partial^2L}{\partial W_k\,\partial W_i}$$ To calculate the missing components you must also account for the dependence of the $(B_k, C_k)$ factors in $dG_k$ on $W_i$, i.e. terms like $$\frac{\partial B_k}{\partial W_i},\quad\frac{\partial C_k}{\partial W_i}$$ It's doable, but tedious.
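If you only need the off-diagonal blocks numerically (e.g. to test a closed form), central differences of the gradients avoid the tedium. A sketch, with made-up shapes and the block choice $k=1$, $i=0$; symmetry of the full Hessian means ${\mathcal H}_{10}$ must equal ${\mathcal H}_{01}^T$ in vectorized form, which makes a good self-check:

```python
import numpy as np

rng = np.random.default_rng(2)
W0, W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((5, 4)), rng.standard_normal((2, 5))
x, y = rng.standard_normal((3, 1)), rng.standard_normal((2, 1))

def G1(W0, W1):                            # gradient wrt W_1: B_1 = W_2, C_1 = W_0
    r = W2 @ W1 @ W0 @ x - y
    return 2 * W2.T @ r @ x.T @ W0.T

def G0(W0, W1):                            # gradient wrt W_0: B_0 = W_2 W_1, C_0 = I
    r = W2 @ W1 @ W0 @ x - y
    return 2 * (W2 @ W1).T @ r @ x.T

def cross_block(G, W, rebuild):            # dvec(G)/dvec(W) by central differences
    eps, cols = 1e-6, []
    for c in range(W.size):
        dW = np.zeros(W.size); dW[c] = 1.0
        dW = dW.reshape(W.shape, order='F')
        dG = (G(*rebuild(W + eps*dW)) - G(*rebuild(W - eps*dW))) / (2*eps)
        cols.append(dG.flatten(order='F'))
    return np.stack(cols, axis=1)

H10 = cross_block(G1, W0, lambda V: (V, W1))   # dvec(G_1)/dvec(W_0)
H01 = cross_block(G0, W1, lambda V: (W0, V))   # dvec(G_0)/dvec(W_1)
print(np.allclose(H10, H01.T, atol=1e-5))      # expect True: Hessian blocks are transposes
```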