On p. 251 of Bishop's *Pattern Recognition and Machine Learning*, the Hessian of the sum-of-squares error is derived (as a preliminary step to the outer-product approximation):
$ E = \frac{1}{2} \sum_{n=1}^{N} (y_n - t_n)^2$
$H = \nabla \nabla E = \sum_{n=1}^{N} \nabla y_n (\nabla y_n)^T + \sum_{n=1}^{N} (y_n - t_n) \nabla \nabla y_n $
Firstly, why is the Hessian not given by $\nabla \nabla ^T E$?
Secondly, could someone please explain how the full expression for the Hessian is obtained?
Let $y=[y_1,\cdots,y_N]^T$ and $t=[t_1,\cdots,t_N]^T\in\mathbb{R}^N$ collect the outputs and targets, and regard $y$ as a function of the parameter vector $z=[z_1,\cdots,z_p]^T\in\mathbb{R}^p$, so that $z\mapsto f(z)=y\mapsto E(y)=\frac{1}{2}(y-t)^T(y-t)$; let $g=E\circ f$.
By the chain rule, the first differential of $g$ at $z$ is $Dg_z:u\in \mathbb{R}^p\mapsto (y-t)^TDf_z(u)\in\mathbb{R}$, since $DE_y(h)=(y-t)^Th$.
Differentiating once more with the product rule (one term from the factor $(y-t)^T$, one from the factor $Df_z(u)$) gives the second differential $D^2g_z:(u,v)\in (\mathbb{R}^p)^2\mapsto Df_z(v)^TDf_z(u)+(y-t)^TD^2f_z(u,v)\in\mathbb{R}$, that is, for every $i,j$:
$\dfrac{\partial^2g}{\partial z_i\partial z_j}=\left[\dfrac{\partial y_1}{\partial z_i},\cdots,\dfrac{\partial y_N}{\partial z_i}\right]\left[\dfrac{\partial y_1}{\partial z_j},\cdots,\dfrac{\partial y_N}{\partial z_j}\right]^T+\left[\dfrac{\partial^2 y_1}{\partial z_i\partial z_j},\cdots,\dfrac{\partial^2 y_N}{\partial z_i\partial z_j}\right](y-t)=\sum_{n=1}^{N}\dfrac{\partial y_n}{\partial z_i}\dfrac{\partial y_n}{\partial z_j}+\sum_{n=1}^{N}(y_n-t_n)\dfrac{\partial^2 y_n}{\partial z_i\partial z_j},$
which is exactly the $(i,j)$ entry of Bishop's expression $\sum_n \nabla y_n (\nabla y_n)^T+\sum_n (y_n-t_n)\nabla\nabla y_n$, with $z$ playing the role of the weight vector.
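As a quick numerical sanity check of this identity (my own sketch, not from Bishop or the question), one can compare the exact Hessian of $g=E\circ f$ with the two-term expression using automatic differentiation on an arbitrary toy model; the function `f`, the matrix `A`, and all the numbers below are made up purely for illustration.

```python
import jax
import jax.numpy as jnp

# Toy smooth model f: R^p -> R^N, standing in for the network outputs y_n(z).
p, N = 3, 4
A = jnp.arange(1.0, 1.0 + N * p).reshape(N, p) / 10.0   # fixed, arbitrary weights

def f(z):
    return jnp.tanh(A @ z)                  # y = f(z), shape (N,)

t = jnp.array([0.1, -0.2, 0.3, 0.4])        # targets t_n

def g(z):
    r = f(z) - t                            # residuals y_n - t_n
    return 0.5 * jnp.dot(r, r)              # E = (1/2) * sum_n (y_n - t_n)^2

z0 = jnp.array([0.5, -1.0, 2.0])

# Left-hand side: exact Hessian of g at z0.
H_exact = jax.hessian(g)(z0)                # shape (p, p)

# Right-hand side: J^T J + sum_n (y_n - t_n) * Hessian(y_n).
J = jax.jacobian(f)(z0)                     # J[n, i] = dy_n / dz_i, shape (N, p)
Hy = jax.hessian(f)(z0)                     # Hy[n, i, j] = d^2 y_n / dz_i dz_j
r = f(z0) - t
H_formula = J.T @ J + jnp.einsum('n,nij->ij', r, Hy)

print(jnp.max(jnp.abs(H_exact - H_formula)))        # tiny (floating-point rounding only)
print(jnp.allclose(H_exact, H_formula, atol=1e-5))   # True
```

Dropping the second term in `H_formula` gives the outer-product approximation $H\approx\sum_n \nabla y_n (\nabla y_n)^T$ that Bishop derives next on the same page.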