Consider the one-layer neural network $y=\mathbf{w}^T\mathbf{x} +b$ and the optimization objective $J(\mathbf{w}) = \mathbb{E}\left[ \frac12 (1-y\cdot t)^2 \right]$ where $t\in\{-1,1\}$ is the label of our data point. I am asked to compute the Hessian of $J$ at the current location $\mathbf{w}$ in parameter space. I know that the correct solution is $H=\frac{\partial^2J}{\partial \mathbf{w}^2} = \mathbb{E}\left[ \mathbf{x}\mathbf{x}^T \right]$.
I am having trouble arriving at this exact formulation because of how differentiation with respect to row/column vectors works. My attempt goes as follows. We first determine the first derivative.
\begin{align*} &\frac{\partial \mathbb{E}\left[ \frac12 \left( 1 - yt \right)^2 \right]}{\partial \mathbf{w}}\\ &= \mathbb{E}\left[ \frac{\partial \frac12 \left( 1 - yt \right)^2 }{\partial \mathbf{w}} \right]\\ &= \mathbb{E}\left[ \frac{\partial \frac12 \left( 1 - yt \right)^2 }{\partial y} \frac{\partial y}{\partial \mathbf{w}} \right]\\ &= \mathbb{E}\left[ -t\cdot(1-yt) \frac{\partial \mathbf{w}^T\mathbf{x}+b}{\partial \mathbf{w}} \right]\\ &= \mathbb{E}\left[ -t\cdot(1-yt) \mathbf{x} \right]\\ \end{align*}
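As a sanity check on this gradient (not part of the derivation), the expression $-t(1-yt)\,\mathbf{x}$ can be compared against central finite differences of the loss for a single sample; all variable names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w, b = rng.normal(size=d), 0.3
x, t = rng.normal(size=d), -1.0  # single sample, label in {-1, 1}

def loss(w):
    # J for one sample: (1/2) * (1 - y*t)^2 with y = w^T x + b
    y = w @ x + b
    return 0.5 * (1 - y * t) ** 2

# analytic gradient from the chain rule above: -t * (1 - y*t) * x
y = w @ x + b
grad_analytic = -t * (1 - y * t) * x

# central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(d)
])

assert np.allclose(grad_analytic, grad_fd, atol=1e-6)
```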
Note that $\mathbf{x}\in\mathbb{R}^d$ is, I believe, considered a column vector, and according to The Matrix Cookbook, $\frac{\partial (\mathbf{w}^T\mathbf{x}+b)}{\partial \mathbf{w}} = \mathbf{x}$, not $\mathbf{x}^T$ (though I have found sources saying otherwise...).
We now differentiate this again, in order to derive the Hessian. \begin{align*} &\frac{\partial \mathbb{E}\left[ -t\cdot(1-yt) \mathbf{x} \right]}{\partial \mathbf{w}}\\ &= \mathbb{E}\left[ \frac{\partial -t(1-yt)\mathbf{x}}{\partial y} \frac{\partial \mathbf{w}^T\mathbf{x}+b}{\partial \mathbf{w}} \right]\\ &= \mathbb{E}\left[ \frac{\partial -t(1-yt)\mathbf{x}}{\partial y} \mathbf{x} \right]\\ &= \mathbb{E}\left[ \frac{\partial (-t+yt^2)\mathbf{x}}{\partial y} \mathbf{x} \right]\\ &= \mathbb{E}\left[ \underbrace{t^2}_{= 1} \mathbf{x} \mathbf{x} \right]\\ &= \mathbb{E}\left[ \mathbf{x} \mathbf{x} \right]\\ &\neq \mathbb{E}\left[ \mathbf{x}\mathbf{x}^T \right] \end{align*}
So here we have a column vector times a column vector, which is not defined. I do not know where the transpose should come from, though. I tried deriving the whole thing again under the assumption that $\frac{\partial (\mathbf{w}^T\mathbf{x}+b)}{\partial \mathbf{w}} = \mathbf{x}^T$ instead of $\mathbf{x}$. Then we either get $\mathbb{E}\left[ \mathbf{x}^T\mathbf{x}^T \right]$, similar to before, or, if we assume that the derivative of a row vector w.r.t. a scalar is a column vector, we get $\mathbb{E}\left[ \mathbf{x}\mathbf{x}^T \right]$, which is what we want. However, this would be a very strange assumption: why would the derivative of a row vector w.r.t. a scalar be a column vector? Furthermore, it contradicts The Matrix Cookbook, which says that $\frac{\partial (\mathbf{w}^T\mathbf{x}+b)}{\partial \mathbf{w}} = \mathbf{x}$.
I would be very glad for help here. Where did I go wrong? Which assumptions of row/column vectors are correct? Thank you so much for your help!
Lastly, I found an alternative way of solving it that avoids the whole question of whether or not to transpose, by simply inserting the definition of the prediction $y$. Still, I would like to know where the issue in my solution above lies.
\begin{align*} &\frac{\partial \mathbb{E}\left[ -t\cdot(1-yt) \mathbf{x} \right]}{\partial \mathbf{w}}\\ &= \frac{\partial \mathbb{E}\left[ -t\mathbf{x} + t^2y\mathbf{x} \right]}{\partial \mathbf{w}}\\ &= \frac{\partial \mathbb{E}\left[ -t\mathbf{x} + t^2(\mathbf{w}^T\mathbf{x} + b)\mathbf{x} \right]}{\partial \mathbf{w}}\\ &= \frac{\partial \mathbb{E}\left[ -t\mathbf{x} + t^2(\mathbf{w}^T\mathbf{x})\mathbf{x} + t^2 b\mathbf{x} \right]}{\partial \mathbf{w}}\\ &= \frac{\partial \mathbb{E}\left[ -t\mathbf{x} + t^2\mathbf{x}\mathbf{x}^T\mathbf{w} + t^2 b\mathbf{x} \right]}{\partial \mathbf{w}}\\ &= \mathbb{E}\left[ \underbrace{t^2}_{= 1} \mathbf{x}\mathbf{x}^T \right]\\ &= \mathbb{E}\left[ \mathbf{x}\mathbf{x}^T \right]\\ \end{align*}
Forgetting the expectation operator for a while, the differential of the cost function is $$ d\phi = t(t\cdot y-1)\,dy = (y-t)\, \mathbf{x}^Td\mathbf{w} $$ since $t^2=1$ and $dy=\mathbf{x}^Td\mathbf{w}$.
The gradient is $\mathbf{g} = (y-t) \mathbf{x} $
It follows that $d\mathbf{g} = dy\cdot \mathbf{x} = (\mathbf{x} \mathbf{x}^T) d\mathbf{w} $ which gives the Hessian $\mathbf{H}=\mathbf{x} \mathbf{x}^T$.
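The single-sample result $\mathbf{H}=\mathbf{x}\mathbf{x}^T$ (and hence $H=\mathbb{E}\left[\mathbf{x}\mathbf{x}^T\right]$ once the expectation is restored) can be confirmed numerically by finite-differencing the gradient $\mathbf{g}=(y-t)\,\mathbf{x}$; a minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
w, b = rng.normal(size=d), -0.5
x, t = rng.normal(size=d), 1.0  # single sample, label in {-1, 1}

def grad(w):
    # gradient g = (y - t) * x, with y = w^T x + b
    y = w @ x + b
    return (y - t) * x

# finite-difference Hessian: column j approximates d(g)/d(w_j)
eps = 1e-6
H_fd = np.column_stack([
    (grad(w + eps * e) - grad(w - eps * e)) / (2 * eps)
    for e in np.eye(d)
])

H_analytic = np.outer(x, x)  # x x^T
assert np.allclose(H_fd, H_analytic, atol=1e-6)
```

Note that the Hessian here is independent of $\mathbf{w}$, $b$, and $t$, as the analytic result predicts.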