Differentiating row vector vs column vector when deriving the Hessian


Consider the one-layer neural network $y=\mathbf{w}^T\mathbf{x} +b$ and the optimization objective $J(\mathbf{w}) = \mathbb{E}\left[ \frac12 (1-y\cdot t)^2 \right]$ where $t\in\{-1,1\}$ is the label of our data point. I am asked to compute the Hessian of $J$ at the current location $\mathbf{w}$ in parameter space. I know that the correct solution is $H=\frac{\partial^2J}{\partial \mathbf{w}^2} = \mathbb{E}\left[ \mathbf{x}\mathbf{x}^T \right]$.

I am having trouble arriving at exactly this formulation because of how differentiation with respect to row/column vectors works. My solution goes as follows. We first determine the first derivative:

\begin{align*} &\frac{\partial \mathbb{E}\left[ \frac12 \left( 1 - yt \right)^2 \right]}{\partial \mathbf{w}}\\ &= \mathbb{E}\left[ \frac{\partial \frac12 \left( 1 - yt \right)^2 }{\partial \mathbf{w}} \right]\\ &= \mathbb{E}\left[ \frac{\partial \frac12 \left( 1 - yt \right)^2 }{\partial y} \frac{\partial y}{\partial \mathbf{w}} \right]\\ &= \mathbb{E}\left[ -t\cdot(1-yt) \frac{\partial \mathbf{w}^T\mathbf{x}+b}{\partial \mathbf{w}} \right]\\ &= \mathbb{E}\left[ -t\cdot(1-yt) \mathbf{x} \right]\\ \end{align*}

Note that I think $\mathbf{x}\in\mathbb{R}^d$ is considered a column vector, and according to the Matrix Cookbook, $\frac{\partial (\mathbf{w}^T\mathbf{x}+b)}{\partial \mathbf{w}} = \mathbf{x}$, not $\mathbf{x}^T$ (though I have found sources saying otherwise...)
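As a sanity check on the gradient expression $\mathbb{E}\left[-t(1-yt)\,\mathbf{x}\right]$, here is a small numeric sketch (single sample, no expectation; the dimension and all values are arbitrary assumptions) comparing the analytic gradient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w = rng.standard_normal(d)   # current parameters (arbitrary)
x = rng.standard_normal(d)   # data point (arbitrary)
b = 0.5
t = 1.0                      # label in {-1, 1}

def loss(w):
    # J for a single sample: 0.5 * (1 - y*t)^2 with y = w.x + b
    y = w @ x + b
    return 0.5 * (1 - y * t) ** 2

# Analytic gradient from the derivation above: -t * (1 - y*t) * x
y = w @ x + b
grad = -t * (1 - y * t) * x

# Central finite differences along each coordinate direction
eps = 1e-6
fd = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
               for e in np.eye(d)])
```

The two vectors agree to numerical precision, so the scalar factor and the choice $\frac{\partial (\mathbf{w}^T\mathbf{x}+b)}{\partial \mathbf{w}} = \mathbf{x}$ are at least consistent with the loss.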

We now differentiate this again, in order to derive the Hessian. \begin{align*} &\frac{\partial \mathbb{E}\left[ -t\cdot(1-yt) \mathbf{x} \right]}{\partial \mathbf{w}}\\ &= \mathbb{E}\left[ \frac{\partial -t(1-yt)\mathbf{x}}{\partial y} \frac{\partial \mathbf{w}^T\mathbf{x}+b}{\partial \mathbf{w}} \right]\\ &= \mathbb{E}\left[ \frac{\partial -t(1-yt)\mathbf{x}}{\partial y} \mathbf{x} \right]\\ &= \mathbb{E}\left[ \frac{\partial (-t+yt^2)\mathbf{x}}{\partial y} \mathbf{x} \right]\\ &= \mathbb{E}\left[ \underbrace{t^2}_{= 1} \mathbf{x} \mathbf{x} \right]\\ &= \mathbb{E}\left[ \mathbf{x} \mathbf{x} \right]\\ &\neq \mathbb{E}\left[ \mathbf{x}\mathbf{x}^T \right] \end{align*}

So here, we have column vector times column vector, which is not really defined. I do not know where to get the transpose from, though. I tried deriving the whole thing again under the assumption that $\frac{\partial (\mathbf{w}^T\mathbf{x}+b)}{\partial \mathbf{w}} = \mathbf{x}^T$ instead of $\mathbf{x}$. Then we either get $\mathbb{E}\left[ \mathbf{x}^T\mathbf{x}^T \right]$, similar to before, or, if we assume that the derivative of a row vector w.r.t. a scalar is a column vector, we get $\mathbb{E}\left[ \mathbf{x}\mathbf{x}^T \right]$, which is what we want. However, this would be a very weird assumption to me, because why would the derivative of a row vector w.r.t. a scalar be a column vector? Furthermore, it contradicts the Matrix Cookbook, which says that $\frac{\partial (\mathbf{w}^T\mathbf{x}+b)}{\partial \mathbf{w}} = \mathbf{x}$.

I would be very glad for help here. Where did I go wrong? Which assumptions of row/column vectors are correct? Thank you so much for your help!

Last, I found an alternative way of solving it where you don't have that issue of transposing or not, by just inserting the definition of the prediction $y$, but I still would like to know where the issue in my solution above lies.

\begin{align*} &\frac{\partial \mathbb{E}\left[ -t\cdot(1-yt) \mathbf{x} \right]}{\partial \mathbf{w}}\\ &= \frac{\partial \mathbb{E}\left[ -t\mathbf{x} + t^2y\mathbf{x} \right]}{\partial \mathbf{w}}\\ &= \frac{\partial \mathbb{E}\left[ -t\mathbf{x} + t^2(\mathbf{w}^T\mathbf{x} + b)\mathbf{x} \right]}{\partial \mathbf{w}}\\ &= \frac{\partial \mathbb{E}\left[ -t\mathbf{x} + t^2(\mathbf{w}^T\mathbf{x})\mathbf{x} + t^2 b\mathbf{x} \right]}{\partial \mathbf{w}}\\ &= \frac{\partial \mathbb{E}\left[ -t\mathbf{x} + t^2\mathbf{x}\mathbf{x}^T\mathbf{w} + t^2 b\mathbf{x} \right]}{\partial \mathbf{w}}\\ &= \mathbb{E}\left[ \underbrace{t^2}_{= 1} \mathbf{x}\mathbf{x}^T \right]\\ &= \mathbb{E}\left[ \mathbf{x}\mathbf{x}^T \right]\\ \end{align*}
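Whichever route one takes, the target Hessian can be checked numerically. The sketch below (single sample, no expectation; values are arbitrary assumptions) differentiates the gradient by central finite differences and compares the resulting Jacobian against $\mathbf{x}\mathbf{x}^T$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
w = rng.standard_normal(d)   # current parameters (arbitrary)
x = rng.standard_normal(d)   # data point (arbitrary)
b = -0.2
t = -1.0                     # label in {-1, 1}

def grad(w):
    # Gradient of 0.5*(1 - y*t)^2 with y = w.x + b
    y = w @ x + b
    return -t * (1 - y * t) * x

# Finite-difference Jacobian of the gradient = Hessian of the loss;
# row i holds the derivative of the gradient along direction e_i.
eps = 1e-6
H = np.array([(grad(w + eps * e) - grad(w - eps * e)) / (2 * eps)
              for e in np.eye(d)])
```

The computed `H` matches `np.outer(x, x)` (here $t^2 = 1$), independent of any row/column bookkeeping, since finite differences only see the function's values.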

2 Answers


Forgetting the expectation operator (for a while), the differential of the cost function is $$ d\phi = t(t\cdot y-1)\,dy = (y-t)\, \mathbf{x}^Td\mathbf{w} $$ since $dy=\mathbf{x}^Td\mathbf{w}$ and $t^2=1$.

The gradient is $\mathbf{g} = (y-t) \mathbf{x} $

It follows that $d\mathbf{g} = dy\cdot \mathbf{x} = (\mathbf{x} \mathbf{x}^T)\, d\mathbf{w}$, which gives the Hessian $\mathbf{H}=\mathbf{x} \mathbf{x}^T$.


Neither convention is "the correct way of doing it," and neither is wrong. There are two conventions, and as long as you follow the same convention consistently, you will get sensible answers.

If you treat the derivative as a linear operator approximating your function, then the derivative of a function $\mathbf{R}^n \to \mathbf{R}^m$ is a linear function $\mathbf{R}^n \to \mathbf{R}^m$. In the case that $m = 1$, that gives you a row vector.

Now let's start:

\begin{align} D_w (w^Tx + b)^2 &= 2(w^Tx + b)x^T \\ H_w (w^Tx + b)^2 &= D_w [2(w^Tx + b)x^T] \end{align}

Row vectors, right? But now stop right there! The function $f(w) = 2(w^Tx + b)x^T$ is a function from $\mathbf{R}^n$ to $\mathbf{R}^n$ even though the output is a row vector. More precisely, we want to pick a basis (the dual basis) for the space of row vectors and calculate our derivative in that basis. Representing a row vector in the dual basis is just taking the transpose, so we get

$$ H_w (w^Tx + b)^2 = D_w [2(w^Tx + b)x] = 2x D_w (w^Tx) = 2xx^T $$
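For what it's worth, the result $2xx^T$ can be verified numerically; a minimal sketch using second-order central differences for the Hessian of $f(w) = (w^Tx + b)^2$ (dimension and values are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
w = rng.standard_normal(d)   # point of evaluation (arbitrary)
x = rng.standard_normal(d)   # fixed vector (arbitrary)
b = 1.3

f = lambda w: (w @ x + b) ** 2

# Mixed second-order central differences:
# H[i, j] ~= [f(w+e_i+e_j) - f(w+e_i-e_j) - f(w-e_i+e_j) + f(w-e_i-e_j)] / (4 eps^2)
eps = 1e-4
I = np.eye(d)
H = np.array([[(f(w + eps * (I[i] + I[j])) - f(w + eps * (I[i] - I[j]))
                - f(w - eps * (I[i] - I[j])) + f(w - eps * (I[i] + I[j])))
               / (4 * eps ** 2)
               for j in range(d)] for i in range(d)])
```

Since $f$ is quadratic in $w$, the central differences are exact up to rounding, and `H` agrees with `2 * np.outer(x, x)`, whichever basis convention was used along the way.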

where we use the rule $D_x cf(x) = c D_x f(x)$.


This is how I think about the Hessian: as the second derivative, where we compute each derivative in the appropriate basis, which means taking the transpose at the second step. I learned to think about it this way while studying differential geometry/topology.

Sources that don't want to discuss derivatives on manifolds might simply define $H(f) = D(\nabla f)$ (the derivative/Jacobian of the gradient), where by definition $\nabla f = (D f)^T$.

When you work basis to basis, every map $\mathbf{R}^n \to \mathbf{R}^m$ is the same type of thing. If you aren't considering bases, then you need to worry about row vectors versus column vectors, and there are four kinds of maps $\mathbf{R}^n \to \mathbf{R}^m$ (row to row, row to column, etc.). Then you need a consistent way of computing derivatives of each kind.