ResNet derivative issue wrt the Hamiltonian


I'm reading *Deep Learning as OCP* (deep learning framed as an optimal control problem) and I have a question about a derivative.

Consider a Hamiltonian built in the following way:

$$\mathcal{H}(y,p,u) = \langle p,f(Ky+\beta)\rangle$$ where $K\in\mathbb{R}^{n\times n}$ and $y,\beta\in\mathbb{R}^n$. Moreover, $u=(K,\beta)$ is the control and $f$ acts component-wise on $Ky+\beta$, so $f: \mathbb{R}^n \to \mathbb{R}^n$. I'm having trouble calculating the adjoint system. I compute

$$ \dot p = -\partial_y\mathcal{H} = -\langle p, \partial_yf(Ky+\beta)\rangle $$

At this point I reasoned as follows: since $f$ acts component-wise, $\partial_yf$ should do the same, so I ended up with $\partial_yf = Kf'(Ky+\beta)$, and hence $\partial_yf:\mathbb{R}^n \to \mathbb{R}^n$ as well.

Now $$ \dot p = -\langle p, \partial_yf(Ky+\beta)\rangle = -\langle p,Kf'(Ky+\beta) \rangle = -[Kf'(Ky+\beta)]^\top p $$ but I don't think this is correct, because the article writes something like $$ \dot p = -K\;\partial_yf\odot p $$ where $\odot$ is the component-wise product, and that seems plausible since $p$ must be a vector of the same size as $y$. Could someone enlighten me?


Best answer:

> At this point I reasoned as follows: since $f$ acts component-wise, $\partial_yf$ should do the same, so I ended up with $\partial_yf = Kf'(Ky+\beta)$, and hence $\partial_yf:\mathbb{R}^n \to \mathbb{R}^n$ as well.

This does not make sense to me. If anything, $f$ acts component-wise on the vector $x=Ky+\beta$, so you could perhaps say something like this if you were differentiating with respect to $x$; but since $K$ mixes the components of $y$, this does not work for differentiating in $y$.

You just need to own up to this and do the chain rule. In fact, it's cleanest to write $F:\mathbb{R}^n \to \mathbb{R}^n$ with components $F_j(x)=f(x_j)$, and to treat $\beta$ as a constant. Then the chain rule says $D[F(Ky+\beta)]=DF|_{Ky+\beta}\cdot D(Ky+\beta)=DF\cdot K$. We then have
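As a quick sanity check on the chain rule (a sketch, assuming $f=\tanh$ for concreteness — any smooth componentwise $f$ works the same way), one can compare the Jacobian $DF\cdot K = \operatorname{diag}(f'(Ky+\beta))\,K$ against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
K = rng.standard_normal((n, n))
beta = rng.standard_normal(n)
y = rng.standard_normal(n)

# f acts component-wise; here f = tanh, f' = 1 - tanh^2
f = np.tanh
fprime = lambda x: 1.0 - np.tanh(x) ** 2

# Chain rule: Jacobian of y -> F(Ky + beta) is diag(f'(Ky + beta)) @ K
J_chain = np.diag(fprime(K @ y + beta)) @ K

# Central finite-difference Jacobian, one column per coordinate of y
eps = 1e-6
J_fd = np.zeros((n, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    J_fd[:, j] = (f(K @ (y + e) + beta) - f(K @ (y - e) + beta)) / (2 * eps)

print(np.max(np.abs(J_chain - J_fd)))  # tiny: the two Jacobians agree
```

The discrepancy is at the level of finite-difference error, confirming that $K$ enters the Jacobian on the right, not inside $f'$.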

$$DH=D\langle p, F(Ky+\beta)\rangle=\langle p, D[F(Ky+\beta)]\rangle=\langle p,DF\cdot K\rangle=p^\top DF\, K$$

and so $\partial_y H=(DH)^\top=K^\top DF^\top p= K^\top (DF^\top p)$.

The matrix $DF$ is diagonal with $f'|_{(Ky)_j+\beta_j}$ on the diagonal, so $DF^\top p$ has components $f'|_{(Ky)_j+\beta_j}\,p_j$. Perhaps this is what is denoted by $\partial_y f \odot p$. Then this result almost matches what you have written, except that $K$ is replaced by $K^\top$:

$$\partial_y H= K^\top (\partial_y f \odot p)$$

However, it's difficult for me to see how the version with $K$ could be correct: the problem also makes sense when $y\in \mathbb{R}^m$ and $K$ maps $\mathbb{R}^m$ to $\mathbb{R}^n$, and then the gradient lies in $\mathbb{R}^m$, consistent with applying $K^\top$ to the vector $\partial_y f \odot p \in \mathbb{R}^n$. Perhaps it's an issue of notational conventions, and the article's $\partial_y H$ is the derivative (a row vector) instead of the gradient — but then $K$ would sit in front. I am not entirely sure.
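The shape argument can also be checked numerically (again a sketch with $f=\tanh$, an assumption for illustration): take a rectangular $K\in\mathbb{R}^{n\times m}$ and verify that the gradient of $H(y)=\langle p, f(Ky+\beta)\rangle$ equals $K^\top(f'(Ky+\beta)\odot p)$, a vector in $\mathbb{R}^m$ like $y$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 5                      # K maps R^m -> R^n, so y in R^m, p in R^n
K = rng.standard_normal((n, m))
beta = rng.standard_normal(n)
y = rng.standard_normal(m)
p = rng.standard_normal(n)

f = np.tanh
fprime = lambda x: 1.0 - np.tanh(x) ** 2

H = lambda y: p @ f(K @ y + beta)   # scalar Hamiltonian <p, f(Ky + beta)>

# Claimed gradient: K^T (f'(Ky + beta) ⊙ p), which lives in R^m like y
grad = K.T @ (fprime(K @ y + beta) * p)

# Central finite differences, one coordinate of y at a time
eps = 1e-6
grad_fd = np.array([
    (H(y + eps * e) - H(y - eps * e)) / (2 * eps)
    for e in np.eye(m)
])

print(grad.shape, np.max(np.abs(grad - grad_fd)))
```

Note that the version with $K$ in front would not even type-check here: $K(\partial_y f\odot p)$ multiplies an $n\times m$ matrix by an $n$-vector, which is only possible when $K$ happens to be square.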