gradient computation for neural ODEs


I was reading the paper on Neural ODEs (here) and was wondering if anyone could offer some insight into how the gradient of the loss function is calculated.

If we are only considering 2 time points, $t_0,t_1$, I understand how the adjoint method works. However, what confuses me is when the loss function involves multiple time points, say $t_0,t_1,t_2$.
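For reference, the two-point adjoint system from the paper (in its notation, with parameters $\theta$) is

$$\mathbf{a}(t_1) = \frac{dL}{d\mathbf{z}(t_1)}, \qquad \frac{d\mathbf{a}(t)}{dt} = -\mathbf{a}(t)^\top \frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \mathbf{z}}, \qquad \frac{dL}{d\theta} = -\int_{t_1}^{t_0} \mathbf{a}(t)^\top \frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \theta}\, dt,$$

solved backwards in time from $t_1$ to $t_0$.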

The paper says (on p. 15) that the adjoint step can be performed separately on each of the intervals $[t_1,t_2]$ and $[t_0,t_1]$, and that the resulting gradients can be summed. I find this confusing, as I do Figure 2 of the paper (page 2).

Using the paper's notation, I understand that $\mathbf{a}(t) = \frac{dL}{d \mathbf{z}(t)}$ and $\mathbf{a}_t(t) = \frac{dL}{dt}(t)$ need to be computed first on the interval $[t_1,t_2]$; these results, together with an adjustment by $\frac{dL}{d \mathbf{z}(t_1)}$ and $\frac{dL}{dt}(t_1)$, are then used to compute the quantities $\mathbf{a}(t)$ and $\mathbf{a}_t(t)$ on the interval $[t_0,t_1]$. Specifically, according to the code in this blog, on the time interval $[t_0,t_1]$ the initial conditions have to be $\mathbf{a}(t_1) + \frac{dL}{d \mathbf{z}(t_1)}$ and $\mathbf{a}_t(t_1) - \frac{dL}{dt}(t_1)$. Can anyone help me understand, or show mathematically, why the adjustments to the gradient computation have to be done like this?
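To make the procedure concrete, here is a minimal numerical sketch of the multi-interval adjoint with the jump at the intermediate observation time. The toy linear dynamics, the quadratic loss, and all names here are my own assumptions for illustration (the paper uses a learned $f$ and a black-box ODE solver); I use plain forward Euler and only track $\frac{dL}{d\mathbf{z}(t_0)}$, omitting the parameter and time gradients:

```python
import numpy as np

# Toy linear dynamics dz/dt = A z (a rotation), standing in for the learned f.
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

def f(z, t):
    return A @ z

def df_dz(z, t):
    return A  # Jacobian of f w.r.t. z (constant for linear dynamics)

def euler(z, t_start, t_end, n=2000):
    """Fixed-step forward Euler from t_start to t_end."""
    h = (t_end - t_start) / n
    t = t_start
    for _ in range(n):
        z = z + h * f(z, t)
        t += h
    return z

def backward_interval(z, a, t_hi, t_lo, n=2000):
    """Integrate z and the adjoint a backwards in time from t_hi to t_lo:
       dz/dt = f(z, t),   da/dt = -(df/dz)^T a."""
    h = (t_lo - t_hi) / n  # negative step
    t = t_hi
    for _ in range(n):
        z_next = z + h * f(z, t)
        a = a + h * (-df_dz(z, t).T @ a)
        z = z_next
        t += h
    return z, a

# Observation times and targets; loss L = sum_i 0.5 * ||z(t_i) - y_i||^2.
t0, t1, t2 = 0.0, 0.5, 1.0
y1 = np.array([1.0, 1.0])
y2 = np.array([0.0, 1.0])

def adjoint_grad(z0):
    """dL/dz(t0) via the adjoint method with a jump at t1."""
    z1 = euler(z0, t0, t1)            # forward pass, interval [t0, t1]
    z2 = euler(z1, t1, t2)            # forward pass, interval [t1, t2]
    a = z2 - y2                       # a(t2) = dL/dz(t2)
    z_b, a = backward_interval(z2, a, t2, t1)
    a = a + (z1 - y1)                 # the adjustment from the question:
                                      # a(t1^-) = a(t1^+) + dL/dz(t1)
    _, a = backward_interval(z_b, a, t1, t0)
    return a                          # = dL/dz(t0)
```

The jump at $t_1$ reflects that $L$ depends on $\mathbf{z}(t_1)$ both directly (through the loss term at $t_1$) and indirectly (through $\mathbf{z}(t_2)$); the backward ODE only propagates the indirect path, so the direct term is added when the integration passes $t_1$.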

1 Answer

If you wish to use, e.g., integral loss functions distributed over the whole domain, you may want to take a look at Dissecting Neural ODEs, which has a code implementation in torchdyn.
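As a side note on how such integral losses can be handled in principle: one standard trick (a sketch under my own assumptions, not necessarily how torchdyn implements it) is to augment the state with a running cost $s$, with $\dot{s} = \ell(\mathbf{z}(t))$ and $s(t_0) = 0$, so that $L = s(T)$ becomes a terminal loss and the usual adjoint method applies unchanged:

```python
import numpy as np

def f(z, t):
    return -z  # toy dynamics dz/dt = -z, so z(t) = z0 * exp(-t)

def running_cost(z):
    return np.sum(z ** 2)  # instantaneous cost ell(z(t))

def augmented(x, t):
    # x = [z, s]; the extra coordinate s accumulates the integral of ell.
    z = x[:-1]
    return np.append(f(z, t), running_cost(z))

def euler(g, x, t_start, t_end, n=2000):
    h = (t_end - t_start) / n
    t = t_start
    for _ in range(n):
        x = x + h * g(x, t)
        t += h
    return x

z0 = np.array([1.0])
x_end = euler(augmented, np.append(z0, 0.0), 0.0, 1.0)
L = x_end[-1]  # ≈ ∫_0^1 e^{-2t} dt = (1 - e^{-2}) / 2 ≈ 0.432
```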