Neural ODEs: Dimension Mismatch in the Proof of the Adjoint Sensitivity Method


In the paper 'Neural Ordinary Differential Equations' (2018) by Ricky T. Q. Chen et al., the adjoint variable $\mathbf{a}(t)$ is defined as the gradient of the loss function with respect to the hidden state: $\mathbf{a}(t)=\frac{\partial L}{\partial\mathbf{z}(t)}$. Its dynamics are given by another ODE, which can be thought of as the instantaneous analog of the chain rule: $\frac{d\mathbf{a}(t)}{dt}=-\mathbf{a}(t)^\intercal\frac{\partial f(\mathbf{z}(t), t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}$. Here, it can be seen that $\mathbf{a}(t)$ is a column vector.

The original proof in the paper uses row vectors, and if we wish to continue representing $\mathbf{a}(t)$ as a column vector, then we have $$\mathbf{a}(t)=\frac{\partial L}{\partial\mathbf{z}(t)}=\frac{\partial L}{\partial\mathbf{z}(t+\varepsilon)}\frac{\partial\mathbf{z}(t+\varepsilon)}{\partial\mathbf{z}(t)}=\mathbf{a}(t+\varepsilon)^\intercal\frac{\partial T_{\varepsilon}(\mathbf{z}(t),t)}{\partial\mathbf{z}(t)}$$

And then

$$\begin{aligned}\frac{d\mathbf{a}(t)}{dt} &=\lim_{\varepsilon\rightarrow0^+}\frac{\mathbf{a}(t+\varepsilon)-\mathbf{a}(t)}{\varepsilon}=\lim_{\varepsilon\rightarrow0^+}\frac{1}{\varepsilon}\left[\mathbf{a}(t+\varepsilon)-\mathbf{a}(t+\varepsilon)^\intercal\frac{\partial T_{\varepsilon}(\mathbf{z}(t),t)}{\partial\mathbf{z}(t)}\right]\\&=\lim_{\varepsilon\rightarrow0^+}\frac{1}{\varepsilon}\left[\mathbf{a}(t+\varepsilon)-\mathbf{a}(t+\varepsilon)^\intercal\frac{\partial}{\partial\mathbf{z}(t)}\left(\mathbf{z}(t)+\varepsilon f(\mathbf{z}(t), t,\boldsymbol{\theta})+O(\varepsilon^2)\right)\right]\\&=\lim_{\varepsilon\rightarrow0^+}\frac{1}{\varepsilon}\left[\mathbf{a}(t+\varepsilon)-\mathbf{a}(t+\varepsilon)^\intercal\left(\mathbf{I}+\varepsilon\frac{\partial f(\mathbf{z}(t), t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}+O(\varepsilon^2)\right)\right]\\&=\lim_{\varepsilon\rightarrow0^+}\frac{1}{\varepsilon}\left[-\mathbf{a}(t+\varepsilon)^\intercal\left(\varepsilon\frac{\partial f(\mathbf{z}(t), t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}+O(\varepsilon^2)\right)\right]\\ &=\lim_{\varepsilon\rightarrow0^+}-\mathbf{a}(t+\varepsilon)^\intercal\frac{\varepsilon\frac{\partial f(\mathbf{z}(t), t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}+O(\varepsilon^2)}{\varepsilon}=\lim_{\varepsilon\rightarrow0^+}-\mathbf{a}(t+\varepsilon)^\intercal\left(\frac{\partial f(\mathbf{z}(t), t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}+O(\varepsilon)\right) \\&=-\mathbf{a}(t)^\intercal\frac{\partial f(\mathbf{z}(t), t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)} \end{aligned}$$

However, the middle terms $\mathbf{a}(t+\varepsilon)$ and $\mathbf{a}(t+\varepsilon)^\intercal\mathbf{I}$ cannot cancel each other out, because a column vector and a row vector have different dimensions. For this equation to be valid, the left-hand side should initially be $\frac{d\mathbf{a}^\intercal(t)}{dt}$. But then the chain-rule step would read $\mathbf{a}^\intercal(t)=\mathbf{a}(t+\varepsilon)^\intercal\frac{\partial T_{\varepsilon}(\mathbf{z}(t),t)}{\partial\mathbf{z}(t)}$, and therefore $\mathbf{a}^\intercal(t)=\frac{\partial L}{\partial\mathbf{z}(t)}$, which contradicts the definition. Unless $\mathbf{a}(t)=\mathbf{a}^\intercal(t)$, is that the case?

Answer:

Presumably you do not want to redefine all the notational conventions of multivariable calculus. Then $\frac{\partial L}{\partial\mathbf{z}(t)}$ is a Jacobian matrix. Since $L$ is a scalar and $\mathbf{z}$ a vector, this Jacobian has a single row. So to obtain a column vector from it, you need to transpose it; that is, $\mathbf{a}=\nabla_{\mathbf{z}}L=\left(\frac{\partial L}{\partial\mathbf{z}}\right)^\intercal$ is indeed the gradient of $L$ with respect to $\mathbf{z}$.
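Spelled out with this convention (a restatement of the above, not a new result), the paper's row-vector identity and the column-vector form of the adjoint dynamics are just transposes of one another:

$$\frac{d\mathbf{a}(t)^\intercal}{dt}=-\mathbf{a}(t)^\intercal\frac{\partial f(\mathbf{z}(t),t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}\qquad\Longleftrightarrow\qquad\frac{d\mathbf{a}(t)}{dt}=-\left(\frac{\partial f(\mathbf{z}(t),t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}\right)^{\intercal}\mathbf{a}(t)$$

Every line of the derivation in the question goes through once each $\mathbf{a}(\cdot)$ standing alone is read as the row vector $\frac{\partial L}{\partial\mathbf{z}(\cdot)}$, or equivalently once both sides are transposed.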

This correction resolves the dimension mismatch.
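As a numerical sanity check, one can verify the column-vector form $\frac{d\mathbf{a}}{dt}=-\left(\frac{\partial f}{\partial\mathbf{z}}\right)^\intercal\mathbf{a}$ on a toy linear system, where the Jacobian is a constant matrix. This is a minimal sketch with illustrative choices (the matrix `A`, the linear loss $L=\mathbf{c}^\intercal\mathbf{z}(T)$, and the RK4 integrator are my own, not from the paper): integrating the adjoint ODE backward from $\mathbf{a}(T)=\mathbf{c}$ should reproduce the finite-difference gradient $\partial L/\partial\mathbf{z}(0)$.

```python
import numpy as np

# Toy dynamics f(z, t) = A @ z, so the Jacobian df/dz is the constant matrix A.
# A, c, T are illustrative choices, not values from the paper.
A = np.array([[0.0, 1.0],
              [-2.0, -0.5]])
f = lambda z, t: A @ z

def rk4(g, y, t0, t1, n=200):
    """Classic fixed-step RK4; h is negative when t1 < t0 (backward integration)."""
    h = (t1 - t0) / n
    t = t0
    for _ in range(n):
        k1 = g(y, t)
        k2 = g(y + h / 2 * k1, t + h / 2)
        k3 = g(y + h / 2 * k2, t + h / 2)
        k4 = g(y + h * k3, t + h)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y

z0 = np.array([1.0, 0.5])
c = np.array([0.3, -0.7])   # L(z(T)) = c^T z(T), so a(T) = dL/dz(T) = c as a column vector
T = 1.0

# Forward pass, then the adjoint ODE da/dt = -(df/dz)^T a, integrated from T back to 0.
zT = rk4(f, z0, 0.0, T)
a0 = rk4(lambda a, t: -A.T @ a, c, T, 0.0)

# Finite-difference gradient of L with respect to z(0), one coordinate at a time.
eps = 1e-6
fd = np.array([(c @ rk4(f, z0 + eps * e, 0.0, T) - c @ zT) / eps
               for e in np.eye(2)])

# The backward-integrated adjoint matches the finite-difference gradient.
print(np.max(np.abs(a0 - fd)))
```

The check works because both $L$ and the dynamics are linear here, so the finite difference is essentially exact; for a nonlinear $f$, the same comparison holds up to discretization and finite-difference error.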