Neural ODEs: Dimension Mismatch in the Proof of the Adjoint Sensitivity Method


In the paper 'Neural Ordinary Differential Equations' (2018) by Ricky T. Q. Chen et al., the adjoint variable $\mathbf{a}(t)$ is defined as the gradient of the loss function with respect to the hidden state: $\mathbf{a}(t)=\frac{\partial L}{\partial\mathbf{z}(t)}$. Its dynamics are given by another ODE, which can be thought of as the instantaneous analog of the chain rule: $\frac{d\mathbf{a}(t)}{dt}=-\mathbf{a}(t)^\intercal\frac{\partial f(\mathbf{z}(t), t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}$. Here, it can be seen that $\mathbf{a}(t)$ is a column vector.

The original proof in the paper uses row vectors, and if we wish to continue representing $\mathbf{a}(t)$ as a column vector, then we have $$\mathbf{a}(t)=\frac{\partial L}{\partial\mathbf{z}(t)}=\frac{\partial L}{\partial\mathbf{z}(t+\varepsilon)}\frac{\partial\mathbf{z}(t+\varepsilon)}{\partial\mathbf{z}(t)}=\mathbf{a}(t+\varepsilon)^\intercal\frac{\partial T_{\varepsilon}(\mathbf{z}(t),t)}{\partial\mathbf{z}(t)}$$

And then

$$\begin{aligned}\frac{d\mathbf{a}(t)}{dt} &=\lim_{\varepsilon\rightarrow0^+}\frac{\mathbf{a}(t+\varepsilon)-\mathbf{a}(t)}{\varepsilon}=\lim_{\varepsilon\rightarrow0^+}\frac{1}{\varepsilon}\left[\mathbf{a}(t+\varepsilon)-\mathbf{a}(t+\varepsilon)^\intercal\frac{\partial T_{\varepsilon}(\mathbf{z}(t),t)}{\partial\mathbf{z}(t)}\right]\\&=\lim_{\varepsilon\rightarrow0^+}\frac{1}{\varepsilon}\left[\mathbf{a}(t+\varepsilon)-\mathbf{a}(t+\varepsilon)^\intercal\frac{\partial}{\partial\mathbf{z}(t)}\left(\mathbf{z}(t)+\varepsilon f(\mathbf{z}(t), t,\boldsymbol{\theta})+O(\varepsilon^2)\right)\right]\\&=\lim_{\varepsilon\rightarrow0^+}\frac{1}{\varepsilon}\left[\mathbf{a}(t+\varepsilon)-\mathbf{a}(t+\varepsilon)^\intercal\left(\mathbf{I}+\varepsilon\frac{\partial f(\mathbf{z}(t), t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}+O(\varepsilon^2)\right)\right]\\&=\lim_{\varepsilon\rightarrow0^+}\frac{1}{\varepsilon}\left[-\mathbf{a}(t+\varepsilon)^\intercal\left(\varepsilon\frac{\partial f(\mathbf{z}(t), t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}+O(\varepsilon^2)\right)\right]\\ &=\lim_{\varepsilon\rightarrow0^+}-\mathbf{a}(t+\varepsilon)^\intercal\frac{\varepsilon\frac{\partial f(\mathbf{z}(t), t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}+O(\varepsilon^2)}{\varepsilon}=\lim_{\varepsilon\rightarrow0^+}-\mathbf{a}(t+\varepsilon)^\intercal\left(\frac{\partial f(\mathbf{z}(t), t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}+O(\varepsilon)\right) \\&=-\mathbf{a}(t)^\intercal\frac{\partial f(\mathbf{z}(t), t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)} \end{aligned}$$

However, the middle terms $\mathbf{a}(t+\varepsilon)$ and $\mathbf{a}(t+\varepsilon)^\intercal\mathbf{I}$ cannot cancel each other out, because a column vector and a row vector have different dimensions. For this equation to be valid, the left-hand side should initially be $\frac{d\mathbf{a}^\intercal(t)}{dt}$. But then the chain-rule step would read $\mathbf{a}^\intercal(t)=\mathbf{a}(t+\varepsilon)^\intercal\frac{\partial T_{\varepsilon}(\mathbf{z}(t),t)}{\partial\mathbf{z}(t)}$, and therefore $\mathbf{a}^\intercal(t)=\frac{\partial L}{\partial\mathbf{z}(t)}$, which contradicts the definition. Unless $\mathbf{a}(t)=\mathbf{a}^\intercal(t)$, is that the case?

Answer:

Presumably you do not want to redefine all the notational conventions of multivariable calculus. Then $\frac{\partial L}{\partial\mathbf{z}(t)}$ is a Jacobian matrix. Since $L$ is a scalar and $\mathbf{z}$ a vector, this Jacobian has a single row. So to obtain a column vector from it, you need to transpose it; that is, $\mathbf{a}=\nabla_{\mathbf{z}}L=\left(\frac{\partial L}{\partial\mathbf{z}}\right)^\intercal$ is indeed the gradient of $L$ with respect to $\mathbf{z}$.
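Spelled out with this convention (a restatement of the above, not a new result), the paper's row-vector identity and the column-vector form of the adjoint dynamics are just transposes of one another:

$$\frac{d\mathbf{a}(t)^\intercal}{dt}=-\mathbf{a}(t)^\intercal\frac{\partial f(\mathbf{z}(t),t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}\qquad\Longleftrightarrow\qquad\frac{d\mathbf{a}(t)}{dt}=-\left(\frac{\partial f(\mathbf{z}(t),t,\boldsymbol{\theta})}{\partial\mathbf{z}(t)}\right)^{\intercal}\mathbf{a}(t)$$

Every line of the derivation in the question goes through once each $\mathbf{a}(\cdot)$ standing alone is read as the row vector $\frac{\partial L}{\partial\mathbf{z}(\cdot)}$, or equivalently once both sides are transposed.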

This correction resolves the dimension mismatch.
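As a numerical sanity check, one can verify the column-vector form $\frac{d\mathbf{a}}{dt}=-\left(\frac{\partial f}{\partial\mathbf{z}}\right)^\intercal\mathbf{a}$ on a toy linear system, where the Jacobian is a constant matrix. This is a minimal sketch with illustrative choices (the matrix `A`, the linear loss $L=\mathbf{c}^\intercal\mathbf{z}(T)$, and the RK4 integrator are my own, not from the paper): integrating the adjoint ODE backward from $\mathbf{a}(T)=\mathbf{c}$ should reproduce the finite-difference gradient $\partial L/\partial\mathbf{z}(0)$.

```python
import numpy as np

# Toy dynamics f(z, t) = A @ z, so the Jacobian df/dz is the constant matrix A.
# A, c, T are illustrative choices, not values from the paper.
A = np.array([[0.0, 1.0],
              [-2.0, -0.5]])
f = lambda z, t: A @ z

def rk4(g, y, t0, t1, n=200):
    """Classic fixed-step RK4; h is negative when t1 < t0 (backward integration)."""
    h = (t1 - t0) / n
    t = t0
    for _ in range(n):
        k1 = g(y, t)
        k2 = g(y + h / 2 * k1, t + h / 2)
        k3 = g(y + h / 2 * k2, t + h / 2)
        k4 = g(y + h * k3, t + h)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y

z0 = np.array([1.0, 0.5])
c = np.array([0.3, -0.7])   # L(z(T)) = c^T z(T), so a(T) = dL/dz(T) = c as a column vector
T = 1.0

# Forward pass, then the adjoint ODE da/dt = -(df/dz)^T a, integrated from T back to 0.
zT = rk4(f, z0, 0.0, T)
a0 = rk4(lambda a, t: -A.T @ a, c, T, 0.0)

# Finite-difference gradient of L with respect to z(0), one coordinate at a time.
eps = 1e-6
fd = np.array([(c @ rk4(f, z0 + eps * e, 0.0, T) - c @ zT) / eps
               for e in np.eye(2)])

# The backward-integrated adjoint matches the finite-difference gradient.
print(np.max(np.abs(a0 - fd)))
```

The check works because both $L$ and the dynamics are linear here, so the finite difference is essentially exact; for a nonlinear $f$, the same comparison holds up to discretization and finite-difference error.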