Proof of Stability of parabolic-like Convolutional Residual Neural Networks


I am trying to understand a recent publication by Lars Ruthotto and Eldad Haber called Deep Neural Networks motivated by Partial Differential Equations published in the Journal of Mathematical Imaging and Vision.

In this paper, the authors derive a continuous interpretation of (Convolutional) Residual Neural Networks by seeing ResNet structures as a forward Euler discretization of an initial value problem.


In the following, I try to give a straightforward introduction to the paper and derive the equation I do not understand.

We use the following notation:

  • training data: $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_s] \in \mathbb{R}^{n \times s} $
  • true labels: $\mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_s] \in \mathbb{R}^{m \times s}$
  • weight vector: $\mathbf{\theta} = \begin{bmatrix} \mathbf{\theta}^{(1)} & \mathbf{\theta}^{(2)} & \mathbf{\theta}^{(3)} \end{bmatrix}^\top$
  • linear operators: $\mathbf{K}_1(\cdot) \in \mathbb{R}^{\tilde{w} \times n}$ and $\mathbf{K}_2(\cdot) \in \mathbb{R}^{w_{\mathrm{out}} \times \tilde{w}}$
    • $\tilde{w}$ and $w_{\mathrm{out}}$ denote the widths of the layers
  • activation function $\sigma: \mathbb{R} \rightarrow \mathbb{R}$ (applied component-wise)
  • normalization layer $\mathcal{N}$ (will be ignored)

The authors provide a general formulation of a layer as follows,

$$\mathbf{F}(\mathbf{\theta}, \mathbf{Y}) = \mathbf{K}_2(\mathbf{\theta}^{(3)}) \, \sigma \left( \mathcal{N} (\mathbf{K}_1(\mathbf{\theta}^{(1)}) \mathbf{Y}, \mathbf{\theta}^{(2)}) \right) \, .$$

For simplicity, we assume a symmetric version $$\mathbf{F}_{\mathrm{sym}}(\mathbf{\theta}, \mathbf{Y}) = -\mathbf{K}(\mathbf{\theta})^\top \sigma \left( \mathbf{K}(\mathbf{\theta}) \mathbf{Y} \right) \, ,$$ where $\mathbf{K}_2 = -\mathbf{K}_1^\top$, $\mathbf{K} := \mathbf{K}_1$, and $\mathcal{N}(\mathbf{Y})=\mathbf{Y}$. The propagation through a ResNet layer is then given by $$\begin{align} \tag{1} \label{1} \mathbf{Y}_{j+1} = \mathbf{Y}_{j} + \mathbf{F}_{\mathrm{sym}}(\mathbf{\theta}^{(j)}, \mathbf{Y}_j), \quad \text{for} \ j=0,1,\dots, N-1 \, . \end{align}$$

Next, they introduce the initial value problem $$\begin{align} \tag{2} \label{2} \partial_t \mathbf{Y}(\mathbf{\theta}, t) &= \mathbf{F}_{\mathrm{sym}}(\mathbf{\theta}(t), \mathbf{Y}(t)), \quad \text{for}\ t \in (0,T] \, , \\ \mathbf{Y}(\mathbf{\theta},0) &= \mathbf{Y}_0 \, , \end{align}$$ whose forward Euler discretization with step size $h=1$ recovers equation (\ref{1}). Furthermore, the following definition of stability of the ODE/PDE is given:

"Here, we say that the forward propagation in equation (\ref{1}) is stable if there is a constant $M > 0$ independent of T such that $$ \begin{align} \tag{3} \label{3} \vert \vert \mathbf{Y}(\mathbf{\theta},T) - \mathbf{\tilde{Y}}(\mathbf{\theta},T) \vert \vert_{\mathrm{F}} \leq M \vert \vert \mathbf{Y}_0 - \mathbf{\tilde{Y}}_0 \vert \vert_{\mathrm{F}}\text{"} \, . \end{align}$$ $\vert \vert \cdot \vert \vert_{\mathrm{F}}$ denotes the Frobenius norm. Now, after having a continuous interpretation of the residual neural network, they prove the stability of that network as follows:

Theorem. If the activation function $\sigma$ is monotonically non-decreasing, then the forward propagation through a parabolic CNN satisfies eq. (\ref{3}).

Proof. First, we show that $\mathbf{Y} \mapsto -\mathbf{F}_{\mathrm{sym}}(\mathbf{\theta}, \mathbf{Y})$ is a monotone operator. Let $\mathbf{Y}$ and $\mathbf{\tilde{Y}}$ be the solutions of eq. (\ref{2}) for the initial values $\mathbf{Y}_0$ and $\mathbf{\tilde{Y}}_0$, respectively. Note that for all $t\in (0,T]$, we have

$$ \begin{align} -\left(\sigma (\mathbf{K}(t) \mathbf{Y}) - \sigma (\mathbf{K}(t) \mathbf{\tilde{Y}}), \, \mathbf{K}(t) (\mathbf{Y}- \mathbf{\tilde{Y}}) \right) \leq 0 \, , \end{align} $$

where $(\cdot,\cdot)$ denotes the inner product. The inequality follows from the monotonicity of $\sigma$, since it holds component-wise. From this it follows that

$$ \begin{align} \tag{4} \label{4} \partial_t \vert\vert \mathbf{Y}(t) - \mathbf{\tilde{Y}}(t) \vert\vert_{\mathrm{F}}^2 \leq 0 \, . \end{align} $$

Integrating this inequality over $[0,T]$ yields stability as defined above. $\square$
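To convince myself that the chain of inequalities at least holds numerically, I wrote the following small sketch (my own code, not from the paper; the ReLU activation, the fixed $\mathbf{K}$, and the matrix sizes are arbitrary choices). It checks the inner-product inequality and, via a forward Euler simulation of eq. (\ref{2}) with a small step size, that $\vert\vert \mathbf{Y}(t) - \mathbf{\tilde{Y}}(t) \vert\vert_{\mathrm{F}}$ is non-increasing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, w = 4, 3, 5                      # feature dim, samples, layer width
K = rng.standard_normal((w, n))        # fixed K(t) = K for simplicity
sigma = lambda x: np.maximum(x, 0.0)   # ReLU, monotonically non-decreasing

def f_sym(Y):
    """F_sym(theta, Y) = -K^T sigma(K Y)."""
    return -K.T @ sigma(K @ Y)

Y = rng.standard_normal((n, s))
Yt = Y + 0.1 * rng.standard_normal((n, s))

# key inner product: (sigma(K Y) - sigma(K Yt), K (Y - Yt))_F >= 0,
# because sigma is monotone and acts component-wise
ip = np.sum((sigma(K @ Y) - sigma(K @ Yt)) * (K @ (Y - Yt)))
# moving K to the other side of the inner product (adjoint identity):
# (F_sym(Y) - F_sym(Yt), Y - Yt)_F = -ip <= 0
lhs = np.sum((f_sym(Y) - f_sym(Yt)) * (Y - Yt))
print(ip >= 0, np.isclose(lhs, -ip))   # True True

# forward Euler with a small step: ||Y - Yt||_F never grows
h, dists = 0.01, []
for _ in range(500):
    dists.append(np.linalg.norm(Y - Yt))
    Y, Yt = Y + h * f_sym(Y), Yt + h * f_sym(Yt)
print(all(d2 <= d1 + 1e-12 for d1, d2 in zip(dists, dists[1:])))  # True
```

The adjoint identity in the middle is what ties the displayed inequality to the time derivative of the squared Frobenius norm, if I understand the proof correctly.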

Question 1: I do not see how eq. (\ref{4}) is connected to the preceding inequality. Furthermore, I do not know how to apply the partial derivative with respect to $t$ in eq. (\ref{4}), due to the Frobenius norm. This seems essential for understanding the proof.

Question 2: Another object I have not understood so far is the Jacobian of $\mathbf{F}_{\mathrm{sym}}$ with respect to the features $\mathbf{Y}$, which is given by

$$ \begin{align} \mathbf{J}_{\mathbf{Y}} \mathbf{F}_{\mathrm{sym}} = - \mathbf{K}(\mathbf{\theta})^\top \operatorname{diag}\left(\sigma'(\mathbf{K}(\mathbf{\theta}) \mathbf{Y})\right)\mathbf{K}(\mathbf{\theta}) \, . \end{align} $$

How does one obtain the above Jacobian?
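For reference, here is a finite-difference check I ran (my own sketch; the $\tanh$ activation and the sizes are arbitrary assumptions). It treats a single feature vector $\mathbf{y} \in \mathbb{R}^n$ so that the Jacobian is an ordinary $n \times n$ matrix, and writes the leading $\mathbf{K}$ with a transpose, consistent with $\mathbf{K}_2 = -\mathbf{K}_1^\top$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, w = 4, 6
K = rng.standard_normal((w, n))
y = rng.standard_normal(n)
sigma = np.tanh
dsigma = lambda x: 1.0 - np.tanh(x) ** 2   # sigma'(x)

f = lambda y: -K.T @ sigma(K @ y)          # F_sym for one sample

# chain rule: d/dy sigma(K y) = diag(sigma'(K y)) K, then left-multiply by -K^T
J_analytic = -K.T @ np.diag(dsigma(K @ y)) @ K

# central finite differences, column by column
eps = 1e-6
J_fd = np.zeros((n, n))
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    J_fd[:, i] = (f(y + e) - f(y - e)) / (2.0 * eps)

print(np.allclose(J_analytic, J_fd, atol=1e-6))  # True
```

The two matrices agree, so the formula seems to be exactly the component-wise chain rule, but I would like to see the derivation spelled out for the matrix-valued case.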