Analytical formula for third-order partial derivatives of feed-forward neural networks


Is there an analytical formula for the third-order partial derivatives of a simple feed-forward neural network's output with respect to its inputs? (NOT with respect to the weights and biases of the neural network!)

For first-order partial derivatives (Jacobian) and second-order partial derivatives (Hessian), there are efficient formulas (e.g., https://link.springer.com/article/10.1007/s00291-022-00672-1) which are quite straightforward to implement. However, I couldn't find similar formulas for the third-order partial derivatives.

To give you some background: I have trained a neural network, which I use in a subroutine of a numerical simulation code. This subroutine is called tens of thousands of times during the simulations and should output all partial derivatives of the network's output with respect to its input up to order three. Therefore, I need an efficient (preferably analytical) formula.

Thanks for your help!


(Too long for a comment.)

For a given smooth function $\mathbf{F} = \Biggl[ \begin{smallmatrix} f^1 \\ \vdots \\ f^n \end{smallmatrix} \Biggr] : \mathbb{R}^m \to \mathbb{R}^n$, let

$$ \mathbf{F}^{(k)} = [ f^{i}_{j_1, \ldots, j_k} ], \qquad f^{i}_{j_1, \ldots, j_k} = \frac{\partial^k f^i}{\partial x^{j_1} \cdots \partial x^{j_k}}, $$

denote the tensor representing the $k$-th derivative of $\mathbf{F}$. Using this notation, we investigate how the third-order derivative $(\mathbf{F} \circ \mathbf{G})^{(3)}$ relates to the derivatives of $\mathbf{F}$ and $\mathbf{G}$.

Indeed, writing $\mathbf{H} = \mathbf{F} \circ \mathbf{G}$ for simplicity, the chain rule tells that

$$ h^{i}_{j} = \sum_k f^{i}_{k} g^{k}_{j}. $$

Then by the chain rule again,

$$ h^{i}_{j_1 j_2} = \sum_k (f^{i}_{k} g^{k}_{j_1})_{j_2} = \sum_{k_1, k_2} f^{i}_{k_1 k_2} g^{k_1}_{j_1} g^{k_2}_{j_2} + \sum_k f^{i}_{k} g^{k}_{j_1 j_2} $$

and

\begin{align*} h^{i}_{j_1 j_2 j_3} &= \sum_{k_1, k_2, k_3} f^{i}_{k_1 k_2 k_3} g^{k_1}_{j_1} g^{k_2}_{j_2} g^{k_3}_{j_3} \\ &\quad + \sum_{k_1, k_2} f^{i}_{k_1 k_2} g^{k_1}_{j_1 j_2} g^{k_2}_{j_3} + \sum_{k_1, k_2} f^{i}_{k_1 k_2} g^{k_1}_{j_2 j_3} g^{k_2}_{j_1} + \sum_{k_1, k_2} f^{i}_{k_1 k_2} g^{k_1}_{j_3 j_1} g^{k_2}_{j_2} \\ &\quad + \sum_k f^{i}_{k} g^{k}_{j_1 j_2 j_3}. \end{align*}
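These composition formulas translate directly into tensor contractions. Here is a sketch using NumPy's `einsum`, where `F1, F2, F3` and `G1, G2, G3` (names are mine, not standard) hold the derivative tensors of $\mathbf{F}$ (evaluated at $\mathbf{G}(\mathbf{x})$) and of $\mathbf{G}$ (evaluated at $\mathbf{x}$):

```python
import numpy as np

def compose_derivatives(F1, F2, F3, G1, G2, G3):
    """Combine derivative tensors of F (at G(x)) and G (at x) into
    derivative tensors of H = F o G, following the chain-rule formulas above.

    F1: (n, p)        Jacobian of F,     F1[i, k]      = f^i_k
    F2: (n, p, p)     Hessian of F,      F2[i, k1, k2] = f^i_{k1 k2}
    F3: (n, p, p, p)  third derivative of F
    G1: (p, m), G2: (p, m, m), G3: (p, m, m, m)  likewise for G.
    """
    H1 = np.einsum('ik,kj->ij', F1, G1)
    H2 = (np.einsum('iab,aj,bk->ijk', F2, G1, G1)
          + np.einsum('ia,ajk->ijk', F1, G2))
    H3 = (np.einsum('iabc,aj,bk,cl->ijkl', F3, G1, G1, G1)
          # three symmetric F2-terms: g_{j1 j2} g_{j3}, g_{j2 j3} g_{j1}, g_{j3 j1} g_{j2}
          + np.einsum('iab,ajk,bl->ijkl', F2, G2, G1)
          + np.einsum('iab,akl,bj->ijkl', F2, G2, G1)
          + np.einsum('iab,alj,bk->ijkl', F2, G2, G1)
          + np.einsum('ia,ajkl->ijkl', F1, G3))
    return H1, H2, H3
```

As a sanity check, with $F(u) = u^2$ and $G(x) = \sin x$ (so $H(x) = \sin^2 x$) the output matches $H' = \sin 2x$, $H'' = 2\cos 2x$, $H''' = -4\sin 2x$.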

One-Step Forward Propagation.

As a special case, assume $\mathbf{F}$ takes the form of a single layer of a feed-forward neural network:

$$ \mathbf{F}(\mathbf{X}) = \alpha(\mathbf{Z}) = \begin{bmatrix} \alpha(z^1) \\ \alpha(z^2) \\ \vdots \end{bmatrix}, \qquad \mathbf{Z} = \mathbf{W}\mathbf{X} + \mathbf{b} = \begin{bmatrix} \sum_k w^1_k x^k + b^1 \\ \sum_k w^2_k x^k + b^2 \\ \vdots \end{bmatrix}, $$

where $\mathbf{W} = [w^{i}_{j}]$ is a matrix, $\mathbf{b} = [b^i]$ is a column vector, and $\alpha : \mathbb{R} \to \mathbb{R}$ is applied elementwise. Then

\begin{align*} f^i_j &= \alpha'(z^i) w^i_j, \\ f^i_{j_1 j_2} &= \alpha''(z^i) w^i_{j_1} w^i_{j_2}, \\ f^i_{j_1 j_2 j_3} &= \alpha'''(z^i) w^i_{j_1} w^i_{j_2} w^i_{j_3}. \end{align*}

Hence, introducing auxiliary tensors $\hat{\mathbf{G}}^{(k)} = \mathbf{W}\mathbf{G}^{(k)}$ with entries $ \hat{g}^i_{j_1, j_2, \ldots} = \sum_p w^i_p g^p_{j_1, j_2, \ldots} $, we get

\begin{align*} h^i_j &= \alpha'(z^i) \hat{g}^i_j, \\[1em] h^i_{j_1 j_2} &= \alpha''(z^i) \hat{g}^i_{j_1}\hat{g}^i_{j_2} + \alpha'(z^i) \hat{g}^i_{j_1 j_2}, \\[1em] h^i_{j_1 j_2 j_3} &= \alpha'''(z^i) \hat{g}^i_{j_1}\hat{g}^i_{j_2}\hat{g}^i_{j_3} + \alpha''(z^i) [\hat{g}^i_{j_1 j_2}\hat{g}^i_{j_3} + \hat{g}^i_{j_2 j_3}\hat{g}^i_{j_1} + \hat{g}^i_{j_3 j_1}\hat{g}^i_{j_2}] + \alpha'(z^i) \hat{g}^i_{j_1 j_2 j_3}. \end{align*}

These operations can be vectorized (at least in part) using linear algebra libraries, such as NumPy in Python. The update can then be applied recursively, layer by layer, to compute the derivatives up to order $3$ of the output feature vector as a function of the input feature vector.
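To make the recursion concrete, here is a sketch of the layer-by-layer forward propagation in NumPy. The function name, argument layout, and the choice of passing the activation's derivatives as callables are my own conventions, not from the answer; the update itself follows the formulas for $h^i_j$, $h^i_{j_1 j_2}$, $h^i_{j_1 j_2 j_3}$ above:

```python
import numpy as np

def mlp_derivatives(weights, biases, act, dact, d2act, d3act, x):
    """Forward-propagate first-, second-, and third-order input derivatives
    through a feed-forward network using the layer update formulas above.

    weights, biases : per-layer lists of W (matrix) and b (vector)
    act, dact, d2act, d3act : elementwise activation and its derivatives
    x : input vector of size m
    Returns (y, D1, D2, D3) with D1[i, j] = dy^i/dx^j, etc.
    (Assumes every layer, including the last, uses the same activation.)
    """
    m = x.size
    # Derivatives of the identity map G(x) = x.
    G1 = np.eye(m)
    G2 = np.zeros((m, m, m))
    G3 = np.zeros((m, m, m, m))
    h = x
    for W, b in zip(weights, biases):
        z = W @ h + b
        # hatted tensors: g-hat^(k) = W G^(k)
        g1 = np.einsum('ip,pj->ij', W, G1)
        g2 = np.einsum('ip,pjk->ijk', W, G2)
        g3 = np.einsum('ip,pjkl->ijkl', W, G3)
        a1, a2, a3 = dact(z), d2act(z), d3act(z)
        G1 = a1[:, None] * g1
        G2 = (a2[:, None, None] * g1[:, :, None] * g1[:, None, :]
              + a1[:, None, None] * g2)
        G3 = (a3[:, None, None, None]
                * g1[:, :, None, None] * g1[:, None, :, None] * g1[:, None, None, :]
              + a2[:, None, None, None]
                * (g2[:, :, :, None] * g1[:, None, None, :]
                   + g2[:, None, :, :] * g1[:, :, None, None]
                   + g2[:, :, None, :] * g1[:, None, :, None])
              + a1[:, None, None, None] * g3)
        h = act(z)
    return h, G1, G2, G3
```

For a single $\tanh$ layer in one dimension, $h(x) = \tanh(wx + b)$, the returned tensors reduce to $w^k \tanh^{(k)}(wx+b)$, which can serve as a quick correctness check.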

Remark. A main drawback of this approach is that computing products of the form $[\hat{g}^i_{j_1}\hat{g}^i_{j_2}\hat{g}^i_{j_3}]$ is very time-costly, presumably due to the lack of an efficient implementation of this type of product in major linear algebra libraries. For this reason, I suspect it would be more efficient to compute the derivatives in a backward fashion, where the number of times this exotic product must be evaluated can be reduced significantly. I will revisit this idea when I have some time.