Taking the infinite-width limit of multilayer perceptrons can simplify the Taylor expansion

When we apply a Taylor expansion to a trained network function $f(x; \theta^{*})$ around the initialization $\theta$, supposing for simplicity that the parameter $\theta$ and the output $f(x;\theta)$ (and hence its derivatives) are scalars, we obtain

$$ f(x;\theta^{*}) = f(x;\theta) + (\theta^{*}-\theta) \frac{df}{d\theta}+\frac{1}{2}(\theta^{*}-\theta)^2\frac{d^2f}{d\theta^2} + \cdots $$
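As a quick numerical sanity check of this expansion, here is a minimal sketch using a toy scalar "network" $f(\theta)=\tanh\theta$ (this choice and the numbers are purely illustrative, not from the book): the second-order term visibly improves on the first-order approximation.

```python
import numpy as np

# Toy scalar "network" f(theta) = tanh(theta), purely for illustration.
def f(theta):
    return np.tanh(theta)

theta0 = 0.3          # "initialization"
theta_star = 0.35     # "trained" parameter, close to the initialization

# Analytic derivatives of tanh: f' = 1 - tanh^2, f'' = -2 tanh (1 - tanh^2)
t = np.tanh(theta0)
df = 1 - t**2
d2f = -2 * t * (1 - t**2)

delta = theta_star - theta0
first_order = f(theta0) + delta * df
second_order = first_order + 0.5 * delta**2 * d2f

# The second-order expansion is closer to the true value than the first-order one.
err1 = abs(f(theta_star) - first_order)
err2 = abs(f(theta_star) - second_order)
print(err1, err2)
```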

Now consider a neural network architecture with width $n$ and fixed depth $L$. According to The Principles of Deep Learning Theory (PDLT) (p. 7), if we take the infinite-width limit of an idealized network, $$ \lim_{n\rightarrow \infty} p(f^{*}), $$ the distribution over trained networks $p(f^{*})$ simplifies. Without proof, the book makes two claims:

Claim 1

All the higher-derivative terms $\frac{d^kf}{d\theta^k}$ for $k \geq 2$ effectively vanish, meaning we only need to keep track of two quantities, $$ f, \quad \frac{df}{d\theta} $$
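One way to see what Claim 1 buys you: if every derivative beyond the first vanishes, the model behaves exactly like a model that is *linear in its parameters*, for which the first-order Taylor expansion is exact. A minimal sketch with hypothetical fixed random features standing in for $df/d\theta$ (this construction is illustrative, not the book's):

```python
import numpy as np

rng = np.random.default_rng(0)

# A model that is exactly linear in its parameters theta (fixed "features" phi
# of a single input x) -- the effective situation described by Claim 1.
phi = rng.normal(size=5)
f = lambda theta: theta @ phi          # f(x; theta), linear in theta

theta = rng.normal(size=5)             # initialization
theta_star = theta + 0.1 * rng.normal(size=5)   # "trained" parameters

# First-order Taylor expansion around theta; here df/dtheta = phi and all
# higher derivatives are identically zero.
taylor1 = f(theta) + (theta_star - theta) @ phi

print(np.isclose(f(theta_star), taylor1))   # exact: no higher-order corrections
```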

Claim 2

The distributions of these random functions will be independent $$ \lim_{n\rightarrow \infty} p \left( f,\frac{df}{d\theta},\frac{d^2f}{d\theta^2}, \ldots \right) =p(f)p \left( \frac{df}{d\theta} \right) $$

Would someone explain the above claims?

Best answer:

First, it's important to understand that this result is not true for arbitrary functions; it holds specifically for the family of functions $f(x; \theta)$ that are multilayer perceptrons. At a very high level, these claims hold because of the way the central limit theorem applies to the neural network function.

In more detail: both claims are essentially contained in Jacot et al.: at infinite width, they show that the output of a trained network takes the simple form of a kernel machine, where the NTK is a fixed kernel: $$f(x_\beta;\theta^*)=f(x_\beta;\theta)- \sum_{\alpha_1, \alpha_2 \in\mathcal{A}} \Theta_{\beta \alpha_1} \Theta^{\alpha_1 \alpha_2} \left(f(x_{\alpha_2};\theta)- y_{\alpha_2} \right)\, .$$ In this PDLT notation, $\mathcal{A}$ is the training set, $x_\beta$ is a test example, $f(x_\beta;\theta)$ is the network's output on the test example at initialization, $ \Theta_{\beta \alpha_1} \equiv \Theta(x_{\beta}, x_{\alpha_1})$ is the NTK evaluated on a test example $x_\beta$ and a training example $x_{\alpha_1}$, $\Theta^{\alpha_1 \alpha_2}$ is the inverse of the $|\mathcal{A}| \times |\mathcal{A}|$-dimensional NTK submatrix evaluated on the training set only, $f(x_{\alpha_2};\theta)$ is the output of the network at initialization on the training example $x_{\alpha_2}$, and $y_{\alpha_2}$ is its associated label. (That is, the factor in parentheses is the initial training error.) Thus, this distribution depends only on the network output at initialization and the NTK. In conjunction with the definition of the NTK below, this explains Claim 1.
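To make the formula concrete, here is a hedged numpy sketch of the kernel prediction, using an arbitrary symmetric positive-definite matrix as a stand-in for the deterministic infinite-width NTK (the sizes and data are made up for illustration). A useful sanity check falls out of the formula: when the "test" point is actually one of the training points, $\Theta_{\beta \alpha_1} \Theta^{\alpha_1 \alpha_2}$ reduces to a Kronecker delta and the trained network interpolates that point's label exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 4 training points (indices alpha) plus 1 test point (index beta).
# Theta_full stands in for the fixed infinite-width NTK on [train + test]; any
# well-conditioned symmetric positive-definite matrix will do for illustration.
A = rng.normal(size=(5, 5))
Theta_full = A @ A.T + 5 * np.eye(5)

Theta_train = Theta_full[:4, :4]        # Theta_{alpha1 alpha2}
Theta_test_train = Theta_full[4, :4]    # Theta_{beta alpha1}

f_init = rng.normal(size=5)             # network outputs at initialization (train + test)
y = rng.normal(size=4)                  # training labels y_alpha

def ntk_prediction(theta_row, theta_train, f_beta, f_train, y_train):
    """f(x_beta; theta*) = f(x_beta; theta) - Theta_{b a1} Theta^{a1 a2} (f_a2 - y_a2)."""
    return f_beta - theta_row @ np.linalg.solve(theta_train, f_train - y_train)

pred_test = ntk_prediction(Theta_test_train, Theta_train, f_init[4], f_init[:4], y)

# Sanity check: treating training point 0 as the "test" point recovers its label.
pred_train0 = ntk_prediction(Theta_full[0, :4], Theta_train, f_init[0], f_init[:4], y)
print(np.isclose(pred_train0, y[0]))
```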

Now, the definition of the NTK is $$\Theta_{\alpha \beta} \equiv \sum_{\mu}^P \frac{df(x_\alpha; \theta)}{d\theta_\mu} \frac{df(x_\beta;\theta)}{d\theta_\mu}\, ,$$ where $\mu$ indexes the $P$ parameters, so the NTK is a sum over all the parameters of the model of products of first derivatives. In other words, the NTK is built only from $df/d\theta$. Moreover, Jacot et al. prove that the NTK at infinite width is a deterministic matrix; it is therefore independent of the network output at initialization, $p(f)$. This fact explains Claim 2.
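This definition is easy to compute directly for a small network. The sketch below uses a one-hidden-layer tanh network with a $1/\sqrt{n}$ output scaling (an illustrative choice, not the book's exact parameterization) and builds $\Theta$ as the Gram matrix of parameter gradients, which makes its symmetry and positive semi-definiteness manifest.

```python
import numpy as np

rng = np.random.default_rng(2)

n, d = 64, 3                       # width and input dimension (illustrative)
W = rng.normal(size=(n, d))        # first-layer weights
v = rng.normal(size=n)             # second-layer weights

def grad_f(x):
    """Gradient of f(x) = v . tanh(W x) / sqrt(n) w.r.t. all parameters (W, v), flattened."""
    h = np.tanh(W @ x)
    dv = h / np.sqrt(n)                            # df/dv_j
    dW = np.outer(v * (1 - h**2), x) / np.sqrt(n)  # df/dW_jk
    return np.concatenate([dW.ravel(), dv])

xs = [rng.normal(size=d) for _ in range(4)]
grads = np.stack([grad_f(x) for x in xs])

# NTK: Theta_{ab} = sum_mu df(x_a)/dtheta_mu * df(x_b)/dtheta_mu
Theta = grads @ grads.T

print(np.allclose(Theta, Theta.T))                 # symmetric by construction
print(np.linalg.eigvalsh(Theta).min() >= -1e-10)   # PSD, as any Gram matrix is
```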

In PDLT, the output of a trained network at infinite width (the first equation displayed above) is worked out in Chapter 10. The distribution of the network output at initialization is worked out in Chapter 4. The NTK is introduced in Chapter 7, and its distribution -- at infinite and then finite width -- is computed in Chapter 8. The Epilogue also recaps these claims in the context of the results of the book.

Finally, to understand these results and claims, it's also very instructive to think about how the situation changes at finite width. This helps make it clear why the infinite width results are what they are.
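For instance, one finite-width effect you can observe directly is that the NTK is a random object at initialization: it fluctuates from seed to seed, and the fluctuations shrink as the width grows, consistent with Jacot et al.'s deterministic infinite-width kernel. A rough empirical sketch (one-hidden-layer tanh network with $1/\sqrt{n}$ scaling; the widths and trial counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def ntk_diag(n, x, trials=200):
    """Samples of Theta(x, x) for a width-n one-hidden-layer tanh net, over random inits."""
    vals = []
    for _ in range(trials):
        W = rng.normal(size=(n, x.size))
        v = rng.normal(size=n)
        h = np.tanh(W @ x)
        dv = h / np.sqrt(n)
        dW = np.outer(v * (1 - h**2), x) / np.sqrt(n)
        vals.append(dW.ravel() @ dW.ravel() + dv @ dv)
    return np.array(vals)

x = np.array([1.0, -0.5])
narrow = ntk_diag(50, x)
wide = ntk_diag(5000, x)

# Init-to-init fluctuations of the NTK shrink (roughly like 1/sqrt(n)) with width,
# consistent with a deterministic kernel in the infinite-width limit.
print(narrow.std(), wide.std())
```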