What does $ \|f\|_{K^{\infty}} $ mean in this context?


The following is from page 23 of the paper The Modern Mathematics of Deep Learning. For the sake of simplicity, I've hidden the details.

The last line is $$\min _{f \in \mathcal{H}_{K^{\infty}}}\|f\|_{K^{\infty}} \quad \text { s.t. } \quad f\left(x^{(i)}\right)=y^{(i)}$$ Is $K^{\infty}$ a kernel in the limit $t\rightarrow\infty$? How is $\|\cdot\|_{K^{\infty}}$ defined? If possible, I would also like to know how $C(t)$ is derived in the text below.

We consider a one-dimensional prediction setting where the loss $\mathcal{L}(f,(x, y))$ depends on $x \in \mathcal{X}$ only through $f(x) \in \mathcal{Y}$, i.e., there exists a function $\ell: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ such that $$\mathcal{L}(f,(x, y))=\ell(f(x), y)$$ For instance, in case of the quadratic loss we have that $\ell(\hat{y}, y)=(\hat{y}-y)^{2}$. Further, let $\Phi$ be a NN with architecture $(N, \varrho)=\left(\left(d, N_{1}, \ldots, N_{L-1}, 1\right), \varrho\right)$ and let $\Theta_{0}$ be an $\mathbb{R}^{P(N)}$-valued random variable. For simplicity, we evolve the parameters of $\Phi$ according to the continuous version of gradient descent, so-called gradient flow, given by $$\frac{\mathrm{d} \Theta(t)}{\mathrm{d} t}=-\nabla_{\theta} \widehat{\mathcal{R}}_{s}(\Phi(\cdot, \Theta(t)))=-\frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} \Phi\left(x^{(i)}, \Theta(t)\right) D_{i}(t), \quad \Theta(0)=\Theta_{0}$$ where $D_{i}(t):=\left.\frac{\partial \ell\left(\hat{y}, y^{(i)}\right)}{\partial \hat{y}}\right|_{\hat{y}=\Phi\left(x^{(i)}, \Theta(t)\right)}$ is the derivative of the loss with respect to the prediction at input feature $x^{(i)}$ at time $t \in[0, \infty)$.
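To make the gradient-flow definition concrete, here is a minimal sketch of my own (not from the paper): gradient flow is approximated by explicit Euler steps, i.e., plain gradient descent with a small step size, on a toy width-2 tanh network with quadratic loss. All architecture choices and numbers are illustrative, and a finite-difference gradient stands in for the analytic one.

```python
import math

# Toy realization Phi(x, theta) of a (1, 2, 1) tanh network;
# theta = (w1, b1, v1, w2, b2, v2) collects all weights and biases.
def phi(x, th):
    w1, b1, v1, w2, b2, v2 = th
    return v1 * math.tanh(w1 * x + b1) + v2 * math.tanh(w2 * x + b2)

# Empirical risk R_hat with quadratic loss l(y_hat, y) = (y_hat - y)^2.
def risk(th, data):
    return sum((phi(x, th) - y) ** 2 for x, y in data) / len(data)

# Central finite differences stand in for the analytic nabla_theta R_hat.
def grad(th, data, eps=1e-6):
    g = []
    for k in range(len(th)):
        tp, tm = list(th), list(th)
        tp[k] += eps
        tm[k] -= eps
        g.append((risk(tp, data) - risk(tm, data)) / (2 * eps))
    return g

data = [(-1.0, 0.5), (0.0, -0.2), (1.0, 0.7)]
theta = [0.3, 0.1, 0.5, -0.4, 0.2, -0.3]
r0 = risk(theta, data)
dt = 0.05  # Euler step size; gradient descent recovers the flow as dt -> 0
for _ in range(2000):
    theta = [t - dt * g for t, g in zip(theta, grad(theta, data))]
r1 = risk(theta, data)  # risk decreases along the (discretized) flow
```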
The chain rule implies the following dynamics of the NN realization $$\frac{\mathrm{d} \Phi(\cdot, \Theta(t))}{\mathrm{d} t}=-\frac{1}{m} \sum_{i=1}^{m} K_{\Theta(t)}\left(\cdot, x^{(i)}\right) D_{i}(t)$$ and its empirical risk $$\frac{\mathrm{d} \widehat{\mathcal{R}}_{s}(\Phi(\cdot, \Theta(t)))}{\mathrm{d} t}=-\frac{1}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m} D_{i}(t) K_{\Theta(t)}\left(x^{(i)}, x^{(j)}\right) D_{j}(t)$$ where $K_{\theta}$, $\theta \in \mathbb{R}^{P(N)}$, is the so-called neural tangent kernel (NTK) $$K_{\theta}: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}, \quad K_{\theta}\left(x_{1}, x_{2}\right)=\left(\nabla_{\theta} \Phi\left(x_{1}, \theta\right)\right)^{T} \nabla_{\theta} \Phi\left(x_{2}, \theta\right)$$ Now let $\sigma_{w}, \sigma_{b} \in(0, \infty)$ and assume that the initialization $\Theta_{0}$ consists of independent entries, where entries corresponding to the weight matrix and bias vector in the $\ell$-th layer follow a normal distribution with zero mean and variances $\sigma_{w}^{2} / N_{\ell}$ and $\sigma_{b}^{2}$, respectively. Under weak assumptions on the activation function, the central limit theorem implies that the pre-activations converge to i.i.d. centered Gaussian processes in the infinite-width limit $N_{1}, \ldots, N_{L-1} \rightarrow \infty$, see [LBN$^{+}$18, MHR$^{+}$18]. Similarly, $K_{\Theta_{0}}$ also converges to a deterministic kernel $K^{\infty}$, which stays constant in time and depends only on the activation function $\varrho$, the depth $L$, and the initialization parameters $\sigma_{w}$ and $\sigma_{b}$ [JGH18, ADH$^{+}$19, Yan19, LXS$^{+}$20]. Thus, in the infinite-width limit, gradient flow on the NN parameters as in $(2.1)$ is equivalent to functional gradient flow in the reproducing kernel Hilbert space $\left(\mathcal{H}_{K^{\infty}},\|\cdot\|_{K^{\infty}}\right)$ corresponding to $K^{\infty}$, see $(2.2)$.
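The NTK definition can be checked numerically on the same kind of toy network (again an illustrative setup of mine, with finite-difference gradients standing in for $\nabla_\theta \Phi$):

```python
import math

# Toy (1, 2, 1) tanh network as a stand-in for Phi.
def phi(x, th):
    w1, b1, v1, w2, b2, v2 = th
    return v1 * math.tanh(w1 * x + b1) + v2 * math.tanh(w2 * x + b2)

# nabla_theta Phi(x, theta) via central finite differences.
def grad_phi(x, th, eps=1e-6):
    g = []
    for k in range(len(th)):
        tp, tm = list(th), list(th)
        tp[k] += eps
        tm[k] -= eps
        g.append((phi(x, tp) - phi(x, tm)) / (2 * eps))
    return g

# Empirical NTK: K_theta(x1, x2) = <nabla_theta Phi(x1), nabla_theta Phi(x2)>.
def ntk(x1, x2, th):
    return sum(a * b for a, b in zip(grad_phi(x1, th), grad_phi(x2, th)))

theta = [0.3, 0.1, 0.5, -0.4, 0.2, -0.3]
k12 = ntk(0.2, 0.8, theta)    # symmetric: equals ntk(0.8, 0.2, theta)
k21 = ntk(0.8, 0.2, theta)
kdiag = ntk(0.5, 0.5, theta)  # nonnegative, being a squared norm
```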
By (2.3), the empirical risk converges to a global minimum as long as the kernel evaluated at the input features, $\bar{K}^{\infty}:=\left(K^{\infty}\left(x^{(i)}, x^{(j)}\right)\right)_{i, j=1}^{m} \in \mathbb{R}^{m \times m}$, is positive definite (see, e.g., [JGH18, DLL$^{+}$19] for suitable conditions) and the $\ell\left(\cdot, y^{(i)}\right)$ are convex and lower bounded. For instance, in case of the quadratic loss the solution of $(2.2)$ is then given by $$\Phi(\cdot, \Theta(t))=C(t)\left(y^{(i)}\right)_{i=1}^{m}+\left(\Phi\left(\cdot, \Theta_{0}\right)-C(t)\left(\Phi\left(x^{(i)}, \Theta_{0}\right)\right)_{i=1}^{m}\right)$$ where $C(t):=\left(\left(K^{\infty}\left(\cdot, x^{(i)}\right)\right)_{i=1}^{m}\right)^{T}\left(\bar{K}^{\infty}\right)^{-1}\left(\mathrm{I}_{m}-e^{-\frac{2 \bar{K}^{\infty} t}{m}}\right).$ As the initial realization $\Phi\left(\cdot, \Theta_{0}\right)$ constitutes a centered Gaussian process, the second term in $(2.5)$ follows a normal distribution with zero mean at each input. In the limit $t \rightarrow \infty$, its variance vanishes on the input features $x^{(i)}, i \in[m]$, and the first term converges to the minimum kernel-norm interpolator, i.e., to the solution of $$\min _{f \in \mathcal{H}_{K^{\infty}}}\|f\|_{K^{\infty}} \quad \text { s.t. } \quad f\left(x^{(i)}\right)=y^{(i)}$$
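As $t \to \infty$, the exponential term in $C(t)$ vanishes and the first term becomes the kernel interpolant $f(\cdot) = \left(K^{\infty}(\cdot, x^{(i)})\right)_{i=1}^m{}^T (\bar{K}^{\infty})^{-1} y$. A small sketch of this limit, with an RBF kernel of my choosing playing the role of $K^\infty$ (purely for illustration; two data points, so the Gram inverse is done by hand):

```python
import math

# Hypothetical kernel standing in for K^infty (an RBF, for illustration only).
def k(x1, x2):
    return math.exp(-(x1 - x2) ** 2)

xs = [0.0, 1.0]
ys = [1.0, -1.0]

# Gram matrix K_bar and its inverse (2x2, inverted by hand).
a, b = k(xs[0], xs[0]), k(xs[0], xs[1])
c, d = k(xs[1], xs[0]), k(xs[1], xs[1])
det = a * d - b * c
Kinv = [[d / det, -b / det], [-c / det, a / det]]

# alpha = K_bar^{-1} y; the t -> infty limit of C(t) y is
# f(x) = sum_i alpha_i K(x, x_i).
alpha = [Kinv[0][0] * ys[0] + Kinv[0][1] * ys[1],
         Kinv[1][0] * ys[0] + Kinv[1][1] * ys[1]]

def f(x):
    return sum(al * k(x, xi) for al, xi in zip(alpha, xs))

# f interpolates the data, as required by the constraint f(x_i) = y_i.
f0, f1 = f(xs[0]), f(xs[1])
```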



Best answer:

Regarding the first question:

The norm $\|\cdot\|_{K^\infty}$ is given by the reproducing kernel Hilbert space corresponding to the kernel $K^\infty$.

Let us explain the different concepts in the following:

  1. Reproducing kernel Hilbert space:

    The Moore–Aronszajn theorem states that a symmetric, positive definite kernel $K$ on a set $X$ defines a unique Hilbert space $\left(\mathcal{H}_{K},\|\cdot\|_{K}\right)$ of functions on $X$ for which the evaluation functionals $\mathcal{H}_K\ni f\mapsto f(x)$, $x\in X$, are continuous linear functionals.

    The Hilbert space $\mathcal{H}_{K}$ is given as the completion of $\mathcal{H}_K^0 := \mathrm{span} \big(\{K(x,\cdot) \colon x\in X\} \big)$ w.r.t. the inner product

    $$ \left\langle \sum_{i=1}^m a_i K(x^{(i)},\cdot), \sum_{j=1}^n b_j K(y^{(j)},\cdot) \right \rangle_{K} := \sum_{i=1}^m \sum_{j=1}^n {a_i} b_j K(x^{(i)}, y^{(j)}). $$
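For $f=\sum_i a_i K(x^{(i)},\cdot)\in\mathcal{H}_K^0$, this inner product gives $\|f\|_K^2=\sum_{i,j}a_i a_j K(x^{(i)},x^{(j)})$, which can be evaluated directly. A short sketch (the Gaussian RBF kernel and the numbers are arbitrary illustrative choices; any positive definite kernel works):

```python
import math

# RKHS norm of f = sum_i a_i K(x_i, ·):
#   ||f||_K^2 = sum_{i,j} a_i a_j K(x_i, x_j).
# The Gaussian RBF below is an arbitrary positive definite example.
def k(x1, x2):
    return math.exp(-(x1 - x2) ** 2)

xs = [0.0, 0.5, 1.0]   # centers x^(i)
a = [1.0, -2.0, 0.5]   # coefficients a_i

norm_sq = sum(a[i] * a[j] * k(xs[i], xs[j])
              for i in range(3) for j in range(3))
norm = math.sqrt(norm_sq)  # ||f||_K; strict positive definiteness of the
                           # RBF kernel guarantees norm_sq > 0 for a != 0
```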

  2. The limiting kernel $K^\infty$:

    Under suitable assumptions, the random kernel $K_{\Theta(t)}$ converges in the infinite-width limit $N_1,\dots,N_{L-1}\to \infty$ to the deterministic kernel $K^\infty$ which is given by

    $$K^\infty(x_1,x_2):= \sum_{\ell=1}^L \Sigma^{(\ell)}(x_1,x_2) \prod_{k=\ell+1}^{L} \dot{\Sigma}^{(k)}(x_1,x_2).$$

    In the above, the kernels $\Sigma^{(\ell)}$ and $\dot{\Sigma}^{(k)}$ are defined recursively by

    $$\Sigma^{(\ell+1)} = \sigma_b^2 + \sigma_w^2 \mathbb{L}^{\varrho}_{\Sigma^{(\ell)}} \quad \text{and} \quad \dot{\Sigma}^{(\ell+1)} = \sigma_w^2\mathbb{L}^{\dot{\varrho}}_{\Sigma^{(\ell)}}, $$

    where $\Sigma^{(1)}(x_1,x_2) := \sigma_b^2 + \frac{\sigma_w^2}{d} \langle x_1,x_2 \rangle$, $$\mathbb{L}_\Sigma^f(x_1,x_2):= \mathbb{E}[f(X)f(Y)] \quad \text{with} \quad (X,Y)\sim \mathcal{N}\big(0, (\Sigma(x_i,x_j))_{i,j=1}^2\big),$$ and $\dot{\varrho}$ denotes the derivative of $\varrho$.
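The Gaussian expectations $\mathbb{L}^{\varrho}_{\Sigma}$ in this recursion can be approximated by Monte Carlo sampling. The sketch below runs one step of the recursion for $\varrho=\tanh$; the values of $\sigma_w$, $\sigma_b$, the inputs, and the sample size are all illustrative choices of mine:

```python
import math
import random

# One step of the Sigma recursion with rho = tanh, estimated by Monte Carlo.
random.seed(0)
sw2, sb2, d = 1.0, 0.1, 1   # sigma_w^2, sigma_b^2, input dimension
x1, x2 = 0.5, -0.3

# Sigma^(1)(u, v) = sigma_b^2 + (sigma_w^2 / d) <u, v>
def sigma1(u, v):
    return sb2 + sw2 / d * (u * v)

c11, c12, c22 = sigma1(x1, x1), sigma1(x1, x2), sigma1(x2, x2)

# Sample (X, Y) ~ N(0, [[c11, c12], [c12, c22]]) via a 2x2 Cholesky factor.
L11 = math.sqrt(c11)
L21 = c12 / L11
L22 = math.sqrt(c22 - L21 ** 2)

n = 200_000
acc = 0.0
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    X, Y = L11 * z1, L21 * z1 + L22 * z2
    acc += math.tanh(X) * math.tanh(Y)

# Sigma^(2)(x1, x2) = sigma_b^2 + sigma_w^2 * E[tanh(X) tanh(Y)],
# up to Monte Carlo error.
sigma2 = sb2 + sw2 * acc / n
```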


Regarding the second question:

One can directly check that the expression for $\Phi(\cdot, \Theta(t))$, which involves the term $C(t)$, satisfies the corresponding differential equation as

  1. the evaluation at $t=0$ equals $\Phi(\cdot, \Theta_0)$ and

  2. the derivative w.r.t. $t$ equals $-\frac{1}{m} \sum_{i=1}^{m} K^{\infty}\left(\cdot, x^{(i)}\right) D_{i}(t)$, where in the infinite-width limit the kernel $K_{\Theta(t)}$ is replaced by the constant limit kernel $K^{\infty}$. Here, one uses the derivative of the matrix exponential, $\frac{\mathrm{d}}{\mathrm{d} t} e^{-\frac{2 \bar{K}^{\infty} t}{m}}=-\frac{2 \bar{K}^{\infty}}{m} e^{-\frac{2 \bar{K}^{\infty} t}{m}}$.
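Since the question also asked how $C(t)$ is derived, here is a sketch of a direct derivation for the quadratic loss (assuming $\bar{K}^{\infty}$ is invertible; this is my reconstruction, not quoted from the paper):

```latex
% Residuals on the training points: r(t) := (\Phi(x^{(i)},\Theta(t)) - y^{(i)})_{i=1}^m.
% With \ell(\hat y, y) = (\hat y - y)^2 we have D_i(t) = 2 r_i(t), so evaluating the
% functional gradient flow at the training inputs gives a linear ODE:
\frac{\mathrm{d}r}{\mathrm{d}t} = -\frac{2}{m}\,\bar{K}^{\infty} r(t)
\quad\Longrightarrow\quad
r(t) = e^{-\frac{2\bar{K}^{\infty} t}{m}}\, r(0).
% Off the training points, with k(\cdot) := \big(K^{\infty}(\cdot,x^{(i)})\big)_{i=1}^m,
\frac{\mathrm{d}\Phi(\cdot,\Theta(t))}{\mathrm{d}t}
= -\frac{2}{m}\, k(\cdot)^T r(t)
= -\frac{2}{m}\, k(\cdot)^T e^{-\frac{2\bar{K}^{\infty} t}{m}}\, r(0).
% Integrating from 0 to t (using \int_0^t e^{-2\bar{K}^{\infty} s/m}\,\mathrm{d}s
% = \tfrac{m}{2}(\bar{K}^{\infty})^{-1}(\mathrm{I}_m - e^{-2\bar{K}^{\infty} t/m})):
\Phi(\cdot,\Theta(t))
= \Phi(\cdot,\Theta_0)
- k(\cdot)^T (\bar{K}^{\infty})^{-1}
  \Big(\mathrm{I}_m - e^{-\frac{2\bar{K}^{\infty} t}{m}}\Big) r(0)
= C(t)\,\big(y^{(i)}\big)_{i=1}^m
+ \Phi(\cdot,\Theta_0) - C(t)\,\big(\Phi(x^{(i)},\Theta_0)\big)_{i=1}^m.
```

Reading off the coefficient of $r(0)$ gives exactly $C(t)=\left(\left(K^{\infty}\left(\cdot, x^{(i)}\right)\right)_{i=1}^{m}\right)^{T}\left(\bar{K}^{\infty}\right)^{-1}\left(\mathrm{I}_{m}-e^{-\frac{2 \bar{K}^{\infty} t}{m}}\right)$.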

Second answer:

It is defined that $$K_{\theta}: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}, \quad K_{\theta}\left(x_{1}, x_{2}\right)=\left(\nabla_{\theta} \Phi\left(x_{1}, \theta\right)\right)^{\top} \nabla_{\theta} \Phi\left(x_{2}, \theta\right)$$ where $\theta$ is a point in the $P(N)$-dimensional real parameter space, so $K_\theta$ is a bivariate function that outputs a real number. In the infinite-width limit $N_1,\dots,N_{L-1}\to\infty$, the kernel $K_{\Theta_0}$ converges to the deterministic kernel $K^\infty$, which remains constant along the gradient flow. The norm $\|f\|_{K^\infty}$ is the norm of the reproducing kernel Hilbert space associated with $K^\infty$. In particular, Mercer's theorem gives $$\|f\|_{K^\infty}=\sqrt{\sum_i\frac{\langle f,\varphi_i\rangle_{L^2}^2}{\lambda_i}}$$ where $\lambda_i$ and $\varphi_i$ are the eigenvalues and eigenfunctions of the Hilbert–Schmidt integral operator $$[T_{K^\infty}\varphi](x)=\int_\Omega K^\infty(x,s)\varphi(s)\,ds.$$ For the finite-width kernel $K_\theta$, this operator expands as $$[T_{K_\theta}\varphi](x)=\left(\nabla_{\theta}\Phi(x,\theta)\right)^\top\int_\Omega\nabla_{\theta}\Phi(s,\theta)\,\varphi(s)\,ds.$$ This form is unwieldy, so its exact expression is not used in the paper.
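A discrete analogue of the Mercer formula can be sanity-checked numerically: on a finite grid the integral operator becomes the Gram matrix, and the eigen-expansion of the norm agrees with the RKHS expression $a^T \bar{K} a$. The two-point example below, with an RBF kernel, is my own illustration (on two points the symmetric Gram matrix has closed-form eigenpairs):

```python
import math

# Discrete Mercer check: on two points the operator T_K becomes the 2x2
# Gram matrix K. For f = K a (values of f = sum_i a_i K(x_i, ·) on the grid),
#   sum_i <f, v_i>^2 / lambda_i  ==  a^T K a.
def k(x1, x2):
    return math.exp(-(x1 - x2) ** 2)

xs = [0.0, 1.0]
Kaa, Kab = k(xs[0], xs[0]), k(xs[0], xs[1])
# Symmetric 2x2 matrix [[Kaa, Kab], [Kab, Kaa]]: closed-form eigenpairs.
lam = [Kaa + Kab, Kaa - Kab]
v = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

a = [1.0, -0.5]
f = [Kaa * a[0] + Kab * a[1], Kab * a[0] + Kaa * a[1]]  # f = K a

mercer = sum((f[0] * v[i][0] + f[1] * v[i][1]) ** 2 / lam[i]
             for i in range(2))      # Mercer-type eigen-expansion
rkhs = a[0] * f[0] + a[1] * f[1]     # a^T K a
```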