What does $ \|f\|_{K^{\infty}} $ mean in this context?


The following is from page 23 of the paper The Modern Mathematics of Deep Learning. For the sake of simplicity, I've hidden the details.

The last line is $$\min _{f \in \mathcal{H}_{K^{\infty}}}\|f\|_{K^{\infty}} \quad \text { s.t. } \quad f\left(x^{(i)}\right)=y^{(i)}$$ Is $K^{\infty}$ a kernel in the limit $t\rightarrow\infty$? How is $\|\cdot\|_{K^{\infty}}$ defined? If possible, I would also like to know how $C(t)$ is derived in the text below.

We consider a one-dimensional prediction setting where the loss $\mathcal{L}(f,(x, y))$ depends on $x \in \mathcal{X}$ only through $f(x) \in \mathcal{Y}$, i.e., there exists a function $\ell: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ such that $$\mathcal{L}(f,(x, y))=\ell(f(x), y)$$ For instance, in case of the quadratic loss we have that $\ell(\hat{y}, y)=(\hat{y}-y)^{2}$. Further, let $\Phi$ be a NN with architecture $(N, \varrho)=\left(\left(d, N_{1}, \ldots, N_{L-1}, 1\right), \varrho\right)$ and let $\Theta_{0}$ be an $\mathbb{R}^{P(N)}$-valued random variable. For simplicity, we evolve the parameters of $\Phi$ according to the continuous version of gradient descent, so-called gradient flow, given by $$\frac{\mathrm{d} \Theta(t)}{\mathrm{d} t}=-\nabla_{\theta} \widehat{\mathcal{R}}_{s}(\Phi(\cdot, \Theta(t)))=-\frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} \Phi\left(x^{(i)}, \Theta(t)\right) D_{i}(t), \quad \Theta(0)=\Theta_{0}$$ where $D_{i}(t):=\left.\frac{\partial \ell\left(\hat{y}, y^{(i)}\right)}{\partial \hat{y}}\right|_{\hat{y}=\Phi\left(x^{(i)}, \Theta(t)\right)}$ is the derivative of the loss with respect to the prediction at input feature $x^{(i)}$ at time $t \in[0, \infty)$.
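To make the gradient-flow definition concrete, here is a minimal sketch of my own (not from the paper): gradient flow is approximated by explicit Euler steps, i.e., plain gradient descent with a small step size, on a toy width-2 tanh network with quadratic loss. All architecture choices and numbers are illustrative, and a finite-difference gradient stands in for the analytic one.

```python
import math

# Toy realization Phi(x, theta) of a (1, 2, 1) tanh network;
# theta = (w1, b1, v1, w2, b2, v2) collects all weights and biases.
def phi(x, th):
    w1, b1, v1, w2, b2, v2 = th
    return v1 * math.tanh(w1 * x + b1) + v2 * math.tanh(w2 * x + b2)

# Empirical risk R_hat with quadratic loss l(y_hat, y) = (y_hat - y)^2.
def risk(th, data):
    return sum((phi(x, th) - y) ** 2 for x, y in data) / len(data)

# Central finite differences stand in for the analytic nabla_theta R_hat.
def grad(th, data, eps=1e-6):
    g = []
    for k in range(len(th)):
        tp, tm = list(th), list(th)
        tp[k] += eps
        tm[k] -= eps
        g.append((risk(tp, data) - risk(tm, data)) / (2 * eps))
    return g

data = [(-1.0, 0.5), (0.0, -0.2), (1.0, 0.7)]
theta = [0.3, 0.1, 0.5, -0.4, 0.2, -0.3]
r0 = risk(theta, data)
dt = 0.05  # Euler step size; gradient descent recovers the flow as dt -> 0
for _ in range(2000):
    theta = [t - dt * g for t, g in zip(theta, grad(theta, data))]
r1 = risk(theta, data)  # risk decreases along the (discretized) flow
```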
The chain rule implies the following dynamics of the NN realization $$\frac{\mathrm{d} \Phi(\cdot, \Theta(t))}{\mathrm{d} t}=-\frac{1}{m} \sum_{i=1}^{m} K_{\Theta(t)}\left(\cdot, x^{(i)}\right) D_{i}(t)$$ and its empirical risk $$\frac{\mathrm{d} \widehat{\mathcal{R}}_{s}(\Phi(\cdot, \Theta(t)))}{\mathrm{d} t}=-\frac{1}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m} D_{i}(t) K_{\Theta(t)}\left(x^{(i)}, x^{(j)}\right) D_{j}(t)$$ where $K_{\theta}$, $\theta \in \mathbb{R}^{P(N)}$, is the so-called neural tangent kernel (NTK) $$K_{\theta}: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}, \quad K_{\theta}\left(x_{1}, x_{2}\right)=\left(\nabla_{\theta} \Phi\left(x_{1}, \theta\right)\right)^{T} \nabla_{\theta} \Phi\left(x_{2}, \theta\right)$$ Now let $\sigma_{w}, \sigma_{b} \in(0, \infty)$ and assume that the initialization $\Theta_{0}$ consists of independent entries, where entries corresponding to the weight matrix and bias vector in the $\ell$-th layer follow a normal distribution with zero mean and variances $\sigma_{w}^{2} / N_{\ell}$ and $\sigma_{b}^{2}$, respectively. Under weak assumptions on the activation function, the central limit theorem implies that the pre-activations converge to i.i.d. centered Gaussian processes in the infinite-width limit $N_{1}, \ldots, N_{L-1} \rightarrow \infty$, see [LBN$^{+}$18, MHR$^{+}$18]. Similarly, $K_{\Theta_{0}}$ also converges to a deterministic kernel $K^{\infty}$, which stays constant in time and depends only on the activation function $\varrho$, the depth $L$, and the initialization parameters $\sigma_{w}$ and $\sigma_{b}$ [JGH18, ADH$^{+}$19, Yan19, LXS$^{+}$20]. Thus, in the infinite-width limit, gradient flow on the NN parameters as in $(2.1)$ is equivalent to functional gradient flow in the reproducing kernel Hilbert space $\left(\mathcal{H}_{K^{\infty}},\|\cdot\|_{K^{\infty}}\right)$ corresponding to $K^{\infty}$, see $(2.2)$.
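The NTK definition can be checked numerically on the same kind of toy network (again an illustrative setup of mine, with finite-difference gradients standing in for $\nabla_\theta \Phi$):

```python
import math

# Toy (1, 2, 1) tanh network as a stand-in for Phi.
def phi(x, th):
    w1, b1, v1, w2, b2, v2 = th
    return v1 * math.tanh(w1 * x + b1) + v2 * math.tanh(w2 * x + b2)

# nabla_theta Phi(x, theta) via central finite differences.
def grad_phi(x, th, eps=1e-6):
    g = []
    for k in range(len(th)):
        tp, tm = list(th), list(th)
        tp[k] += eps
        tm[k] -= eps
        g.append((phi(x, tp) - phi(x, tm)) / (2 * eps))
    return g

# Empirical NTK: K_theta(x1, x2) = <nabla_theta Phi(x1), nabla_theta Phi(x2)>.
def ntk(x1, x2, th):
    return sum(a * b for a, b in zip(grad_phi(x1, th), grad_phi(x2, th)))

theta = [0.3, 0.1, 0.5, -0.4, 0.2, -0.3]
k12 = ntk(0.2, 0.8, theta)    # symmetric: equals ntk(0.8, 0.2, theta)
k21 = ntk(0.8, 0.2, theta)
kdiag = ntk(0.5, 0.5, theta)  # nonnegative, being a squared norm
```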
By (2.3), the empirical risk converges to a global minimum as long as the kernel evaluated at the input features, $\bar{K}^{\infty}:=\left(K^{\infty}\left(x^{(i)}, x^{(j)}\right)\right)_{i, j=1}^{m} \in \mathbb{R}^{m \times m}$, is positive definite (see, e.g., [JGH18, DLL$^{+}$19] for suitable conditions) and the $\ell\left(\cdot, y^{(i)}\right)$ are convex and lower bounded. For instance, in case of the quadratic loss the solution of $(2.2)$ is then given by $$\Phi(\cdot, \Theta(t))=C(t)\left(y^{(i)}\right)_{i=1}^{m}+\left(\Phi\left(\cdot, \Theta_{0}\right)-C(t)\left(\Phi\left(x^{(i)}, \Theta_{0}\right)\right)_{i=1}^{m}\right)$$ where $C(t):=\left(\left(K^{\infty}\left(\cdot, x^{(i)}\right)\right)_{i=1}^{m}\right)^{T}\left(\bar{K}^{\infty}\right)^{-1}\left(\mathrm{I}_{m}-e^{-\frac{2 \bar{K}^{\infty} t}{m}}\right).$ As the initial realization $\Phi\left(\cdot, \Theta_{0}\right)$ constitutes a centered Gaussian process, the second term in $(2.5)$ follows a normal distribution with zero mean at each input. In the limit $t \rightarrow \infty$, its variance vanishes on the input features $x^{(i)}, i \in[m]$, and the first term converges to the minimum kernel-norm interpolator, i.e., to the solution of $$\min _{f \in \mathcal{H}_{K^{\infty}}}\|f\|_{K^{\infty}} \quad \text { s.t. } \quad f\left(x^{(i)}\right)=y^{(i)}$$
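As $t \to \infty$, the exponential term in $C(t)$ vanishes and the first term becomes the kernel interpolant $f(\cdot) = \left(K^{\infty}(\cdot, x^{(i)})\right)_{i=1}^m{}^T (\bar{K}^{\infty})^{-1} y$. A small sketch of this limit, with an RBF kernel of my choosing playing the role of $K^\infty$ (purely for illustration; two data points, so the Gram inverse is done by hand):

```python
import math

# Hypothetical kernel standing in for K^infty (an RBF, for illustration only).
def k(x1, x2):
    return math.exp(-(x1 - x2) ** 2)

xs = [0.0, 1.0]
ys = [1.0, -1.0]

# Gram matrix K_bar and its inverse (2x2, inverted by hand).
a, b = k(xs[0], xs[0]), k(xs[0], xs[1])
c, d = k(xs[1], xs[0]), k(xs[1], xs[1])
det = a * d - b * c
Kinv = [[d / det, -b / det], [-c / det, a / det]]

# alpha = K_bar^{-1} y; the t -> infty limit of C(t) y is
# f(x) = sum_i alpha_i K(x, x_i).
alpha = [Kinv[0][0] * ys[0] + Kinv[0][1] * ys[1],
         Kinv[1][0] * ys[0] + Kinv[1][1] * ys[1]]

def f(x):
    return sum(al * k(x, xi) for al, xi in zip(alpha, xs))

# f interpolates the data, as required by the constraint f(x_i) = y_i.
f0, f1 = f(xs[0]), f(xs[1])
```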



Best answer:

Regarding the first question:

The norm $\|\cdot\|_{K^\infty}$ is given by the reproducing kernel Hilbert space corresponding to the kernel $K^\infty$.

Let us explain the different concepts in the following:

  1. Reproducing kernel Hilbert space:

    The Moore–Aronszajn theorem states that a symmetric, positive definite kernel $K$ on a set $X$ defines a unique Hilbert space $\left(\mathcal{H}_{K},\|\cdot\|_{K}\right)$ of functions on $X$ for which the evaluation functionals $\mathcal{H}_K\ni f\mapsto f(x)$, $x\in X$, are continuous linear functionals.

    The Hilbert space $\mathcal{H}_{K}$ is given as the completion of $\mathcal{H}_K^0 := \mathrm{span} \big(\{K(x,\cdot) \colon x\in X\} \big)$ w.r.t. the inner product

    $$ \left\langle \sum_{i=1}^m a_i K(x^{(i)},\cdot), \sum_{j=1}^n b_j K(y^{(j)},\cdot) \right \rangle_{K} := \sum_{i=1}^m \sum_{j=1}^n {a_i} b_j K(x^{(i)}, y^{(j)}). $$
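For $f=\sum_i a_i K(x^{(i)},\cdot)\in\mathcal{H}_K^0$, this inner product gives $\|f\|_K^2=\sum_{i,j}a_i a_j K(x^{(i)},x^{(j)})$, which can be evaluated directly. A short sketch (the Gaussian RBF kernel and the numbers are arbitrary illustrative choices; any positive definite kernel works):

```python
import math

# RKHS norm of f = sum_i a_i K(x_i, ·):
#   ||f||_K^2 = sum_{i,j} a_i a_j K(x_i, x_j).
# The Gaussian RBF below is an arbitrary positive definite example.
def k(x1, x2):
    return math.exp(-(x1 - x2) ** 2)

xs = [0.0, 0.5, 1.0]   # centers x^(i)
a = [1.0, -2.0, 0.5]   # coefficients a_i

norm_sq = sum(a[i] * a[j] * k(xs[i], xs[j])
              for i in range(3) for j in range(3))
norm = math.sqrt(norm_sq)  # ||f||_K; strict positive definiteness of the
                           # RBF kernel guarantees norm_sq > 0 for a != 0
```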

  2. The limiting kernel $K^\infty$:

    Under suitable assumptions, the random kernel $K_{\Theta(t)}$ converges in the infinite-width limit $N_1,\dots,N_{L-1}\to \infty$ to the deterministic kernel $K^\infty$ which is given by

    $$K^\infty(x_1,x_2):= \sum_{\ell=1}^L \Sigma^{(\ell)}(x_1,x_2) \prod_{k=\ell+1}^{L} \dot{\Sigma}^{(k)}(x_1,x_2).$$

    In the above, the kernels $\Sigma^{(\ell)}$ and $\dot{\Sigma}^{(k)}$ are defined recursively by

    $$\Sigma^{(\ell+1)} = \sigma_b^2 + \sigma_w^2 \mathbb{L}^{\varrho}_{\Sigma^{(\ell)}} \quad \text{and} \quad \dot{\Sigma}^{(\ell+1)} = \sigma_w^2\mathbb{L}^{\dot{\varrho}}_{\Sigma^{(\ell)}}, $$

    where $\Sigma^{(1)}(x_1,x_2) := \sigma_b^2 + \frac{\sigma_w^2}{d} \langle x_1,x_2 \rangle$, $$\mathbb{L}_\Sigma^f(x_1,x_2):= \mathbb{E}[f(X)f(Y)] \quad \text{with} \quad (X,Y)\sim \mathcal{N}\big(0, (\Sigma(x_i,x_j))_{i,j=1}^2\big),$$ and $\dot{\varrho}$ denotes the derivative of $\varrho$.
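The Gaussian expectations $\mathbb{L}^{\varrho}_{\Sigma}$ in this recursion can be approximated by Monte Carlo sampling. The sketch below runs one step of the recursion for $\varrho=\tanh$; the values of $\sigma_w$, $\sigma_b$, the inputs, and the sample size are all illustrative choices of mine:

```python
import math
import random

# One step of the Sigma recursion with rho = tanh, estimated by Monte Carlo.
random.seed(0)
sw2, sb2, d = 1.0, 0.1, 1   # sigma_w^2, sigma_b^2, input dimension
x1, x2 = 0.5, -0.3

# Sigma^(1)(u, v) = sigma_b^2 + (sigma_w^2 / d) <u, v>
def sigma1(u, v):
    return sb2 + sw2 / d * (u * v)

c11, c12, c22 = sigma1(x1, x1), sigma1(x1, x2), sigma1(x2, x2)

# Sample (X, Y) ~ N(0, [[c11, c12], [c12, c22]]) via a 2x2 Cholesky factor.
L11 = math.sqrt(c11)
L21 = c12 / L11
L22 = math.sqrt(c22 - L21 ** 2)

n = 200_000
acc = 0.0
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    X, Y = L11 * z1, L21 * z1 + L22 * z2
    acc += math.tanh(X) * math.tanh(Y)

# Sigma^(2)(x1, x2) = sigma_b^2 + sigma_w^2 * E[tanh(X) tanh(Y)],
# up to Monte Carlo error.
sigma2 = sb2 + sw2 * acc / n
```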


Regarding the second question:

One can directly check that the expression for $\Phi(\cdot, \Theta(t))$, which involves the term $C(t)$, satisfies the corresponding differential equation as

  1. the evaluation at $t=0$ equals $\Phi(\cdot, \Theta_0)$ and

  2. the derivative w.r.t. $t$ equals $-\frac{1}{m} \sum_{i=1}^{m} K^{\infty}\left(\cdot, x^{(i)}\right) D_{i}(t)$, where in the infinite-width limit the kernel $K_{\Theta(t)}$ is replaced by the constant limit kernel $K^{\infty}$. Here, one uses the derivative of the matrix exponential, $\frac{\mathrm{d}}{\mathrm{d} t} e^{-\frac{2 \bar{K}^{\infty} t}{m}}=-\frac{2 \bar{K}^{\infty}}{m} e^{-\frac{2 \bar{K}^{\infty} t}{m}}$.
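Since the question also asked how $C(t)$ is derived, here is a sketch of a direct derivation for the quadratic loss (assuming $\bar{K}^{\infty}$ is invertible; this is my reconstruction, not quoted from the paper):

```latex
% Residuals on the training points: r(t) := (\Phi(x^{(i)},\Theta(t)) - y^{(i)})_{i=1}^m.
% With \ell(\hat y, y) = (\hat y - y)^2 we have D_i(t) = 2 r_i(t), so evaluating the
% functional gradient flow at the training inputs gives a linear ODE:
\frac{\mathrm{d}r}{\mathrm{d}t} = -\frac{2}{m}\,\bar{K}^{\infty} r(t)
\quad\Longrightarrow\quad
r(t) = e^{-\frac{2\bar{K}^{\infty} t}{m}}\, r(0).
% Off the training points, with k(\cdot) := \big(K^{\infty}(\cdot,x^{(i)})\big)_{i=1}^m,
\frac{\mathrm{d}\Phi(\cdot,\Theta(t))}{\mathrm{d}t}
= -\frac{2}{m}\, k(\cdot)^T r(t)
= -\frac{2}{m}\, k(\cdot)^T e^{-\frac{2\bar{K}^{\infty} t}{m}}\, r(0).
% Integrating from 0 to t (using \int_0^t e^{-2\bar{K}^{\infty} s/m}\,\mathrm{d}s
% = \tfrac{m}{2}(\bar{K}^{\infty})^{-1}(\mathrm{I}_m - e^{-2\bar{K}^{\infty} t/m})):
\Phi(\cdot,\Theta(t))
= \Phi(\cdot,\Theta_0)
- k(\cdot)^T (\bar{K}^{\infty})^{-1}
  \Big(\mathrm{I}_m - e^{-\frac{2\bar{K}^{\infty} t}{m}}\Big) r(0)
= C(t)\,\big(y^{(i)}\big)_{i=1}^m
+ \Phi(\cdot,\Theta_0) - C(t)\,\big(\Phi(x^{(i)},\Theta_0)\big)_{i=1}^m.
```

Reading off the coefficient of $r(0)$ gives exactly $C(t)=\left(\left(K^{\infty}\left(\cdot, x^{(i)}\right)\right)_{i=1}^{m}\right)^{T}\left(\bar{K}^{\infty}\right)^{-1}\left(\mathrm{I}_{m}-e^{-\frac{2 \bar{K}^{\infty} t}{m}}\right)$.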

Second answer:

It is defined that $$K_{\theta}: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}, \quad K_{\theta}\left(x_{1}, x_{2}\right)=\left(\nabla_{\theta} \Phi\left(x_{1}, \theta\right)\right)^{\top} \nabla_{\theta} \Phi\left(x_{2}, \theta\right)$$ where $\theta$ is a point in the $P(N)$-dimensional real parameter space, so $K_\theta$ is a bivariate function that outputs a real number. In the infinite-width limit $N_1,\dots,N_{L-1}\to\infty$, the kernel $K_{\Theta_0}$ converges to the deterministic kernel $K^\infty$, which remains constant along the gradient flow. The norm $\|f\|_{K^\infty}$ is the norm of the reproducing kernel Hilbert space associated with $K^\infty$. In particular, Mercer's theorem gives $$\|f\|_{K^\infty}=\sqrt{\sum_i\frac{\langle f,\varphi_i\rangle_{L^2}^2}{\lambda_i}}$$ where $\lambda_i$ and $\varphi_i$ are the eigenvalues and eigenfunctions of the Hilbert–Schmidt integral operator $$[T_{K^\infty}\varphi](x)=\int_\Omega K^\infty(x,s)\varphi(s)\,ds.$$ For the finite-width kernel $K_\theta$, this operator expands as $$[T_{K_\theta}\varphi](x)=\left(\nabla_{\theta}\Phi(x,\theta)\right)^\top\int_\Omega\nabla_{\theta}\Phi(s,\theta)\,\varphi(s)\,ds.$$ This form is unwieldy, so its exact expression is not used in the paper.
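A discrete analogue of the Mercer formula can be sanity-checked numerically: on a finite grid the integral operator becomes the Gram matrix, and the eigen-expansion of the norm agrees with the RKHS expression $a^T \bar{K} a$. The two-point example below, with an RBF kernel, is my own illustration (on two points the symmetric Gram matrix has closed-form eigenpairs):

```python
import math

# Discrete Mercer check: on two points the operator T_K becomes the 2x2
# Gram matrix K. For f = K a (values of f = sum_i a_i K(x_i, ·) on the grid),
#   sum_i <f, v_i>^2 / lambda_i  ==  a^T K a.
def k(x1, x2):
    return math.exp(-(x1 - x2) ** 2)

xs = [0.0, 1.0]
Kaa, Kab = k(xs[0], xs[0]), k(xs[0], xs[1])
# Symmetric 2x2 matrix [[Kaa, Kab], [Kab, Kaa]]: closed-form eigenpairs.
lam = [Kaa + Kab, Kaa - Kab]
v = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

a = [1.0, -0.5]
f = [Kaa * a[0] + Kab * a[1], Kab * a[0] + Kaa * a[1]]  # f = K a

mercer = sum((f[0] * v[i][0] + f[1] * v[i][1]) ** 2 / lam[i]
             for i in range(2))      # Mercer-type eigen-expansion
rkhs = a[0] * f[0] + a[1] * f[1]     # a^T K a
```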