By the representer theorem for RKHSs, the solution of kernel ridge regression, $$\underset{f\in \mathcal{H}}{\min} \sum_{i=1}^N (f(x_i) - y_i)^2 +\lambda\|f\|_\mathcal{H},$$ where $\mathcal{H}$ is an RKHS, is given by a linear combination of kernel functions, i.e., $\hat{f} = \sum_{i=1}^N c_i k(\cdot, x_i)$. This looks like finding a function that interpolates the $y_i$'s. Is there any specific benefit or insight to such a formulation using a reproducing kernel, rather than using other nonlinear functions such as a Fourier basis? Or, in general, what is the benefit of using kernel methods in an RKHS?
Any specific advantages/insights for kernel ridge regression in RKHS?
Asked by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail)
It is fast because we are effectively solving a linear problem, and yet we obtain a high-performing nonlinear function. This is crucial for any application (think of an SVM classifying millions of data points). Any minimizer of the functional $$\mathcal F (f) = \frac{1}{n} \sum\limits_{i=1}^{n} l(f(X_i), Y_i) + G(\|f\|^2_{\mathcal H_K}),$$ where $l$ is a loss function and $G$ is a strictly non-decreasing function of the squared norm of $f$ in the RKHS with kernel $K$, has the form $$f^*(.) = \sum\limits_{i=1}^{n} \alpha_i K (x_i, . )$$
This is called the representer theorem. In the proof below, we see that what we are actually doing is minimizing over a subspace of ${\mathcal H_K}$, namely $S= \operatorname{span}\{ K(x_i, .) \mid 1 \leq i \leq n\}$, and the representer theorem guarantees that the minimizer we obtain there is the minimizer over the whole space ${\mathcal H_K}$. If you proceed like this with another subspace, say the span of the Legendre polynomials or of a Fourier basis, it does not hold in general that the minimizer over that subspace is the global minimizer.
The representer theorem has huge implications in practice. Let us look at an example. Let $\mathcal H_K$ be an RKHS with kernel $K$. Suppose we are interested in minimizing a regularized version of the squared error risk:
$$\hat f_{n,\lambda} = \operatorname{argmin}_{f \in \mathcal H_K} \frac{1}{n} \sum\limits_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal H_K}^2$$
Then we know that any minimizer $f$ has the form
$$ f(.) = \sum\limits_{j=1}^{n} \alpha_j K(x_j,.)$$
Observe that
$$f(x_i) = \sum\limits_{j=1}^{n} \alpha_j K(x_j,x_i) = \sum\limits_{j=1}^{n} \alpha_j K(x_i,x_j)$$
and that
\begin{align} \|f\| ^2_{\mathcal H_K} &= \langle { f, f }\rangle_{\mathcal H_K} \\ &= \Big\langle { \sum\limits_{i=1}^{n} \alpha_i K(x_i,.), \sum\limits_{j=1}^{n} \alpha_j K(x_j,.) }\Big\rangle_{\mathcal H_K} \\ &= \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} \alpha_i \alpha_j \big\langle {K(x_i,.), K(x_j,.) }\big\rangle_{\mathcal H_K} \end{align}
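This identity, $\|f\|^2_{\mathcal H_K} = \alpha^\top \mathbb K \alpha$, can be sanity-checked numerically with a kernel that admits an explicit finite-dimensional feature map. The sketch below is my own illustration (not from the answer), using the polynomial kernel $K(x,y) = (x^\top y)^2$ on $\mathbb R^2$, whose feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, y) = (x . y)^2 on R^2
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kern(x, y):
    return (x @ y) ** 2

rng = np.random.default_rng(3)
Xs = rng.standard_normal((5, 2))       # five sample points x_1, ..., x_5
alpha = rng.standard_normal(5)         # arbitrary coefficients

K = np.array([[kern(a, b) for b in Xs] for a in Xs])   # Gram matrix

# f = sum_i alpha_i K(x_i, .) corresponds to the weight vector w = sum_i alpha_i phi(x_i),
# so ||f||^2 in the RKHS equals ||w||^2 = alpha^T K alpha.
w = sum(a * phi(x) for a, x in zip(alpha, Xs))
```

Here `phi`, `kern`, and the random data are illustrative choices; any positive-definite kernel with a known feature map would do.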
Using matrix notation $\mathbb K_{ij}:=\big\langle K(x_i,.), K(x_j,.) \big\rangle_{\mathcal H_K} = K(x_i, x_j)$ (by the reproducing property), we get
$$\hat f_{n,\lambda} = \sum\limits_{j=1}^{n} \alpha_j^* K(x_j,.)$$
where
$$\alpha^* = \operatorname{argmin}_{\alpha} \frac{1}{n} \|\mathbb Y-\mathbb K \alpha\|^2 + \lambda \alpha^\top \mathbb K \alpha $$
This is a convex, differentiable cost function, so we can find its minimum by setting its gradient to zero:
$$ - \mathbb K ( \mathbb Y - \mathbb K \alpha^*) + \lambda n \mathbb K \alpha^* = 0 $$
which implies, when the matrix $\mathbb K$ is strictly positive definite, that
$$\alpha^* = ( \mathbb K + \lambda n \mathbb I )^{-1} \mathbb Y$$
where $\mathbb I$ is the identity matrix. This is an $n \times n$ linear system. It can be solved numerically by means of a Cholesky factorization, with $O(n^3)$ complexity.
Once $\alpha^*$ is computed, we can evaluate $\hat f_{n,\lambda}$ on new data:
$$\hat f_{n,\lambda} (x_{\text{new}}) = \sum\limits_{j=1}^{n} \alpha_j^* K(x_j,x_{\text{new}}) = \mathbb K_{x_{\text{new}}} \alpha^* = \mathbb K_{x_{\text{new}}}( \mathbb K + \lambda n \mathbb I )^{-1} \mathbb Y$$
Given $\alpha^*$, this evaluation costs $O(n)$ kernel evaluations per new example.
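Putting fitting and prediction together, here is a self-contained numpy sketch (again my own illustration, with an arbitrarily chosen Gaussian kernel and toy sine data):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(1)
n = 80
X = rng.uniform(-3, 3, size=(n, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Fit: alpha* = (K + lambda*n*I)^{-1} Y
lam = 1e-3
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * n * np.eye(n), Y)

# Predict: f_hat(x_new) = K_{x_new} alpha, one kernel row per new point
X_new = np.linspace(-3, 3, 200)[:, None]
preds = rbf_kernel(X_new, X) @ alpha
```

Note that all the nonlinearity lives in the kernel evaluations; the coefficients $\alpha^*$ come from a purely linear solve, which is exactly the computational point made above.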
Here is the proof of the representer theorem where we explicitly use the reproducing property.
We project $f$ onto the subspace
$$S= \operatorname{span} \{ K(x_i, .) | 1 \leq i \leq n\}$$
By the orthogonal decomposition theorem in Hilbert spaces, we have $f= f_1 +f_2$, where $f_1$ is the component in $S$ and $f_2$ is the component orthogonal to $S$. By the Pythagorean theorem, we have
$$ \|f\|^2 \geq \|f_1\|^2$$
Since $G$ is non-decreasing, we have
$$G(\|f\|^2_{\mathcal H_K}) \geq G(\|f_1\|^2_{\mathcal H_K})$$
Notice that (using the reproducing property)
$$f(x_i)= \langle f, K(x_i,.) \rangle = \langle f_1, K(x_i,.) \rangle + \langle f_2, K(x_i,.) \rangle = \langle f_1, K(x_i,.) \rangle = f_1(x_i)$$
This implies that
$$l(f(x_i), y_i) = l(f_1(x_i), y_i)$$
So the loss term depends only on the component of $f$ in $S$, while the penalty $G$ is minimized when $f$ lies entirely in $S$. Hence we can find a minimizer in $S$:
$$f^*(.) = \sum\limits_{i=1}^{n} \alpha_i K(x_i,.)$$
As $G$ is strictly non-decreasing, $\|f_2\|$ must be zero for $f$ to be a minimizer of $\mathcal F (f)$. In conclusion, any minimizer has the above form.
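The orthogonality argument can also be checked numerically. In the sketch below (my own illustration; the Gaussian kernel and the off-sample point $z$ are arbitrary choices), we project $K(z,.)$ onto $S$ and verify that the orthogonal residual $g$ vanishes at every training point, so adding it to a candidate $f$ changes no $f(x_j)$ (hence no loss term) and only inflates the penalty:

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # Gaussian kernel matrix: K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

X = np.linspace(-2, 2, 10)[:, None]   # training points x_1, ..., x_n
z = np.array([[3.0]])                 # an off-sample point, chosen arbitrarily

K = rbf(X, X)                          # Gram matrix <K(x_i,.), K(x_j,.)>
k_z = rbf(X, z)[:, 0]                  # <K(x_i,.), K(z,.)>

# Projection of K(z,.) onto S = span{K(x_i,.)}: coefficients c solve K c = k_z
c = np.linalg.solve(K, k_z)

# g = K(z,.) - sum_i c_i K(x_i,.) is orthogonal to S, hence g(x_j) = 0:
g_at_X = k_z - K @ c                   # ~0 up to round-off

# ||g||^2 = K(z,z) - k_z^T c > 0: adding g leaves the loss unchanged
# but strictly increases the penalty, as in the proof above.
g_norm_sq = rbf(z, z)[0, 0] - k_z @ c
```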