Any specific advantages/insights for kernel ridge-regression in RKHS?

By the representer theorem for RKHSs, the solution of kernel ridge regression in an RKHS, $$\underset{f\in \mathcal{H}}{\min} \sum_{i=1}^N (f(x_i) - y_i)^2 +\lambda\|f\|_\mathcal{H},$$ where $\mathcal{H}$ is an RKHS, is a linear combination of kernel functions centered at the data points, i.e., $\hat{f} = \sum_{i=1}^N c_i k(\cdot, x_i)$. This looks like finding a function that interpolates the $y_i$'s. Is there any specific benefit or insight in this formulation with a reproducing kernel, rather than with other nonlinear functions such as a Fourier basis? Or, in general, what is the benefit of using kernel methods in an RKHS?

Answer:
Kernel ridge regression in an RKHS is fast because we are effectively solving a linear problem, and yet we obtain a high-performing nonlinear function. This is crucial for large-scale applications (think of an SVM classifying millions of data points). Any minimizer of the functional $$\mathcal F (f) = \frac{1}{n} \sum\limits_{i=1}^{n} l(f(x_i), y_i) + G(\|f\|^2_{\mathcal H_K}),$$ where $l$ is a loss function and $G$ is a strictly increasing function of the squared norm of $f$ in the RKHS $\mathcal H_K$ with kernel $K$, has the form $$f^*(.) = \sum\limits_{i=1}^{n} \alpha_i K (x_i, . )$$
This is called the representer theorem. In the proof below, we see that what we are actually doing is minimizing over a subspace of ${\mathcal H_K}$, namely $S= \operatorname{span}\{ K(x_i, .) \mid 1 \leq i \leq n\}$, and the representer theorem guarantees that the minimizer obtained over this subspace is also the minimizer over the whole space ${\mathcal H_K}$. If you proceed like this with another finite-dimensional subspace, say the span of Legendre polynomials or of a truncated Fourier basis, it does not hold in general that the minimizer over that subspace is the global minimizer.
The representer theorem has huge implications in practice. Let us look at an example. Let $\mathcal H_K$ be an RKHS with kernel $K$. Suppose we are interested in minimizing a regularized squared-error risk
$$\hat f_{n,\lambda} = \operatorname{argmin}_{f \in \mathcal H_K} \frac{1}{n} \sum\limits_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal H_K}^2$$
Then we know that any minimizer $f$ has the form
$$ f(.) = \sum\limits_{j=1}^{n} \alpha_j K(x_j,.)$$
Observe that
$$f(x_i) = \sum\limits_{j=1}^{n} \alpha_j K(x_j,x_i) = \sum\limits_{j=1}^{n} \alpha_j K(x_i,x_j)$$
and that
\begin{align} \|f\| ^2_{\mathcal H_K} &= \langle { f, f }\rangle_{\mathcal H_K} \\ &= \Big\langle { \sum\limits_{i=1}^{n} \alpha_i K(x_i,.), \sum\limits_{j=1}^{n} \alpha_j K(x_j,.) }\Big\rangle_{\mathcal H_K} \\ &= \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} \alpha_i \alpha_j \big\langle {K(x_i,.), K(x_j,.) }\big\rangle_{\mathcal H_K} \end{align}
Using the matrix notation $\mathbb K_{ij}:=\big\langle K(x_i,.), K(x_j,.) \big\rangle_{\mathcal H_K} = K(x_i,x_j)$, so that $f(x_i) = (\mathbb K \alpha)_i$ and $\|f\|^2_{\mathcal H_K} = \alpha^\top \mathbb K \alpha$, we get
$$\hat f_{n,\lambda} = \sum\limits_{j=1}^{n} \alpha_j^* K(x_j,.)$$
where
$$\alpha^* = \operatorname{argmin}_{\alpha \in \mathbb R^n} \frac{1}{n} \|\mathbb Y-\mathbb K \alpha\|^2 + \lambda\, \alpha^\top \mathbb K \alpha$$
This is a convex, differentiable cost function, so we can find a minimizer by setting its gradient to zero,
$$ - \mathbb K ( \mathbb Y - \mathbb K \alpha^*) + \lambda n \mathbb K \alpha^* = 0 $$
which implies, when the matrix $\mathbb K$ is strictly positive definite, that
$$\alpha^* = ( \mathbb K + \lambda n \mathbb I )^{-1} \mathbb Y$$
where $\mathbb I$ is the identity matrix. This is an $n \times n$ linear system. It can be solved numerically by means of a Cholesky factorization, at a cost of $O(n^3)$.
Once $\alpha^*$ has been computed, we can evaluate $\hat f_{n,\lambda}$ on new data:
$$\hat f_{n,\lambda} (x_{\text{new}}) = \sum\limits_{j=1}^{n} \alpha_j^* K(x_j,x_{\text{new}}) = \mathbb K_{x_{\text{new}}} \alpha^* = \mathbb K_{x_{\text{new}}}( \mathbb K + \lambda n \mathbb I )^{-1} \mathbb Y$$
With $\alpha^*$ precomputed, this evaluation costs $O(n)$ kernel evaluations per new example.
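To make this concrete, here is a minimal Python/NumPy sketch of the two steps just described: the $O(n^3)$ Cholesky solve for $\alpha^*$ and the $O(n)$-per-point prediction. The Gaussian (RBF) kernel, its bandwidth `sigma`, the toy data, and the value of `lam` are illustrative assumptions, not part of the answer above.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2 * A @ B.T)
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))                   # training inputs x_1, ..., x_n
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)    # noisy targets y_1, ..., y_n
lam = 1e-2                                            # regularization parameter lambda

# Training: solve (K + lambda * n * I) alpha = Y via a Cholesky factorization, O(n^3).
K = rbf_kernel(X, X)
alpha = cho_solve(cho_factor(K + lam * n * np.eye(n)), y)

# Prediction: f_hat(x_new) = sum_j alpha_j K(x_j, x_new),
# i.e. n kernel evaluations per new point.
X_new = np.linspace(-3, 3, 5)[:, None]
y_pred = rbf_kernel(X_new, X) @ alpha
print(y_pred)
```

Note that only the Gram matrix enters the computation; the (possibly infinite-dimensional) feature space never appears explicitly, which is exactly the point of working with a kernel.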
Here is the proof of the representer theorem where we explicitly use the reproducing property.
We project $f$ onto the subspace
$$S= \operatorname{span} \{ K(x_i, .) | 1 \leq i \leq n\}$$
By the orthogonal decomposition theorem in Hilbert spaces, we have $f= f_1 +f_2$, where $f_1$ is the component in $S$ and $f_2$ is the component orthogonal to $S$. By the Pythagorean theorem, we have
$$ \|f\|^2 \geq \|f_1\|^2$$
Since $G$ is non-decreasing, we have
$$G(\|f\|^2_{\mathcal H_K}) \geq G(\|f_1\|^2_{\mathcal H_K})$$
Notice that (using the reproducing property)
$$f(x_i)= \langle f, K(x_i,.) \rangle = \langle f_1, K(x_i,.) \rangle + \langle f_2, K(x_i,.) \rangle = \langle f_1, K(x_i,.) \rangle = f_1(x_i)$$
This implies that
$$l(f(x_i), y_i) = l(f_1(x_i), y_i)$$
So the loss terms depend only on the component $f_1$ of $f$ in $S$, while replacing $f$ by $f_1$ can only decrease the penalty $G(\|f\|^2_{\mathcal H_K})$. Hence a minimizer of $\mathcal F$ can always be found in $S$:
$$f^*(.) = \sum\limits_{i=1}^{n} \alpha_i K(x_i,.)$$
Moreover, since $G$ is strictly increasing, $\|f_2\|$ must be zero for $f$ to be a minimizer of $\mathcal F$: otherwise $f_1$ would achieve the same loss with a strictly smaller penalty. In conclusion, every minimizer has the above form.
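To see the projection argument in action numerically, here is a small sketch (again with an illustrative Gaussian kernel and random points, which are assumptions of the example, not of the proof): we build an $f$ whose kernel centers are not the training points, compute its orthogonal projection $f_1$ onto $S$, and check that $f$ and $f_1$ agree at every training point while $\|f_1\|_{\mathcal H_K} \le \|f\|_{\mathcal H_K}$.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2 * A @ B.T)
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))    # training inputs x_1, ..., x_n
Z = rng.normal(size=(10, 2))    # centers of an arbitrary f = sum_j c_j K(z_j, .)
c = rng.normal(size=10)

K_xx = rbf_kernel(X, X)         # <K(x_i,.), K(x_j,.)>
K_xz = rbf_kernel(X, Z)         # <K(x_i,.), K(z_j,.)>
K_zz = rbf_kernel(Z, Z)         # <K(z_i,.), K(z_j,.)>

# Projection f_1 = sum_i beta_i K(x_i, .): the orthogonality conditions
# <f - f_1, K(x_i, .)> = 0 read K_xx beta = K_xz c.
beta = np.linalg.solve(K_xx, K_xz @ c)

f_at_X = K_xz @ c               # f(x_i), by the reproducing property
f1_at_X = K_xx @ beta           # f_1(x_i)
print(np.allclose(f_at_X, f1_at_X))    # True: the loss term cannot distinguish f from f_1

norm2_f = c @ K_zz @ c          # ||f||^2
norm2_f1 = beta @ K_xx @ beta   # ||f_1||^2
print(norm2_f1 <= norm2_f + 1e-12)     # True: Pythagoras, so the penalty only decreases
```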