Penalized form of the nonparametric least-squares estimate


Consider the class of twice continuously differentiable functions $f: [0,1]\to \mathbb{R}$, and for a given squared radius $R>0$, define the function class $$ \mathcal{F}(R):=\left\{f:[0,1]\to \mathbb{R}: \int_0^1(f''(x))^2dx\le R\right\}. $$ In this setting, the penalized form of the nonparametric least-squares estimate is given by $$ \hat{f}\in \arg\min_f \left\{\frac{1}{n} \sum_{i=1}^n (y_i-f(x_i))^2+\lambda_n\int_0^1(f''(x))^2dx\right\}, $$ where $\lambda_n>0$. The kernel is defined as $$ K(x,z)=\int_0^1(x-y)_{+}(z-y)_{+}dy. $$
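As a quick concreteness check (not part of the problem), the kernel integrates in closed form: only $y\in[0,\min(x,z)]$ contributes, and expanding the integrand gives $K(x,z)=\frac{\min(x,z)^2\max(x,z)}{2}-\frac{\min(x,z)^3}{6}$. A minimal sketch verifying this against the defining integral; the helper names are our own:

```python
def kernel(x, z):
    # Closed form of K(x, z) = ∫_0^1 (x - y)_+ (z - y)_+ dy: only
    # y in [0, min(x, z)] contributes, and expanding the product gives
    # K(x, z) = min^2 * max / 2 - min^3 / 6 (the cubic spline kernel).
    m, M = (x, z) if x <= z else (z, x)
    return m * m * M / 2.0 - m ** 3 / 6.0

def kernel_numeric(x, z, n=100000):
    # Midpoint-rule evaluation of the defining integral, for checking.
    h = 1.0 / n
    return h * sum(max(x - (i + 0.5) * h, 0.0) * max(z - (i + 0.5) * h, 0.0)
                   for i in range(n))
```

The closed form also makes the symmetry $K(x,z)=K(z,x)$ immediate.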

(1) Show that the optimal solution must take the form $$ \hat{f}(x)=\hat{\theta}_0+\hat{\theta}_1x+\frac{1}{\sqrt{n}}\sum_{i=1}^n\hat{\alpha}_iK(x,x_i)$$ for some vectors $\hat{\theta}\in \mathbb{R}^2$ and $\hat{\alpha}\in \mathbb{R}^n$.

(2) Show that these vectors can be obtained by solving the quadratic program $$ (\hat{\theta},\hat{\alpha})=\arg\min_{(\theta,\alpha)\in \mathbb{R}^2\times \mathbb{R}^n}\left\{\frac{1}{2n}\|y-X\theta-\sqrt{n}K\alpha\|_2^2+\lambda_n\alpha^TK\alpha\right\}, $$ where $K\in \mathbb{R}^{n\times n}$ is the kernel matrix induced by the kernel function above, and $X\in \mathbb{R}^{n\times 2}$ is the design matrix whose $i$-th row is $[1, x_i]$.
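For concreteness, the two matrices in the program can be assembled as below. The design points are hypothetical, and the normalization $K_{ij}=K(x_i,x_j)/n$ is an assumption on our part (it is one common convention for the empirical kernel matrix, chosen so that the $\sqrt{n}$ factors in the program cancel against the $1/\sqrt{n}$ in part (1)):

```python
import numpy as np

def kernel(x, z):
    # closed form of K(x, z) = ∫_0^1 (x - y)_+ (z - y)_+ dy
    m, M = (x, z) if x <= z else (z, x)
    return m * m * M / 2.0 - m ** 3 / 6.0

x = np.linspace(0.1, 0.9, 5)                  # hypothetical design points
# Kernel matrix; the /n normalization is an assumed convention (see above).
K = np.array([[kernel(a, b) for b in x] for a in x]) / len(x)
X = np.column_stack([np.ones(len(x)), x])     # i-th row is [1, x_i]
```

Since $K(x,z)$ is a positive semidefinite kernel, the matrix `K` is symmetric positive semidefinite, as the quadratic penalty $\alpha^TK\alpha$ requires.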


My work: Note that $K(x,z)$ is the reproducing kernel of $H^2[0,1]$, the collection of twice continuously differentiable functions $f$ with $f(0)=f'(0)=0$.

We have the inner product $$ \langle f,g\rangle=\int_0^1 f''(x)g''(x)dx. $$ Since every $f\in H^2$ satisfies $f(0)=f'(0)=0$, the only linear function in $H^2$ is the zero function. We can decompose any twice continuously differentiable $f$ into a linear function plus a function $f_0\in H^2$ with $$ \|f_0\|^2=\int_0^1(f''_0(x))^2dx. $$

If I let $J: f\mapsto \frac{1}{n} \sum_{i=1}^n (y_i-f(x_i))^2+\lambda_n\int_0^1(f''(x))^2dx$, what is the next step?

I tried to decompose $f$ into a linear function $f_1(x)=ax+b$ and a function $f_0\in H^2$, so that $$ f(x_i)=\langle f_1, K(\cdot, x_i)\rangle+\langle f_0, K(\cdot, x_i)\rangle. $$

Best answer:

Step 1, Sanity Check: Let $H^2$ be the Hilbert space of piecewise twice differentiable functions (piecewise, otherwise we have a problem) satisfying $f(0)=f'(0)=0$, with the inner product you defined. We first want to check that $K$ is indeed a well-defined reproducing kernel of $H^2$. For any $x_i$, consider the function $K(\cdot,x_i)$ defined on $[0,1]$ by

$$\begin{align}K(x,x_i) &= \int_0^1(x-t)_{+}(x_i-t)_{+}dt \\ &=\mathbf{1}_{x\le x_i}\cdot\int_0^{x}(x-t)(x_i-t)dt + \mathbf{1}_{x_i< x}\cdot\int_0^{x_i}(x-t)(x_i-t)dt\end{align} $$ (the second indicator is strict so that the two cases do not double-count the point $x=x_i$). $K(\cdot,x_i)$ is piecewise twice differentiable, and its (piecewise) first and second derivatives w.r.t. $x$ can be computed with the Leibniz rule, the boundary term vanishing because the integrand is zero at $t=x$: $$\begin{align}K'(x,x_i)&= \mathbf{1}_{x\le x_i}\cdot\int_0^{x}\frac{\partial}{\partial x}(x-t)(x_i-t)dt + \mathbf{1}_{x_i< x}\cdot\int_0^{x_i}\frac{\partial}{\partial x}(x-t)(x_i-t)dt \\ &=\mathbf{1}_{x\le x_i}\left(xx_i - x^2/2\right) + \mathbf{1}_{x_i< x}x_i^2/2\\ K''(x,x_i) &=\mathbf{1}_{x\le x_i}\left(x_i - x\right) \end{align} $$ With these computations out of the way, it only requires a little algebra to show that $K$ is indeed a well-defined kernel of $H^2$ (namely, that $K(\cdot,x)$ is an element of $H^2$ and that $K$ satisfies the reproducing property).
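For completeness, here is one way to spell out that little algebra for the reproducing property (this step is our own addition): since $K''(x,x_i)=(x_i-x)_{+}$ and any $f\in H^2$ satisfies $f(0)=f'(0)=0$, Taylor's theorem with integral remainder gives
$$\begin{align}
\langle f, K(\cdot,x_i)\rangle
  &= \int_0^1 f''(x)\,K''(x,x_i)\,dx
   = \int_0^{x_i} f''(x)\,(x_i-x)\,dx \\
  &= f(x_i) - f(0) - f'(0)\,x_i
   = f(x_i),
\end{align}$$
which is exactly the reproducing property; applied to $f=K(\cdot,x_j)$ it also gives $\langle K(\cdot,x_i),K(\cdot,x_j)\rangle = K(x_i,x_j)$.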


Step 2, Representer Theorem: Once step 1 is done, you fall back into the usual setting. Again, you should try to decompose $f$ into two parts, but rather consider the decomposition $f=f_s+f_\perp$, where $f_s\in\mathrm{span}\{x\mapsto 1,\,x\mapsto x,\,K(\cdot,x_1),\ldots,K(\cdot,x_n)\}$ and $f_\perp$ lies in the orthogonal complement of that span. First note that, thanks to step 1, the objective can be rewritten as $$\hat{f}\in \arg\min_f \underbrace{\left\{\frac{1}{n} \sum_{i=1}^n (y_i-f(x_i))^2+\lambda_n\|f\|_H^2\right\}}_{J(f)} $$ By using the reproducing property and the definitions of $f_s$ and $f_\perp$, you can show (exactly as in the standard representer-theorem argument) that $f(x_i) = f_s(x_i)$ for all $x_i$.
You also have by orthogonality that $\|f\|_H^2 = \|f_s\|_H^2 +\|f_\perp\|_H^2 \ge\|f_s\|_H^2 $.
Hence $$\begin{align}J(f_s)-J(f)&=\lambda_n\|f_s\|_H^2 -\lambda_n\|f\|_H^2\\ &= \lambda_n(\|f_s\|_H^2 -\|f\|_H^2) \le 0,\end{align} $$ which proves the desired result. (Remark: by the same reasoning, you can add any number of functions to the hypothesis space and still find optimal solutions of this form, as long as all the $K(\cdot,x_i)$ are included.)


Step 3, Quadratic Program Representation: This part is basically linear algebra (the optimization problem itself is a quadratic, not a linear, program). Thanks to step 2 above, and keeping the $1/\sqrt n$ scaling from part (1), write $f = \theta_0 + \theta_1(\cdot) + \frac{1}{\sqrt n}\sum_{i=1}^n\alpha_i K(\cdot,x_i)$. Then $$\begin{align}\|f\|_H^2 &= \left\langle\theta_0 + \theta_1(\cdot) + \frac{1}{\sqrt n}\sum_{i=1}^n\alpha_i K(\cdot,x_i),\;\theta_0 + \theta_1(\cdot) + \frac{1}{\sqrt n}\sum_{j=1}^n\alpha_j K(\cdot,x_j)\right\rangle\\ &=\frac{1}{n}\left\langle \sum_{i=1}^n\alpha_i K(\cdot,x_i), \sum_{j=1}^n\alpha_j K(\cdot,x_j)\right\rangle \text{ ($\theta_0 + \theta_1(\cdot)$ has zero second derivative, so it is $\perp$ to everyone) }\\ &=\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_j \langle K(\cdot,x_i),K(\cdot,x_j)\rangle\\ &=\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_j K(x_i,x_j) = \alpha^T K\alpha, \end{align} $$ where the kernel matrix is normalized as $K_{ij}=K(x_i,x_j)/n$; this normalization is what makes the $\sqrt n$ factors in the program come out right. Similarly, the first term of $J(f)$ is $1/n$ times the squared Euclidean norm of the vector whose components are $y_i - f(x_i)$. With the representation we have proven, $f(x_i) = [X\theta]_i + \frac{1}{\sqrt n}\sum_j \alpha_j K(x_i,x_j) = [X\theta + \sqrt n K\alpha]_i$, so that vector is exactly $y-X\theta-\sqrt n K\alpha$, which concludes the proof (the extra factor $1/2$ in the program amounts to a rescaling of $\lambda_n$ and does not affect the form of the solution).
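As a numerical sanity check, here is a minimal sketch of solving the quadratic program from part (2): the objective is a convex quadratic, so setting its gradients to zero gives a linear system that can be solved directly. The synthetic data, the value of $\lambda_n$, and the $K_{ij}=K(x_i,x_j)/n$ normalization of the kernel matrix are all assumptions made for this illustration:

```python
import numpy as np

def kernel(x, z):
    # closed form of K(x, z) = ∫_0^1 (x - y)_+ (z - y)_+ dy
    m, M = (x, z) if x <= z else (z, x)
    return m * m * M / 2.0 - m ** 3 / 6.0

rng = np.random.default_rng(0)
n, lam = 30, 0.05                                  # assumed sample size and lambda_n
x = np.sort(rng.uniform(0.0, 1.0, n))              # synthetic design points
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

# Kernel matrix with the assumed K_ij = K(x_i, x_j)/n normalization.
K = np.array([[kernel(a, b) for b in x] for a in x]) / n
X = np.column_stack([np.ones(n), x])               # rows [1, x_i]
I = np.eye(n)

# Setting the gradient of (1/2n)||y - X th - sqrt(n) K a||^2 + lam a^T K a
# w.r.t. a to zero is implied by (K + 2 lam I) a = (y - X th)/sqrt(n);
# w.r.t. th it gives the normal equations X^T (y - X th - sqrt(n) K a) = 0.
# Eliminating a yields a 2x2 system for th via the smoother matrix S.
S = K @ np.linalg.solve(K + 2 * lam * I, I)
theta = np.linalg.solve(X.T @ (I - S) @ X, X.T @ (I - S) @ y)
alpha = np.linalg.solve(K + 2 * lam * I, y - X @ theta) / np.sqrt(n)

# Residual and both gradients at the solution (they should vanish).
r = y - X @ theta - np.sqrt(n) * K @ alpha
grad_theta = -X.T @ r / n
grad_alpha = -K @ r / np.sqrt(n) + 2 * lam * K @ alpha
```

Because the objective is convex, the stationary point found this way is the global minimizer of the program.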