I am using penalized splines to smooth noisy data. These splines are nonparametric regression models that depend on a single smoothing parameter $\lambda \geq 0$, which has to be chosen.
I would like to find the "best" value of this parameter, one that describes the data well without overfitting or underfitting it.
Let $(x_i,y_i)_{i\in[1,n]}$ be a set of points such that $x_1 < x_2 <\dots < x_n$. Once $\lambda$ is chosen, the corresponding penalized spline curve $f$ is the one that minimizes $\displaystyle \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int_{x_1}^{x_n} [f''(x)]^2 \, dx$.
Basically $\lambda$ controls the trade-off between the goodness of fit and the smoothness of the fitted curve.
- if $\lambda = 0$, $f$ interpolates the data, hence it overfits;
- if $\lambda\to +\infty$, the integral has to become smaller and smaller: the resulting $f$ is the linear regression line, hence it underfits the data.
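
The two limiting cases above can be checked numerically. The following is only a minimal sketch: it assumes the fit is the natural cubic smoothing spline with knots at the $x_i$ (the minimizer of the criterion above), and uses the standard result that its fitted values are $(I + \lambda K)^{-1} y$, where $g^\top K g$ equals the roughness integral. The function names `penalty_matrix` and `smoothing_spline_values` are made up for this illustration.

```python
import numpy as np

def penalty_matrix(x):
    """Matrix K such that g @ K @ g is the integral of f''^2 for the
    natural cubic spline f taking the values g at the knots x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    h = np.diff(x)                      # knot spacings, must be > 0
    Q = np.zeros((n, n - 2))
    R = np.zeros((n - 2, n - 2))
    for j in range(n - 2):
        Q[j, j] = 1.0 / h[j]
        Q[j + 1, j] = -1.0 / h[j] - 1.0 / h[j + 1]
        Q[j + 2, j] = 1.0 / h[j + 1]
        R[j, j] = (h[j] + h[j + 1]) / 3.0
        if j + 1 < n - 2:
            R[j, j + 1] = R[j + 1, j] = h[j + 1] / 6.0
    return Q @ np.linalg.solve(R, Q.T)

def smoothing_spline_values(x, y, lam):
    """Fitted values f(x_i) minimizing RSS + lam * integral of f''^2."""
    K = penalty_matrix(x)
    return np.linalg.solve(np.eye(len(x)) + lam * K, np.asarray(y, dtype=float))
```

With `lam = 0` the fitted values reproduce `y` exactly (interpolation), and for a very large `lam` they approach the ordinary least-squares line, because $K$ annihilates linear functions.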
I would like a way to find the optimal value of $\lambda$, but I cannot use a $\chi^2$ test: minimizing the residuals alone would always give $\lambda = 0$.
I found a criterion called the cross-validated residual sum of squares (CVRSS): it is minimized over $\lambda$, and the minimizer is taken as the best value (the method is described in Annex B of the following document: http://curis.ku.dk/ws/files/164301283/thesis_helene_rytgaard.pdf). However, in my case the resulting curve $f$ still overfits the data too much.
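
For concreteness, here is a rough sketch of how such a criterion can be computed. It assumes that CVRSS means the leave-one-out sum of squared prediction errors (the annex may define it slightly differently) and that the fit is the natural cubic smoothing spline with hat matrix $A(\lambda) = (I + \lambda K)^{-1}$. For a linear smoother of this kind, the leave-one-out residual at $x_i$ equals $(y_i - f(x_i)) / (1 - A_{ii})$, so no refitting is needed. The names `hat_matrix`, `cvrss` and `best_lambda` are hypothetical.

```python
import numpy as np

def hat_matrix(x, lam):
    """A(lam) such that A @ y gives the smoothing-spline fitted values."""
    x = np.asarray(x, dtype=float)
    n = x.size
    h = np.diff(x)
    Q = np.zeros((n, n - 2))
    R = np.zeros((n - 2, n - 2))
    for j in range(n - 2):
        Q[j:j + 3, j] = [1 / h[j], -1 / h[j] - 1 / h[j + 1], 1 / h[j + 1]]
        R[j, j] = (h[j] + h[j + 1]) / 3
        if j < n - 3:
            R[j, j + 1] = R[j + 1, j] = h[j + 1] / 6
    K = Q @ np.linalg.solve(R, Q.T)     # roughness penalty matrix
    return np.linalg.inv(np.eye(n) + lam * K)

def cvrss(x, y, lam):
    """Leave-one-out residual sum of squares via the hat-matrix shortcut.
    Requires lam > 0: at lam = 0 the diagonal of A is 1 and the ratio is 0/0."""
    y = np.asarray(y, dtype=float)
    A = hat_matrix(x, lam)
    loo_resid = (y - A @ y) / (1.0 - np.diag(A))
    return float(np.sum(loo_resid ** 2))

def best_lambda(x, y, lambdas):
    """Grid search: return the lambda with the smallest CVRSS and all scores."""
    scores = np.array([cvrss(x, y, lam) for lam in lambdas])
    return lambdas[int(np.argmin(scores))], scores
```

A typical call would be `best_lambda(x, y, np.logspace(-8, 2, 50))`. A logarithmic grid is used because the useful range of $\lambda$ depends strongly on the scale of the $x_i$.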
Many thanks in advance to anyone who can suggest a solution to this problem.