Hyper-parameter optimization for regression to avoid overfitting/underfitting


I am using penalized splines to smooth noisy data. These splines are nonparametric regression models which rely on a single smoothing parameter $\lambda \geq 0$ (which has to be chosen).

I would like to find the "best" parameter, i.e. one that describes the data well without overfitting or underfitting it.

Let $(x_i,y_i)_{i\in[1,n]}$ be a set of points such that $x_1 < x_2 <\dots < x_n$. Once $\lambda$ is chosen, the fitted penalized-spline curve $f$ is the minimizer of $\displaystyle \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int_{x_1}^{x_n} [f''(x)]^2 \, dx$
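For concreteness, this objective can be minimized directly with SciPy's `make_smoothing_spline`, which fits a cubic smoothing spline penalized by exactly this curvature integral (the data below are synthetic, just for illustration):

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

# Synthetic noisy samples of a smooth function
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 50))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# make_smoothing_spline minimizes
#   sum_i (y_i - f(x_i))^2 + lam * int f''(x)^2 dx
# i.e. the penalized-spline objective above, with lam playing the role of lambda.
spline = make_smoothing_spline(x, y, lam=1.0)

y_hat = spline(x)                    # fitted values at the data points
rss = np.sum((y - y_hat) ** 2)       # goodness-of-fit term of the objective
```

Varying `lam` here directly traces out the fit/smoothness trade-off discussed below.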

Basically $\lambda$ controls the trade-off between the goodness of fit and the smoothness of the fitted curve.

  • if $\lambda = 0$, $f$ interpolates the data, hence overfits;
  • if $\lambda \to +\infty$, the curvature penalty forces $\int_{x_1}^{x_n} [f''(x)]^2 \, dx \to 0$: $f$ tends to the least-squares straight line, hence underfits the data.
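Both extremes can be checked numerically. The sketch below (again using SciPy's `make_smoothing_spline` on synthetic data) verifies that $\lambda = 0$ reproduces the data exactly while a very large $\lambda$ yields an almost-linear curve:

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 40))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# lambda = 0: no penalty, so the spline interpolates the noisy data (overfit)
f_interp = make_smoothing_spline(x, y, lam=0.0)
max_resid_interp = np.max(np.abs(f_interp(x) - y))

# very large lambda: the curvature penalty dominates, so f is nearly a
# straight line (underfit); its second derivative is close to zero everywhere
f_flat = make_smoothing_spline(x, y, lam=1e8)
max_curvature = np.max(np.abs(f_flat.derivative(2)(x)))
```

`max_resid_interp` is at machine-precision level, and `max_curvature` is essentially zero, matching the two bullet points above.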

I would like to find a way to choose the optimal value of $\lambda$, but I cannot use a $\chi^2$ test: minimizing the residual sum of squares alone would always select $\lambda = 0$.

I found a criterion to minimize, the cross-validated residual sum of squares (CVRSS), whose minimizer is taken as the best value of $\lambda$ (the method is described in Appendix B of the following document: http://curis.ku.dk/ws/files/164301283/thesis_helene_rytgaard.pdf). However, in my case the corresponding curve $f$ still overfits the data too much.
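For reference, here is a brute-force sketch of the CVRSS criterion as I understand it (leave-one-out refits via SciPy's `make_smoothing_spline`; in practice the hat-matrix identity avoids refitting, but the brute-force version is clearer):

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

# Synthetic data just to exercise the criterion
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 40))
y = x ** 2 + rng.normal(scale=0.05, size=x.size)

def cvrss(x, y, lam):
    """Leave-one-out cross-validated residual sum of squares:
    sum_i (y_i - f^{(-i)}(x_i))^2, where f^{(-i)} is fit without point i."""
    total = 0.0
    for i in range(x.size):
        mask = np.arange(x.size) != i
        f_minus_i = make_smoothing_spline(x[mask], y[mask], lam=lam)
        total += (y[i] - f_minus_i(x[i])) ** 2
    return total

# Grid search over lambda on a log scale; the minimizer is the CVRSS choice
lambdas = np.logspace(-6, 2, 20)
scores = np.array([cvrss(x, y, lam) for lam in lambdas])
best_lam = lambdas[int(np.argmin(scores))]
```

This is the selection rule that, in my case, still returns a $\lambda$ small enough to overfit.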

Many thanks in advance for any solution you can propose to this problem.