Approximating a large number of data points using (cubic) splines in the $\ell_1$/$\ell_2$ norm.


I have a pretty large dataset ($x, y$) consisting of a few million points, with a lot of noise in the data. I want to find a smooth but simple approximation/representation of this dataset, so that for a given value of $x$ I have a decent estimate of $y$. I expect the data to have some kind of cubic/non-linear relationship (hence simple linear regression won't work, and since I want a simple representation, kernel methods are probably ruled out). I want to try doing this using cubic splines with a few knots (say $5$).

In other words, I want to find the cubic spline approximation with a specified number of knots that minimizes the $\ell_1$ or $\ell_2$ norm of the residuals with respect to the input data points. Do you have any ideas about how to do this? Any pointers would be very useful. I found a couple of pointers online for approximating with a cubic spline, but none of them seemed to scale to the amount of data I have.

Thanks a lot! Gaurav

PS: Both $x$ and $y$ are scalars, i.e. I have a simple 2D dataset, and am just trying to learn a simple representation of the non-linear relationship between $x$ and $y$.

1 Answer

If you're using Python, scipy.interpolate.LSQUnivariateSpline can easily handle 1M points:

""" LSQUnivariateSpline( x, y, user knots ) """
# http://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.LSQUnivariateSpline.html
# cf https://github.com/scipy/scipy/issues/2611  many knots -> all NaN

import sys
from time import time
import numpy as np
import matplotlib.pyplot as pl
from scipy.interpolate import LSQUnivariateSpline

__date__ = "2013-08-02 aug denis"

# LSQUnivariateSpline on nx points, interpolate at nsample --
nx = 1e6
nsample = 1000
ncycle = 1
noise = .5
nknot = 5
plot = 0
seed = 0

exec( "\n".join( sys.argv[1:] ))  # run this.py nx= ...  from sh or ipython
np.set_printoptions( 1, threshold=100, edgeitems=10, suppress=True )
np.random.seed(seed)
nx = int(nx)

title = "scipy.interpolate.LSQUnivariateSpline  nx %d  noise %.2g  %d knots" % (
    nx, noise, nknot )
print(80 * "-")
print(title)

def func( x ):
    return np.sin( 2*np.pi * ncycle * x )

x = np.sort( np.random.uniform( size=nx ))
xs = np.linspace( 0, 1, nsample )
y = func(x) + noise * np.random.randn(nx)

#...............................................................................
knots = np.linspace( xs[1], xs[-2], nknot )
    # knots = np.linspace( xs[0], xs[-1], nknot )
    # ValueError: Interior knots t must satisfy Schoenberg-Whitney conditions
t0 = time()

spline = LSQUnivariateSpline( x, y, knots, k=3 )
ys = spline(xs)  # interpolate at xs
print("%.2g sec  setup + interpolate %d points" % (time() - t0, nsample))
print("y interpolated:", ys)
yexact = func(xs)
averr = np.fabs( ys - yexact ).mean()
print("av |yspline - yexact| %.2g" % averr)

if plot:
    pl.title( title )
    pl.plot( ys, label="spline" )
    pl.plot( yexact, label="exact" )
    # pl.plot( ys - func(xs), label="diff" )
    pl.legend()
    pl.show()

Without Python (what language are you using?), start with knots [0 .2 .4 .6 .8 1] and fit 5 separate cubics, one per interval -- a least-squares fit on the basis [1 x x^2 x^3].
If the pieces nearly agree at the ends .2 .4 .6 .8, you're done.
If not, set up a single global least-squares problem with 5 * 4 = 20 columns in all, plus dummy points at and near the knots with huge weights (say 1000000) to force continuity and smoothness there.
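A minimal NumPy sketch of that global least-squares setup (function names here are mine, and the big weight w plays the role of the heavily weighted dummy points; continuity of value and first derivative is enforced at each interior knot):

```python
import numpy as np

def piecewise_cubic_lstsq(x, y, knots, w=1e6):
    """Fit one cubic per knot interval in a single global least-squares solve.
    Continuity of value and slope at interior knots is enforced softly by
    heavily weighted constraint rows (the dummy-point trick)."""
    knots = np.asarray(knots, dtype=float)
    npiece = len(knots) - 1
    ncoef = 4 * npiece
    # which piece each data point falls in
    piece = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, npiece - 1)
    rows, rhs = [], []
    for i in range(len(x)):          # loop kept for clarity; vectorize for millions of points
        row = np.zeros(ncoef)
        row[4*piece[i] : 4*piece[i]+4] = [1, x[i], x[i]**2, x[i]**3]
        rows.append(row); rhs.append(y[i])
    for j in range(1, npiece):       # constraint rows at each interior knot
        t = knots[j]
        val = np.array([1, t, t**2, t**3])        # value continuity
        der = np.array([0, 1, 2*t, 3*t**2])       # slope continuity
        for c in (val, der):
            row = np.zeros(ncoef)
            row[4*(j-1) : 4*(j-1)+4] = w * c
            row[4*j : 4*j+4] = -w * c
            rows.append(row); rhs.append(0.0)
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return coef.reshape(npiece, 4), knots

def evaluate(coef, knots, xq):
    """Evaluate the piecewise cubic at query points xq."""
    piece = np.clip(np.searchsorted(knots, xq, side="right") - 1, 0, len(coef) - 1)
    X = np.stack([np.ones_like(xq), xq, xq**2, xq**3], axis=1)
    return np.einsum("ij,ij->i", X, coef[piece])
```

Because everything lands in one tall matrix, the normal equations stay small (20 x 20 here) no matter how many data points you have, which is what makes this scale.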

If the points x are very non-uniformly spaced, use non-uniform intervals, e.g.
np.percentile( x, [0,20,40,60,80,100] ) .
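For example (a sketch, using LSQUnivariateSpline as above -- note it takes *interior* knots only, so the 0th and 100th percentiles are left out):

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.lognormal(size=100_000))            # strongly non-uniform x
y = np.log1p(x) + 0.1 * rng.standard_normal(x.size)

# knots at the data's quantiles, so each interval holds ~20% of the points;
# LSQUnivariateSpline wants interior knots only -- leave out the endpoints
knots = np.percentile(x, [20, 40, 60, 80])
spline = LSQUnivariateSpline(x, y, knots, k=3)
```

Quantile knots keep each interval well populated, which also helps satisfy the Schoenberg-Whitney conditions mentioned in the comments above.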

If you really want L1 approximation, I believe a cheap near-L1 fit can be had by iteratively reweighted least squares (a few weighted L2 passes); ask a new question on that.

Hope this helps. You might ask on stackoverflow, see
https://stackoverflow.com/search?q=[spline]+knots ,
and of course google "spline (intro|tutorial) user-knots" .