I have a pretty large dataset ($x,y$) consisting of a few million points. There is a lot of noise in the data. I want to find a smooth but simple approximation/representation for this dataset, so that for a given value of $x$ I have a decent estimate of $y$. I expect the data to have some kind of cubic/non-linear relationship (hence simple linear regression won't work, and since I want a simple representation, kernel methods are probably ruled out). I want to try doing this using cubic splines with a few knots (say $5$).
In other words, I want to find the cubic spline approximation with a specified number of knots that minimizes the $L_1$ or $L_2$ norm of the residuals at the input data points. Do you have any ideas about how to do this? Any pointers will be pretty useful. I found a couple of pointers online for approximating with a cubic spline, but none of them seemed scalable to the amount of data I have.
Thanks a lot! Gaurav
PS: Both $x$ and $y$ are scalars, i.e., I have a simple 2D dataset, and am just trying to learn a simple representation of the non-linear relationship between $x$ and $y$.
If you're using Python, scipy.interpolate.LSQUnivariateSpline can easily handle 1M points.
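For example, a minimal sketch (the synthetic data and the knot positions here are my own illustrative choices, not something special):

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

# Synthetic stand-in for the real data: a noisy cubic trend.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 1_000_000))   # x must be increasing
y = x**3 - 0.5 * x + rng.normal(scale=0.1, size=x.size)

# Pass interior knots only; the boundary knots are added automatically.
interior_knots = [0.2, 0.4, 0.6, 0.8]
spline = LSQUnivariateSpline(x, y, interior_knots, k=3)  # k=3: cubic, least-squares fit

y_at_half = spline(0.5)   # smooth estimate of y at any x
```

The fit is a single linear least-squares solve in the B-spline basis, so a few million points is no problem.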
Without Python (what language are you using?), start with knots [0 .2 .4 .6 .8 1] and fit 5 separate cubics, one per interval -- an ordinary least-squares fit on the basis [1 x x^2 x^3].
If the neighbouring fits are close enough at the interior knots .2 .4 .6 .8, you're done.
If not, set up a global least-squares problem with 5 * 4 = 20 columns in all, plus dummy points at and near the knots with weights of 1000000 to force continuity and smoothness there.
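A sketch of that global least-squares step (the function names and the way I build the weighted dummy rows are my own; for brevity it enforces continuity of value and first derivative only):

```python
import numpy as np

def piecewise_cubic_fit(x, y, knots, w_cont=1e6):
    """Fit one cubic per knot interval (4 coefficients each, so 5 * 4 = 20
    columns for 6 knots), with heavily weighted dummy rows forcing equal
    value and slope across each interior knot."""
    knots = np.asarray(knots, float)
    k = len(knots) - 1                                    # number of pieces
    idx = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, k - 1)
    # Data rows: each point fills the [1, x, x^2, x^3] block of its piece.
    A = np.zeros((len(x), 4 * k))
    for j in range(k):
        m = idx == j
        A[m, 4*j:4*j + 4] = np.column_stack(
            [np.ones(m.sum()), x[m], x[m]**2, x[m]**3])
    b = np.asarray(y, float)
    # Dummy rows: (value of piece j) - (value of piece j+1) = 0 at each
    # interior knot, and the same for the first derivative.
    rows, rhs = [], []
    for j, t in enumerate(knots[1:-1]):
        for bas in ([1.0, t, t*t, t**3], [0.0, 1.0, 2*t, 3*t*t]):
            row = np.zeros(4 * k)
            row[4*j:4*j + 4] = np.multiply(w_cont, bas)
            row[4*(j+1):4*(j+1) + 4] = np.multiply(-w_cont, bas)
            rows.append(row)
            rhs.append(0.0)
    A = np.vstack([A, np.asarray(rows)])
    b = np.concatenate([b, np.asarray(rhs)])
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef.reshape(k, 4)

def piecewise_cubic_eval(coefs, knots, xq):
    """Evaluate the fitted piecewise cubic at query points xq."""
    knots = np.asarray(knots, float)
    xq = np.asarray(xq, float)
    k = len(knots) - 1
    idx = np.clip(np.searchsorted(knots, xq, side="right") - 1, 0, k - 1)
    return sum(coefs[idx, p] * xq**p for p in range(4))
```

The same matrix layout works in any language with a least-squares solver; only the `lstsq` call changes.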
If the points $x$ are very non-uniformly spaced, use non-uniform intervals, e.g. np.percentile( x, [0, 20, 40, 60, 80, 100] ).
If you really want L1 approximation, I believe a cheap near-L1 fit can be done by iterative L2 with weighting; ask a new question on that.
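For what it's worth, a minimal sketch of that iterative-reweighting idea (often called IRLS; `irls_l1` and the parameter choices are my own, not from any library):

```python
import numpy as np

def irls_l1(A, b, n_iter=20, eps=1e-6):
    """Near-L1 fit by iteratively reweighted least squares: repeatedly solve
    a weighted L2 problem with row weights 1/|residual|, so that the weighted
    squared residuals approximate absolute residuals. eps guards against
    division by zero on (near-)exact fits."""
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)          # plain L2 start
    for _ in range(n_iter):
        r = np.abs(b - A @ coef)
        w = 1.0 / np.sqrt(np.maximum(r, eps))             # sqrt: weights multiply rows
        coef, *_ = np.linalg.lstsq(A * w[:, None], b * w, rcond=None)
    return coef
```

Here `A` would be the same design matrix as in the spline fit; the main practical benefit over plain L2 is robustness to outliers.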
Hope this helps. You might also ask on stackoverflow, see
https://stackoverflow.com/search?q=[spline]+knots,
and of course google "spline (intro|tutorial) user-knots".