I have just finished writing code that performs polynomial regression, computing $(X'X)^{-1}X'y$ (where $X'$ is the transpose) to estimate the vector of coefficients.
Now I'd like to add some check procedures to assert that everything is correct and that the regression model can be used with confidence. From Wikipedia I know that "This is the unique least squares solution as long as $X$ has linearly independent columns. Since $X$ is a Vandermonde matrix, this is guaranteed to hold provided that at least $m + 1$ of the $x_i$ are distinct (for which $m < n$ is a necessary condition)." So I guess that a good first step would be to check whether $m$ is indeed less than $n$... I could also tell the user that the degree of the regression shouldn't be too high, to avoid overfitting the data.
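Something like the following sketch is what I have in mind for that first check (NumPy assumed; `can_fit` is just a name I made up):

```python
import numpy as np

def can_fit(x, m):
    """Return True if a degree-m polynomial fit is well posed:
    the Vandermonde matrix has m+1 columns, which are linearly
    independent iff at least m+1 of the x_i are distinct
    (so in particular m < n is required)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    n_distinct = len(np.unique(x))
    return m < n and n_distinct >= m + 1

# A degree-2 fit needs at least 3 distinct abscissae:
print(can_fit([0.0, 1.0, 2.0, 2.0], 2))  # True (3 distinct values)
print(can_fit([0.0, 0.0, 1.0, 1.0], 2))  # False (only 2 distinct)
```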
The thing is, everything should be hidden from the user... I can't ask them to perform cross-validation. On the other hand, I could write a little leave-one-out cross-validation routine that runs on the whole training data every time a new model is created.
Any thoughts or suggestions?
Thanks
1- Let the matrices $X_{train}$ and $Y_{train}$ be your training data, and $X_{tune}$ and $Y_{tune}$ be the held-out data that is not included in the training data.
Solve $\hat{\theta}=(X_{train}^TX_{train})^{\dagger}X_{train}^TY_{train}$
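As a sketch of this solve step (NumPy assumed, with synthetic data; in practice `np.linalg.lstsq` is the numerically safer route, but this follows the formula above literally):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 50)
# True model: y = 1 + 2x - 3x^2, plus a little noise
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0, 0.01, 50)

# Vandermonde matrix with columns [1, x, x^2]
X_train = np.vander(x_train, 3, increasing=True)

# theta_hat = (X'X)^+ X'y, computed via the Moore-Penrose pseudoinverse
theta_hat = np.linalg.pinv(X_train.T @ X_train) @ X_train.T @ y_train
print(theta_hat)  # close to [1, 2, -3]
```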
Now your regression model is $Y=X\hat{\theta}$.
Use this regression model to estimate the outputs for the held-out data:
$\hat{Y}_{tune}=X_{tune}\hat{\theta}$
Afterwards, calculate $corr(\hat{Y}_{tune},Y_{tune})$, the correlation coefficient between $\hat{Y}_{tune}$ and $Y_{tune}$. If the correlation is greater than some threshold, then you have found a good fit (a threshold of $0.8$ may be a good guess).
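Here is a sketch of that acceptance test (NumPy assumed; `passes_fit_check` and the $0.8$ default are my own choices):

```python
import numpy as np

def passes_fit_check(theta_hat, X_tune, y_tune, threshold=0.8):
    """Predict on the held-out data and accept the model if the Pearson
    correlation between predictions and targets exceeds the threshold."""
    y_pred = X_tune @ theta_hat
    r = np.corrcoef(y_pred, y_tune)[0, 1]
    return r > threshold

# Synthetic held-out data from a known quadratic, with small noise:
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 40)
X_tune = np.vander(x, 3, increasing=True)
theta_true = np.array([0.5, -1.0, 2.0])
y_tune = X_tune @ theta_true + rng.normal(0, 0.05, 40)
print(passes_fit_check(theta_true, X_tune, y_tune))  # True: near-perfect model
```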
2- A question that you might have is about the best number of polynomial terms to use when forming your input Vandermonde matrix. Here is my suggestion: let $k$ be the number of columns in $X$. Start from $k=1$ and record the resulting correlation, then keep increasing $k$ up to some maximum integer (say $k=1000$, though it really depends on the complexity of the underlying function that generates your data). Finally, choose the $k$ that yields the maximum correlation.
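That sweep could look like this (NumPy assumed; `best_k` is a made-up name, and I use a small `k_max` for the demo):

```python
import numpy as np

def best_k(x_train, y_train, x_tune, y_tune, k_max=20):
    """Try Vandermonde matrices with k = 1 .. k_max columns and return
    the k whose held-out correlation is largest."""
    best, best_r = 1, -np.inf
    for k in range(1, k_max + 1):
        X_tr = np.vander(x_train, k, increasing=True)
        X_tu = np.vander(x_tune, k, increasing=True)
        theta = np.linalg.pinv(X_tr.T @ X_tr) @ X_tr.T @ y_train
        r = np.corrcoef(X_tu @ theta, y_tune)[0, 1]
        # k = 1 gives constant predictions, for which corrcoef is NaN
        if np.isfinite(r) and r > best_r:
            best, best_r = k, r
    return best

# Cubic ground truth, so k = 4 columns (1, x, x^2, x^3) should win or tie:
rng = np.random.default_rng(2)
x_tr, x_tu = rng.uniform(-1, 1, 60), rng.uniform(-1, 1, 30)
f = lambda x: 1 - 2 * x + 0.5 * x**3
y_tr = f(x_tr) + rng.normal(0, 0.02, 60)
y_tu = f(x_tu) + rng.normal(0, 0.02, 30)
print(best_k(x_tr, y_tr, x_tu, y_tu, k_max=10))
```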
3- By $(X_{train}^TX_{train})^{\dagger}$, I mean the pseudo-inverse of $(X_{train}^TX_{train})$. Note that $(X_{train}^TX_{train})$ is always positive semi-definite (PSD), but not necessarily positive definite (PD). So if its determinant is zero you can either use the Moore–Penrose pseudoinverse, or use Tikhonov regularization, which is the following:
$(X_{train}^TX_{train})^{\dagger}\approx (X_{train}^TX_{train}+\epsilon I)^{-1}$.
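A minimal sketch of the regularized solve (NumPy assumed; `ridge_fit` is my own name, and the demo deliberately duplicates a column so that $X^TX$ is singular):

```python
import numpy as np

def ridge_fit(X, y, eps=1e-6):
    """Tikhonov-regularized normal equations:
    theta = (X'X + eps*I)^{-1} X'y.  The eps*I term makes the system
    invertible even when X'X is singular (PSD but not PD)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + eps * np.eye(k), X.T @ y)

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones(4), x, x])  # duplicated column: X'X is singular
y = np.array([1.0, 3.0, 5.0, 7.0])      # exactly y = 1 + 2x
theta = ridge_fit(X, y)
print(X @ theta)  # close to [1, 3, 5, 7] despite the singular X'X
```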
You can also tune $\epsilon$ by tuning on the held-out data.
4- For choosing either $k$ or $\epsilon$, if your sample size is small, I recommend using cross-validation, rather than only working on one held-out dataset.
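Since you mentioned leave-one-out cross-validation, here is one way it could be sketched (NumPy assumed; `loo_cv_error` is a hypothetical helper, and I score by mean squared prediction error rather than correlation):

```python
import numpy as np

def loo_cv_error(x, y, k, eps=1e-8):
    """Leave-one-out CV: fit on all points but one, predict the held-out
    point, and return the mean squared prediction error.  You would pick
    the k (or eps) that minimizes this."""
    n = len(x)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        X_tr = np.vander(x[mask], k, increasing=True)
        theta = np.linalg.solve(X_tr.T @ X_tr + eps * np.eye(k),
                                X_tr.T @ y[mask])
        y_hat = np.vander(x[i:i + 1], k, increasing=True) @ theta
        errs.append((y[i] - y_hat[0]) ** 2)
    return float(np.mean(errs))

# Quadratic data: k = 3 should beat the constant model k = 1 by far
x = np.linspace(-1.0, 1.0, 25)
y = 2.0 - x + 3.0 * x**2
print(loo_cv_error(x, y, 3), loo_cv_error(x, y, 1))
```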
Hope this helps.