I have just finished writing code that performs polynomial regression, computing $(X'X)^{-1}X'y$ (where $X'$ is the transpose) to estimate the vector of coefficients.
Now I'd like to add some check procedures to assert that everything is correct and that the regression model can be used with confidence. From Wikipedia I know that "This is the unique least squares solution as long as $X$ has linearly independent columns. Since $X$ is a Vandermonde matrix, this is guaranteed to hold provided that at least $m + 1$ of the $x_i$ are distinct (for which $m < n$ is a necessary condition)." So I guess that a good first step would be to check whether $m$ is indeed less than $n$... I could also tell the user that the degree of the regression shouldn't be too high, to avoid overfitting the data.
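Something like the following sketch is what I have in mind for that first check (NumPy assumed; `can_fit` is just a name I made up):

```python
import numpy as np

def can_fit(x, m):
    """Return True if a degree-m polynomial fit is well posed:
    the Vandermonde matrix has m+1 columns, which are linearly
    independent iff at least m+1 of the x_i are distinct
    (so in particular m < n is required)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    n_distinct = len(np.unique(x))
    return m < n and n_distinct >= m + 1

# A degree-2 fit needs at least 3 distinct abscissae:
print(can_fit([0.0, 1.0, 2.0, 2.0], 2))  # True (3 distinct values)
print(can_fit([0.0, 0.0, 1.0, 1.0], 2))  # False (only 2 distinct)
```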
The thing is, everything should be hidden from the user... I can't ask them to perform cross-validation. On the other hand, I could write a little leave-one-out cross-validation routine that runs on the whole training data every time a new model is created.
Any thoughts or suggestions?
Thanks
1- Let the matrices $X_{train}$ and $Y_{train}$ be your training data, and $X_{tune}$ and $Y_{tune}$ be the held-out data that is not included in the training data.
Solve $\hat{\theta}=(X_{train}^TX_{train})^{\dagger}X_{train}^TY_{train}$
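As a sketch of this solve step (NumPy assumed, with synthetic data; in practice `np.linalg.lstsq` is the numerically safer route, but this follows the formula above literally):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 50)
# True model: y = 1 + 2x - 3x^2, plus a little noise
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0, 0.01, 50)

# Vandermonde matrix with columns [1, x, x^2]
X_train = np.vander(x_train, 3, increasing=True)

# theta_hat = (X'X)^+ X'y, computed via the Moore-Penrose pseudoinverse
theta_hat = np.linalg.pinv(X_train.T @ X_train) @ X_train.T @ y_train
print(theta_hat)  # close to [1, 2, -3]
```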
Now your regression model is $Y=X\hat{\theta}$.
Use this regression model to estimate the outputs for the held-out data:
$\hat{Y}_{tune}=X_{tune}\hat{\theta}$
Afterwards, calculate $corr(\hat{Y}_{tune},Y_{tune})$, the correlation coefficient between $\hat{Y}_{tune}$ and $Y_{tune}$. If the correlation is greater than some threshold, then you have found a good fit (a threshold of $0.8$ may be a good guess).
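Here is a sketch of that acceptance test (NumPy assumed; `passes_fit_check` and the $0.8$ default are my own choices):

```python
import numpy as np

def passes_fit_check(theta_hat, X_tune, y_tune, threshold=0.8):
    """Predict on the held-out data and accept the model if the Pearson
    correlation between predictions and targets exceeds the threshold."""
    y_pred = X_tune @ theta_hat
    r = np.corrcoef(y_pred, y_tune)[0, 1]
    return r > threshold

# Synthetic held-out data from a known quadratic, with small noise:
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 40)
X_tune = np.vander(x, 3, increasing=True)
theta_true = np.array([0.5, -1.0, 2.0])
y_tune = X_tune @ theta_true + rng.normal(0, 0.05, 40)
print(passes_fit_check(theta_true, X_tune, y_tune))  # True: near-perfect model
```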
2- A question that you might have is about the best number of polynomial terms to use when forming your input Vandermonde matrix. Here is my suggestion: let $k$ be the number of columns in $X$. Start from $k=1$ and record the resulting correlation, then keep increasing $k$ up to some maximum integer (say $k=1000$, though it really depends on the complexity of the underlying function that generates your data). Finally, choose the $k$ that yields the maximum correlation.
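That sweep could look like this (NumPy assumed; `best_k` is a made-up name, and I use a small `k_max` for the demo):

```python
import numpy as np

def best_k(x_train, y_train, x_tune, y_tune, k_max=20):
    """Try Vandermonde matrices with k = 1 .. k_max columns and return
    the k whose held-out correlation is largest."""
    best, best_r = 1, -np.inf
    for k in range(1, k_max + 1):
        X_tr = np.vander(x_train, k, increasing=True)
        X_tu = np.vander(x_tune, k, increasing=True)
        theta = np.linalg.pinv(X_tr.T @ X_tr) @ X_tr.T @ y_train
        r = np.corrcoef(X_tu @ theta, y_tune)[0, 1]
        # k = 1 gives constant predictions, for which corrcoef is NaN
        if np.isfinite(r) and r > best_r:
            best, best_r = k, r
    return best

# Cubic ground truth, so k = 4 columns (1, x, x^2, x^3) should win or tie:
rng = np.random.default_rng(2)
x_tr, x_tu = rng.uniform(-1, 1, 60), rng.uniform(-1, 1, 30)
f = lambda x: 1 - 2 * x + 0.5 * x**3
y_tr = f(x_tr) + rng.normal(0, 0.02, 60)
y_tu = f(x_tu) + rng.normal(0, 0.02, 30)
print(best_k(x_tr, y_tr, x_tu, y_tu, k_max=10))
```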
3- By $(X_{train}^TX_{train})^{\dagger}$, I mean the pseudo-inverse of $(X_{train}^TX_{train})$. Note that $(X_{train}^TX_{train})$ is always positive semi-definite (PSD), but not necessarily positive definite (PD). So if its determinant is zero you can either use the Moore–Penrose pseudoinverse, or use Tikhonov regularization, which is the following:
$(X_{train}^TX_{train})^{\dagger}\approx (X_{train}^TX_{train}+\epsilon I)^{-1}$.
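A minimal sketch of the regularized solve (NumPy assumed; `ridge_fit` is my own name, and the demo deliberately duplicates a column so that $X^TX$ is singular):

```python
import numpy as np

def ridge_fit(X, y, eps=1e-6):
    """Tikhonov-regularized normal equations:
    theta = (X'X + eps*I)^{-1} X'y.  The eps*I term makes the system
    invertible even when X'X is singular (PSD but not PD)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + eps * np.eye(k), X.T @ y)

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones(4), x, x])  # duplicated column: X'X is singular
y = np.array([1.0, 3.0, 5.0, 7.0])      # exactly y = 1 + 2x
theta = ridge_fit(X, y)
print(X @ theta)  # close to [1, 3, 5, 7] despite the singular X'X
```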
You can also tune $\epsilon$ by tuning on the held-out data.
4- For choosing either $k$ or $\epsilon$, if your sample size is small, I recommend using cross-validation, rather than only working on one held-out dataset.
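Since you mentioned leave-one-out cross-validation, here is one way it could be sketched (NumPy assumed; `loo_cv_error` is a hypothetical helper, and I score by mean squared prediction error rather than correlation):

```python
import numpy as np

def loo_cv_error(x, y, k, eps=1e-8):
    """Leave-one-out CV: fit on all points but one, predict the held-out
    point, and return the mean squared prediction error.  You would pick
    the k (or eps) that minimizes this."""
    n = len(x)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        X_tr = np.vander(x[mask], k, increasing=True)
        theta = np.linalg.solve(X_tr.T @ X_tr + eps * np.eye(k),
                                X_tr.T @ y[mask])
        y_hat = np.vander(x[i:i + 1], k, increasing=True) @ theta
        errs.append((y[i] - y_hat[0]) ** 2)
    return float(np.mean(errs))

# Quadratic data: k = 3 should beat the constant model k = 1 by far
x = np.linspace(-1.0, 1.0, 25)
y = 2.0 - x + 3.0 * x**2
print(loo_cv_error(x, y, 3), loo_cv_error(x, y, 1))
```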
Hope this helps.