I'm aware this phenomenon is usually called overfitting, but since any polynomial can be expressed as a polynomial of higher order by setting the higher-order coefficients to zero, increasing the order of polynomial regression should simply add "more" admissible regression functions to the model. How can the regression error sometimes get worse if the model could simply reproduce any best lower-order fit, with its smaller regression error?
Why does polynomial regression yield worse results if you increase the order by too much?
Asked by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail). 127 views. There are 3 solutions below.
What happens is that the equation system becomes under-determined: we don't have enough information to pin down a unique solution.
For example, a polynomial of order $k$ has $k+1$ degrees of freedom (its coefficients). If we have fewer than $k+1$ data points, the equation system becomes under-determined and there are infinitely many solutions.
What we need to do then is to add a regularization term to encourage the solution we find to be "well behaved" in some sense of the word.
A typical simple regularizer just adds an identity matrix times some small constant $\lambda$.
Instead of solving $$M^tMv = M^tb$$
we solve $$(M^tM + \lambda I)v = M^t b$$
for some small value of $\lambda$, maybe one in a thousand or one in a million.
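A minimal NumPy sketch of the above, with made-up data: 5 points and a degree-9 polynomial, so the plain normal equations are singular, while the ridge-regularized system has a unique solution. The specific points and the value of $\lambda$ are illustrative choices, not from the answer.

```python
import numpy as np

# Made-up data: 5 measurements, to be fit by a degree-9 polynomial.
x = np.linspace(0.0, 1.0, 5)
b = np.sin(2.0 * np.pi * x)

# Vandermonde design matrix M: columns 1, x, x^2, ..., x^9.
M = np.vander(x, 10, increasing=True)
print(np.linalg.matrix_rank(M.T @ M))  # 5, not 10: under-determined

lam = 1e-6  # small ridge parameter lambda
I = np.eye(M.shape[1])

# Regularized normal equations: (M^T M + lambda I) v = M^T b.
v = np.linalg.solve(M.T @ M + lam * I, M.T @ b)
print(v)  # one well-defined solution out of the infinitely many exact fits
```

Without the $\lambda I$ term, `np.linalg.solve` would fail (or return garbage) here, since $M^tM$ has rank 5 but size $10\times 10$.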
For polynomials it may be valuable to punish the coefficients of the higher-order terms more harshly. In this case we can replace the matrix $I$ above with a diagonal matrix $D$ that has larger values at the positions corresponding to the highest-order monomials. This is a soft version of the solution you mention, lowering the maximum order.
Instead of removing the highest-order term completely (which would be the same as forcing its coefficient to 0), we make it cost more for that coefficient to be large. It gets pushed harder towards 0 than the other terms do when solving the system, but is not forced to be exactly 0.
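A sketch of the diagonal-$D$ variant, again with made-up data: a noisy quadratic fit with a degree-7 polynomial, where the penalty weight $10^k$ on the coefficient of $x^k$ is an arbitrary illustrative choice.

```python
import numpy as np

# Made-up data: a noisy quadratic signal sampled at 8 points.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 8)
b = x**2 + 0.05 * rng.standard_normal(x.size)

degree = 7
M = np.vander(x, degree + 1, increasing=True)

lam = 1e-2
# D penalizes the coefficient of x^k with weight 10^k (illustrative choice),
# so high-order terms cost much more than low-order ones.
D = np.diag(10.0 ** np.arange(degree + 1))

# Solve (M^T M + lambda D) v = M^T b.
v = np.linalg.solve(M.T @ M + lam * D, M.T @ b)

# The x^7 coefficient is pushed very close to 0, but not forced to be 0.
print(v)
```

The $x^2$ coefficient survives with a substantial value because it explains the data, while the $x^7$ coefficient, facing a penalty $10^5$ times larger, is squeezed almost to zero.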
Consider fitting a degree $10$ polynomial onto these $10$ blue points that came from some sort of a measurement. The result is the red curve:
However, as these points came from some measurement, they have some inherent error. Now consider fitting a degree $3$ polynomial onto them (green curve):
If we continue our measurement and add more blue points, the green curve generalizes much better:
The newly added light blue points are what you would call a "test set". While the fitted red curve works perfectly on the original (dark blue) points (the "train set"), it fails badly on the newly added test points. As an example, these points are nowhere near where the red curve would predict:
Specifically, the red dashed lines show some of the test set error on $3$ chosen points.
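The experiment in this answer can be sketched numerically (the measurement here is an invented noisy sine signal, and degree 9 is used so the high-degree fit exactly interpolates the 10 train points):

```python
import numpy as np

# Invented "measurement": a smooth signal plus noise.
rng = np.random.default_rng(42)

def measure(x):
    return np.sin(x) + 0.1 * rng.standard_normal(x.shape)

x_train = np.linspace(0.0, 3.0, 10)
y_train = measure(x_train)

# Degree 9 interpolates all 10 train points; degree 3 only smooths them.
p_high = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)
p_low = np.polynomial.Polynomial.fit(x_train, y_train, deg=3)

# "Test set": new measurements between the original points.
x_test = np.linspace(0.15, 2.85, 9)
y_test = measure(x_test)

def mse(p, x, y):
    return np.mean((p(x) - y) ** 2)

print("train MSE high:", mse(p_high, x_train, y_train))  # near zero
print("train MSE low :", mse(p_low, x_train, y_train))
print("test  MSE high:", mse(p_high, x_test, y_test))    # typically much larger
print("test  MSE low :", mse(p_low, x_test, y_test))
```

On the train set the high-degree fit always wins (its error is essentially zero), but on the test set the oscillations between the train points typically make it lose to the low-degree fit.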




If a low-degree polynomial exactly fits the data, then fitting a potentially high-degree polynomial to the data will just yield that low-degree polynomial back (ignoring possible arithmetic error).
But usually the low-degree polynomial is not an exact fit. In that case a higher-degree polynomial fits the data better (if it didn't, the lower-degree result would be selected, as you said). But the higher-degree fit is often a worse interpolant due to large oscillations between the data points: you are in effect "fitting the noise" rather than "fitting the signal".
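The nesting argument in this answer can be checked directly (the noisy straight-line data below are made up): on the same training points, the least-squares error can only decrease or stay flat as the degree grows, because every lower-degree fit remains a candidate at the higher degree; everything the extra terms "explain" beyond the true degree is noise.

```python
import numpy as np

# Made-up data: a straight line plus noise, sampled at 12 points.
rng = np.random.default_rng(7)
x = np.linspace(0.0, 1.0, 12)
y = 2.0 * x + 0.2 * rng.standard_normal(x.size)

# Least-squares training error for fits of increasing degree.
errors = []
for deg in range(8):
    coeffs = np.polynomial.polynomial.polyfit(x, y, deg)
    resid = np.polynomial.polynomial.polyval(x, coeffs) - y
    errors.append(np.mean(resid**2))

# The sequence is monotone non-increasing: a big drop from degree 0 to 1
# (the signal), then small decreases that only chase the noise.
print(errors)
```

The training error never goes up with degree; it is the interpolation (test) error, as in the previous answer, that gets worse.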