I am using regularized least squares, more specifically generalized Tikhonov regularization (ridge regression), on a real dataset where rows << cols:
$$X = (A^T A + \lambda I)^{-1} A^T b$$
I am implementing this in C by invoking LAPACK routines. To factor and solve the system, I use LU decomposition with partial pivoting via DGESV.
I am trying different values of the regularization coefficient, and for each one I compute the mean squared error (MSE) on the training set and on the test set.
Conceptually, as the regularization coefficient $\lambda$ gets smaller ($\lambda \to 0$), the training MSE should become small and approach zero, meaning the solution $X$ overfits the dataset.
I don't observe this behavior. For example, the MSE for $\lambda = 0.0001$ and $\lambda = 0.0$ is the same, and it is large ($MSE = 0.05$ on the training set and $MSE = 0.07$ on the test set).
Could anyone explain why I get the same MSE for different regularization coefficients $\lambda$? Could this be because of nonlinearity in the dataset?
Well, first I would note that, although I don't know the magnitude of the values in your dataset, the two $\lambda$s are very close indeed. If you look at the closed-form expression of the ridge estimator, it is not surprising that they induce almost identical MSEs: any difference likely only shows up several decimal places out.
Additionally, note that it is not necessarily true that the smaller $\lambda$ is, the smaller the test MSE. The choice of the optimal $\lambda$ rests on the bias-variance trade-off: the higher $\lambda$, the larger the bias and the smaller the variance. It is therefore entirely possible that, up to a certain point, increasing $\lambda$ induces a reduction in variance (via the shrinkage of the ridge coefficients) that outweighs the increase in bias, thereby reducing the test MSE, which, keep in mind, is an increasing function of both the variance and the squared bias.