Why do we need regularization when there is a lot of data?


I am reading the textbook Machine Learning: A Probabilistic Perspective, and in Chapter 8 (which covers logistic regression) there is a paragraph that says:

Just as we prefer ridge regression to linear regression, so we should prefer MAP estimation for logistic regression to computing the MLE. In fact, regularization is important in the classification setting even if we have lots of data. To see why, suppose the data is linearly separable. In this case, the MLE is obtained when $\lVert w \rVert \rightarrow \infty$, corresponding to an infinitely steep sigmoid function, $I(w^T x > w_0)$, also known as a linear threshold unit. This assigns the maximal amount of probability mass to the training data. However, such a solution is very brittle and will not generalize well.

Could somebody explain why the MLE for logistic regression is obtained when $\lVert w \rVert \rightarrow \infty$, and why this assigns the maximal amount of probability mass to the training data?

Best Answer

The key assumption here is that the data are linearly separable, meaning that there exists a weight vector $w^*$ such that $I(w^{*T} x_i > 0) = y_i$ for all $i = 1, \cdots, n$ (writing $w^*$ rather than $w_0$ to avoid a clash with the bias term in the quoted passage). If this is the case, the MLE is ill-defined: for any $c > 1$, $c w^*$ has strictly higher likelihood than $w^*$, since scaling up the weights pushes the predicted probability $\sigma(c\, w^{*T} x_i)$ of each correctly classified training label toward $1$. The likelihood therefore increases without bound along this ray and approaches its supremum (probability $1$ on every training point) only in the limit $\|w\| \to \infty$; no finite maximizer exists.
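A small numerical sketch of this argument, using hypothetical toy data in NumPy: on linearly separable data, scaling a separating weight by $c > 1$ strictly decreases the negative log-likelihood, which can be driven arbitrarily close to zero but never attains it at any finite weight.

```python
import numpy as np

# Illustrative linearly separable 1-D data: x < 0 labeled 0, x > 0 labeled 1.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])

def neg_log_likelihood(w):
    """Logistic-regression NLL for a scalar weight w (no intercept).

    Uses the numerically stable form -log p(y_i | x_i) = log(1 + exp(-m_i)),
    where m_i = w * x_i if y_i = 1 and m_i = -w * x_i if y_i = 0, to avoid
    log(0) when the sigmoid saturates at large |w|.
    """
    z = w * X
    margins = np.where(y == 1, z, -z)  # positive iff the point is classified correctly
    return np.sum(np.log1p(np.exp(-margins)))

# w = 1 separates the data; scaling it up strictly increases the likelihood,
# so the NLL shrinks toward 0 and the "MLE" only exists in the limit w -> infinity.
for c in [1.0, 10.0, 100.0]:
    print(f"c = {c:6.1f}   NLL = {neg_log_likelihood(c):.3e}")
```

With a regularizer such as a penalty $\lambda \|w\|^2$ added to the NLL, the objective would instead grow for large $c$ and a finite minimizer would exist, which is exactly the point of the quoted passage.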

However, keep in mind that MLAPP is written from a firmly Bayesian perspective, and as a frequentist I do not agree at all with this argument as a justification for significant regularization on large data sets :) From a frequentist perspective, if the logistic regression model is true, then as the sample size grows it becomes less and less likely that the data are linearly separable at all. As a result, asymptotically speaking, the MLE is the most statistically efficient estimator in a logistic regression model of fixed dimension.