My question is: is the original hard-margin support vector classifier optimal in some sense? If you have an answer that refers to the soft-margin SVC instead, I'd also be interested in that.
I know that the SVC has advantages:
– it does not depend on specific distributional assumptions
– it deals gracefully and without special tricks with the $p > n$ situation
– it produces a solution which is sparse w.r.t. the training data
– it integrates naturally with the "kernel trick".
I know that the SVC performs well in many practical problems.
I know that it is possible to derive diverse upper bounds on the generalization error of the SVC.
I know that the SVC is based on an optimization process: finding the separating hyperplane with the largest margin.
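To fix notation (the standard formulation, e.g. as in ESL): given training data $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, the hard-margin problem is

$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1, \quad i = 1, \dots, n,$$

where the resulting margin is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin.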
… but I'm unclear about the statistical motivation of the margin maximization approach itself.
Of course it makes intuitive sense: First make sure that all training data points are correctly classified (separating hyperplane), then choose the hyperplane that least invites any ambiguity as to how the training data should be classified. But is this more than a plausible heuristic? Is there e.g. a theorem that shows that maximizing the margin leads to the smallest generalization error under some set of assumptions?
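The closest thing I have found is Vapnik's radius-margin result, which (if I state it correctly) bounds the VC dimension $h$ of $\Delta$-margin separating hyperplanes on data contained in a ball of radius $R$:

$$h \ \le\ \min\!\left(\frac{R^2}{\Delta^2},\ p\right) + 1,$$

so a larger margin gives a smaller capacity term in the generalization bound. But a smaller upper bound is not the same as a smaller generalization error, which is exactly my question.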
Hastie et al. write in The Elements of Statistical Learning:
Not only does this provide a unique solution to the separating hyperplane problem, but by maximizing the margin between the two classes on the training data, this leads to better classification performance on test data.
…
The intuition is that a large margin on the training data will lead to good separation on the test data.
This suggests the "plausible heuristic" interpretation. They further write:
Vapnik’s structural risk minimization (SRM) approach fits a nested sequence of models of increasing VC dimensions $h_1 < h_2 < \cdots$ , and then chooses the model with the smallest value of the upper bound. … An example in which the structural risk minimization program can be successfully carried out is the support vector classifier …
This sounds like what makes the SVC special is that it *can be analyzed* by the SRM approach – but not necessarily that it is SRM-optimal, since the alternatives simply cannot be analyzed this way.
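For reference, the upper bound Hastie et al. allude to is, as I understand it, the standard VC bound: with probability at least $1 - \eta$ over the draw of $n$ training points,

$$R(f)\ \le\ R_{\text{emp}}(f) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\eta}}{n}},$$

where $h$ is the VC dimension and $R_{\text{emp}}$ the training error. SRM then picks the model in the nested sequence minimizing the right-hand side – which is an upper bound, not the generalization error itself.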
Can anyone shed light on this?