I'm reading this tutorial about SVMs.
I'd like to have two clarifications:
- at page 4 (bottom), why is it that, after using (1.10), the summation extends only over $m \in S$? In (1.10) the summation runs over all elements of $L$, and I don't think ${\mathbf x}_m\cdot {\mathbf x}_s = 0$ is necessarily true for the other elements.
- page 5: why is taking the average better? Isn't the value of $b$ supposed to be unique?
$1$. The reason is that for any non-support vector ${\mathbf x}_i$, we have $\alpha_i = 0$, so those terms drop out of the sum. This is how Lagrange multipliers work for inequality constraints such as the ones here: only the vectors for which the inequality becomes an equality, which are support vectors by definition, have a non-zero Lagrange multiplier. A constraint with a zero Lagrange multiplier is not "active" at the optimal solution, i.e. it makes no difference to it, which is what you'd expect for vectors not lying on the hyperplanes $H_1$ or $H_2$.
For worked examples of this, see: Kuhn-Tucker examples
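To see the effect numerically, here is a small sketch (the data points are made up for illustration) that solves the hard-margin dual problem directly with `scipy.optimize.minimize` and shows that the multipliers of the non-support vectors come out as zero:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: two points per class, linearly separable.
# The max-margin hyperplane is x1 = 2, so only (1,0) and (3,0) sit
# on the margin hyperplanes H1/H2 -- they are the support vectors.
X = np.array([[1.0, 0.0], [0.0, 1.0],   # class -1
              [3.0, 0.0], [4.0, 1.0]])  # class +1
y = np.array([-1.0, -1.0, 1.0, 1.0])

Z = y[:, None] * X
K = Z @ Z.T                              # K_ij = y_i y_j x_i . x_j

# Hard-margin dual: maximise sum(alpha) - 0.5 alpha' K alpha
# subject to alpha_i >= 0 and sum_i alpha_i y_i = 0.
res = minimize(lambda a: 0.5 * a @ K @ a - a.sum(),
               x0=np.zeros(len(y)),
               jac=lambda a: K @ a - np.ones(len(y)),
               bounds=[(0.0, None)] * len(y),
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
               method='SLSQP')
alpha = res.x
print(np.round(alpha, 4))   # non-support vectors get alpha ~ 0
```

The two interior points end up with $\alpha_i \approx 0$, so dropping them from the sum, as the tutorial does after (1.10), changes nothing.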
$2$. I think in nearly all cases the value of $b$ will be the same for all the support vectors; the averaging is just a safeguard against anomalies in the numerical calculation of $b$. I agree with you that in theory, all the $b$'s should have the same value. See Page 565 of these notes where there is an example of such calculations: there is a small variation in the numerical values of $b$ across the support vectors. It's only tiny in that example, but one might be able to construct cases where the discrepancies become a problem. Taking the average minimises the risk of inadvertently striking such cases.
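As a concrete sketch of the averaging step (with made-up support vectors and multipliers standing in for the output of the dual optimisation): $b$ is computed once per support vector via $b_s = y_s - {\mathbf w}\cdot{\mathbf x}_s$, and the reported $b$ is the mean of these.

```python
import numpy as np

# Hypothetical toy solution: support vectors of a 2-D problem, with
# multipliers alpha as they would come out of the dual optimisation.
X_sv  = np.array([[1.0, 0.0], [3.0, 0.0]])
y_sv  = np.array([-1.0, 1.0])
alpha = np.array([0.5, 0.5])

w = (alpha * y_sv) @ X_sv      # w = sum_s alpha_s y_s x_s
b_each = y_sv - X_sv @ w       # b_s = y_s - w . x_s, one value per SV
b = b_each.mean()              # the averaging safeguard
print(b_each, b)
```

In exact arithmetic every entry of `b_each` is identical, so the mean changes nothing; with floating-point round-off the entries can differ slightly, and the mean is a cheap way to smooth that out.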