Lagrange Multipliers and Hard Margin SVMs


With hard margin support vector machines (SVMs), it suffices to find the critical points of the Lagrangian $L = \frac{1}{2}||\theta||^2 - \sum_{n=1}^{N} \alpha_n (y^{(n)} (\vec{\theta}^T\vec{x}^{(n)} + \theta_0) - 1)$ over all $\alpha_n$, where $\vec{\theta}^T\vec{x}^{(n)} + \theta_0 = 0$ is the separating hyperplane, $y^{(n)}$ is the label (1 or -1), and there are $N$ data points.

I don't have much of a background in optimization, however, so I wanted to ask for methods that are commonly used to solve this optimization problem. For example, would gradient descent work (it doesn't seem like it would)?

Accepted Answer

The optimisation problem we are solving here reads: $$ \underset{\theta}{\text{min }} \underset{\alpha \geq 0}{\text{max }} \underbrace{\frac{1}{2} \|\theta\|^2 + \sum_{n=1}^N \alpha_n (1-y^{(n)} \langle \theta, x^{(n)} \rangle )}_{L(\theta, \alpha)} $$ where I dropped the vector notation and considered the case $\theta_0 = 0$ for simplicity. Now assume that $$\underset{\theta}{\text{min }} \underset{\alpha \geq 0}{\text{max }} L(\theta, \alpha) = \underset{\alpha \geq 0}{\text{max }} \underset{\theta}{\text{min }} L(\theta, \alpha)$$ (strong duality; this holds here because the problem is a convex quadratic program, and Slater's condition is satisfied whenever the data are linearly separable).

Then, we can simplify the optimisation problem by noting that, for a fixed $\alpha$, the inner minimum over $\theta$ is attained by setting the gradient of the Lagrangian to zero. Doing that we get: $$ \theta = \sum_{n=1}^N \alpha_n y^{(n)} x^{(n)} $$ Substituting this back, the terms combine as $\frac{1}{2}\|\theta\|^2 - \sum_{n=1}^N \alpha_n y^{(n)} \langle \theta, x^{(n)} \rangle = -\frac{1}{2}\|\theta\|^2$, so the optimisation problem reduces to: $$ \underset{\alpha \geq 0}{\text{max }} \underbrace{\sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y^{(n)} y^{(m)} \langle x^{(n)}, x^{(m)} \rangle}_{L(\alpha)} $$ which is a (concave) quadratic problem that can be solved using quadratic programming algorithms. One such algorithm is SMO, which was developed specifically for the SVM optimisation problem.
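As a concrete sketch (the toy dataset and the choice of SciPy's bound-constrained L-BFGS-B solver are my assumptions, not part of the answer), the dual above can be solved numerically as a quadratic maximisation over $\alpha \geq 0$, after which $\theta$ is recovered from the stationarity condition:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical), theta_0 = 0 as in the answer.
X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Gram-style matrix Q_nm = y_n y_m <x_n, x_m>; the dual is
# max_{alpha >= 0} sum(alpha) - 1/2 alpha^T Q alpha.
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(alpha):
    # Negated dual objective, so a minimiser solves the maximisation.
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def neg_dual_grad(alpha):
    return Q @ alpha - np.ones_like(alpha)

res = minimize(neg_dual, x0=np.zeros(len(y)), jac=neg_dual_grad,
               bounds=[(0, None)] * len(y), method="L-BFGS-B")
alpha = res.x

# Recover the primal solution from the stationarity condition.
theta = (alpha * y) @ X
margins = y * (X @ theta)
```

On this toy set all margins $y^{(n)} \langle \theta, x^{(n)} \rangle$ come out at least 1, as the hard-margin constraints require, with equality at the support vectors.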

would gradient descent work?

Yes! However, one typically applies it to the primal optimisation problem, reformulated with the hinge loss (the soft-margin version), since the hard-margin constraints are not directly amenable to plain gradient descent. Details on how stochastic gradient descent can be applied to SVMs can be found in "Understanding Machine Learning" by S. Shalev-Shwartz and S. Ben-David. This approach is less commonly used than SMO.
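A minimal sketch of that stochastic subgradient approach (Pegasos-style; the toy data, regularisation strength $\lambda$, and step-size schedule are illustrative assumptions) on the hinge-loss objective $\frac{\lambda}{2}\|\theta\|^2 + \frac{1}{N}\sum_n \max(0, 1 - y^{(n)}\langle \theta, x^{(n)} \rangle)$:

```python
import numpy as np

# Illustrative data and hyperparameters (assumptions, not from the answer).
rng = np.random.default_rng(0)
X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam = 0.1
theta = np.zeros(2)

for t in range(1, 5001):
    n = rng.integers(len(y))        # sample one training point
    eta = 1.0 / (lam * t)           # standard Pegasos step-size schedule
    grad = lam * theta              # gradient of the regulariser
    if y[n] * (theta @ X[n]) < 1:   # subgradient of the hinge term
        grad -= y[n] * X[n]
    theta -= eta * grad

accuracy = np.mean(np.sign(X @ theta) == y)
```

Each step uses a subgradient because the hinge loss is non-differentiable at margin 1; on separable data like this, the learned $\theta$ classifies all training points correctly.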

Remark

$\theta = \sum_{n=1}^N \alpha_n y^{(n)} x^{(n)}$ means that the optimal parameter vector lies in the linear span of (a subset of) the training data, namely the support vectors. This is a very important result and is in fact a special case of the so-called representer theorems. Note also that after substituting this into our objective, the data appear only through the inner products $\langle x^{(m)}, x^{(n)} \rangle$. This is the key observation behind kernel methods.
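To illustrate that last point (the RBF kernel, the XOR-style toy data, and the solver choice are my assumptions), the dual solve can be repeated with $\langle x^{(m)}, x^{(n)} \rangle$ replaced by a kernel $k(x^{(m)}, x^{(n)})$, never forming $\theta$ explicitly; predictions use $f(x) = \sum_n \alpha_n y^{(n)} k(x^{(n)}, x)$:

```python
import numpy as np
from scipy.optimize import minimize

# XOR-style data: not linearly separable in the input space (assumption).
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def rbf(A, B, gamma=1.0):
    # RBF (Gaussian) kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Same dual as before, with the Gram matrix built from kernel evaluations.
K = rbf(X, X)
Q = (y[:, None] * y[None, :]) * K

res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               x0=np.zeros(len(y)),
               jac=lambda a: Q @ a - np.ones(len(y)),
               bounds=[(0, None)] * len(y), method="L-BFGS-B")
alpha = res.x

def predict(Xnew):
    # f(x) = sum_n alpha_n y_n k(x_n, x); theta is never materialised.
    return np.sign(rbf(Xnew, X) @ (alpha * y))

train_acc = np.mean(predict(X) == y)
```

The kernelised dual separates the XOR-style points even though no linear $\theta$ in the input space could, which is exactly what the inner-product observation buys us.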