I have been learning about support vector machines from Andrew Ng's video lectures, and I understand why we try to minimize $\frac{1}{2}\|w\|^2$.
The margin (width) of the separator is $2/\|w\|$. Since we want to maximize the width, we want to minimize $\|w\|$, or equivalently $\frac{1}{2}\|w\|^2$. Now we use the Lagrange expression here, $$L = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i(y_i(w \bullet x_i+b)-1).$$ To find the minimum of $\frac{1}{2}\|w\|^2$, we set the gradient to zero: $\nabla L = 0$. From $\partial L/\partial b = 0$ we get $\sum_i \alpha_i y_i = 0$, and from $\partial L/\partial w = 0$ we get $w = \sum_i \alpha_i y_i x_i$. Substituting these two equations back into the Lagrange expression, we end up with $$L(w,b,\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j (x_i \bullet x_j).$$
I really understand up to here, but the professor said that we want to maximize $L$ with respect to $\alpha$ and at the same time minimize it with respect to $w$ and $b$, so the final Lagrangian primal problem becomes $$\min_{w,b}\max_\alpha L(w, b, \alpha)$$ $$\text{s.t. } \alpha_i \geq 0,\quad i=1,2,\dots,m.$$ Why? Why are we trying to maximize and then minimize the Lagrange expression $L$? Any intuitive explanation will help.
@Kenny Wong's answer is very detailed.
I will try to add more explanation of why there are both a minimization and a maximization in the formula, which I believe is what you are struggling to understand.
Given the function $$L(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - \sum_i α_i(y_i(w . x_i+b)-1)$$
The key here is to see that the problem:
$$\min_{w, b} \left( \max_{\alpha_i; \alpha_i \geq 0} L(w,b,\alpha_i)\right)$$
is equivalent to the problem: $$\min_{w, b} \tfrac 1 2 \| w\|^2$$ subject to $$y_i(w.x_i + b) \geq 1.$$
Indeed, if we only consider $$\max_{\alpha_i; \alpha_i \geq 0} L(w,b,\alpha_i)$$ and select a $w$ and a $b$ such that $$ y_i (w.x_i + b) \lt 1,$$ then we can easily make $L$ arbitrarily large just by selecting the appropriate $\alpha_i$.
For instance, imagine we select a $w$ and a $b$ such that, for every point, $$ y_i (w.x_i + b) = 0.5.$$ Then $$L(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - \sum_i α_i(0.5-1)$$
$$L(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - \sum_i -0.5α_i$$
Now imagine that there are three $α_i$ and they are all equal to 1.
Then in this case we would have $$L(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - (-1.5)$$ $$L(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 + 1.5$$
If instead the $α_i$ are all equal to 2, then we have
$$L(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 + 3$$
and so on...
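This blow-up is easy to check numerically. The weight vector and margin values below are made-up numbers chosen only to match the example above:

```python
import numpy as np

# Made-up point where every margin constraint is violated:
# y_i (w . x_i + b) = 0.5 < 1 for three training points.
w = np.array([1.0, 1.0])             # hypothetical weight vector
margins = np.array([0.5, 0.5, 0.5])  # hypothetical values of y_i (w . x_i + b)

def lagrangian(alpha):
    # L(w, b, alpha) = 1/2 ||w||^2 - sum_i alpha_i (y_i (w . x_i + b) - 1)
    return 0.5 * np.dot(w, w) - np.sum(alpha * (margins - 1))

print(lagrangian(1 * np.ones(3)))    # 1/2 ||w||^2 + 1.5 = 2.5
print(lagrangian(2 * np.ones(3)))    # 1/2 ||w||^2 + 3.0 = 4.0
print(lagrangian(100 * np.ones(3)))  # 151.0 -- grows without bound with alpha
```

Growing the $\alpha_i$ keeps increasing $L$, so the inner maximum is $+\infty$ whenever any constraint is violated.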
That is what the " $ + \infty\space\text{otherwise} $ " line means in
$$\max_{\alpha_i; \alpha_i \geq 0} L(w,b,\alpha_i) = \begin{cases} \frac 1 2 \| w \|^2 & {\rm if \ \ } y_i (w.x_i + b) \geq 1 \\ + \infty & {\rm otherwise}\end{cases}$$
So maximizing the Lagrangian alone does not solve $$\min_{w, b} \tfrac 1 2 \| w\|^2$$ subject to $$y_i(w.x_i + b) \geq 1,$$ which is our original goal; the maximization over $\alpha$ only serves to penalize constraint violations.
Let us now consider the other case, where we select a $w$ and a $b$ such that, for every point, $$ y_i (w.x_i + b) = 3.$$
Then we will have
$$L(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - \sum_i α_i(3-1)$$
$$L(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - \sum_i 2α_i$$
Because $\alpha_i \geq 0$, the term $$- \sum_i 2α_i$$ will always be zero or negative.
So the maximum of $L(w, b, \alpha_i)$ over $\alpha_i \geq 0$ is exactly $\frac{1}{2}\|w\|^2$, attained at $\alpha_i = 0$, whenever $y_i(w.x_i + b) \geq 1$.
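The same computation as before, now with the constraints satisfied (again with made-up numbers matching the example above), shows the inner maximum sitting at $\alpha_i = 0$:

```python
import numpy as np

# Made-up point where every constraint holds strictly:
# y_i (w . x_i + b) = 3 >= 1 for three training points.
w = np.array([1.0, 1.0])             # hypothetical weight vector
margins = np.array([3.0, 3.0, 3.0])  # hypothetical values of y_i (w . x_i + b)

def lagrangian(alpha):
    # L(w, b, alpha) = 1/2 ||w||^2 - sum_i alpha_i (y_i (w . x_i + b) - 1)
    return 0.5 * np.dot(w, w) - np.sum(alpha * (margins - 1))

# Any alpha_i > 0 only subtracts from L, so the max over alpha_i >= 0
# is at alpha = 0, where L equals 1/2 ||w||^2.
print(lagrangian(np.zeros(3)))  # 1.0  (= 1/2 ||w||^2)
print(lagrangian(np.ones(3)))   # 1.0 - 6.0 = -5.0
```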
Recall that we want to minimize $$\frac{1}{2}\|w\|^2.$$ Since we just saw that, under the constraint $y_i(w.x_i + b) \geq 1$, $$\frac{1}{2}\|w\|^2 = \max_{\alpha_i; \alpha_i \geq 0} L(w,b,\alpha_i),$$ what we need to do is $$\min_{w, b} \tfrac 1 2 \| w\|^2 = \min_{w, b} \left( \max_{\alpha_i; \alpha_i \geq 0} L(w,b,\alpha_i)\right).$$
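To see the equivalence end to end, here is a brute-force sketch on a hypothetical 1-D dataset (two points of my own choosing, not from the lectures): the inner maximum is replaced by its closed form from the case analysis above, and the outer minimization is done by grid search.

```python
import numpy as np

# Hypothetical toy data: one negative point at -2, one positive at +2.
X = np.array([-2.0, 2.0])
y = np.array([-1.0, 1.0])

def inner_max(w, b):
    # max over alpha_i >= 0 of L(w, b, alpha):
    # +infinity if some constraint y_i (w x_i + b) >= 1 is violated,
    # otherwise 1/2 w^2 (attained at alpha = 0).
    if np.any(y * (w * X + b) < 1):
        return np.inf
    return 0.5 * w * w

# Outer minimization over (w, b) by brute-force grid search.
grid = np.linspace(-2, 2, 17)  # step 0.25, exactly representable in binary
obj, w_star, b_star = min((inner_max(w, b), w, b) for w in grid for b in grid)
print(float(obj), float(w_star), float(b_star))  # 0.125 0.5 0.0
```

The widest-margin separator for these two points is $w = 0.5$, $b = 0$, with objective $\frac{1}{2}(0.5)^2 = 0.125$. Grid search is of course only for illustration; real SVM solvers work on the dual quadratic program instead.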