Equivalent formulation of linear discrimination problem on Boyd convex optimization slides


On p. 189 of Boyd's convex optimization slides (http://stanford.edu/class/ee364a/lectures.html) he presents the linear discrimination problem.

Given data $\{x_1, \ldots, x_N\}$, $\{y_1, \ldots, y_M\}$

The problem of finding a hyperplane $(a,b)$ that strictly separates the two data sets has two equivalent formulations:

  1. $a^Tx_i +b > 0, \forall i = 1, \ldots, N\quad$ $a^Ty_i +b < 0, \forall i = 1, \ldots, M$

And

  2. $a^Tx_i +b \geq 1, \forall i = 1, \ldots, N\quad $ $a^Ty_i +b \leq -1, \forall i = 1, \ldots, M$

Frankly, I have talked to so many people about this question, and no one has ever been able to offer a coherent or mathematically sensible reason for why this is the case. I asked my professor about it and he sort of hand-waved and said something about epsilons and deltas, which wasn't very helpful.

A couple of confusions:

  1. How do you mathematically go from 1. to 2.? Why bother? What is so wrong with the original formulation?
  2. Why $\geq 1$ and $\leq -1$ in particular? If you can go from $>0, <0$ to $\geq 1, \leq -1$ , can you also go from $>0, <0$ to $> \infty, < -\infty$ ?
  3. Wouldn't you risk having some data stuck in the band between $-1$ and $1$ if you rewrite the problem this way, and hence fail the discrimination/classification?

100 points to whoever can offer even a semi-coherent argument

Best answer:

Why are these models equivalent?

What does it mean for two models to be equivalent? Here's my answer: two models are "equivalent" if the solution to one model readily leads to the solution of the other, and vice versa.

Suppose you have a solution to (1) above: that is, you have an $(a,b)$ pair that satisfies $$a^Tx_i+b>0 ~~ \forall i=1,\dots, N, \quad a^Ty_i+b<0 ~~\forall i=1,\dots, M$$ Well, those inequalities still hold if we multiply them by any positive value $\alpha$. Put another way, for any $\alpha>0$, $(\bar{a},\bar{b})\triangleq (\alpha a,\alpha b)$ also satisfies those same inequalities. In fact, $(\bar{a},\bar{b})$ describes the same hyperplane. There is a degree of freedom here that we can adjust without changing the solution.

So now let's go back to your original solution $(a,b)$ and perform this computation: $$\beta_x \triangleq \min_i a^T x_i + b, \quad \beta_y \triangleq \max_i a^T y_i + b, \quad \beta \triangleq \min\{\beta_x,-\beta_y\}$$ It should be clear that $\beta_x>0$, $\beta_y<0$, and $\beta>0$. Now let's define $(\bar{a},\bar{b})=(\beta^{-1}a,\beta^{-1} b)$. Then $$ \bar{a}^T x_i + \bar{b} = \beta^{-1} (a^T x_i + b) \geq \beta^{-1} \min_j (a^T x_j + b) = \beta^{-1} \beta_x \geq 1 \quad \forall i=1,\dots, N$$ $$ \bar{a}^T y_i + \bar{b} = \beta^{-1} (a^T y_i + b) \leq \beta^{-1} \max_j (a^T y_j + b) = \beta^{-1} \beta_y\leq -1 \quad \forall i=1,\dots, M$$ Eliminating the intermediate results above, we have just shown that $$ \bar{a}^T x_i + \bar{b} \geq 1 ~~ \forall i=1,\dots, N, \quad \bar{a}^T y_i + \bar{b} \leq -1 ~~ \forall i=1,\dots, M$$ which is exactly the model proposed in (2). So: given any solution to (1), we can readily construct a solution to (2).

Going the other way is easy. After all, if you have a pair $(a,b)$ that satisfies (2), it already satisfies (1). No transformation at all is necessary.

We have shown, then, that any solution to (1) readily leads to a solution to (2), and any solution to (2) readily leads to a solution to (1)---in fact, it already *is* a solution to (1). Thus the two formulations are equivalent. In particular, no data points get stuck in that $[-1,+1]$ band if you solve (2) instead of (1): by construction, every point lands outside it.

Why do we care?

It's not just that (1) and (2) are equivalent. We actually prefer to solve (2) in practice. Remember, numerical solvers cannot guarantee that your inequalities are solved exactly. You have to be willing to tolerate a certain amount of numerical error.

When you pass model (1) to a solver, what you're really asking it to solve is something like this: $$a^Tx_i+b>-\epsilon ~~ \forall i=1,\dots, N, \quad a^Ty_i+b<+\epsilon~~\forall i=1,\dots, M$$ for some small, positive numerical tolerance $\epsilon$. The problem is that $(a,b)=(0,0)$ trivially satisfies these inequalities. So as far as the solver is concerned, $(0,0)$ is a perfectly acceptable solution to this model! For you, of course, it is perfectly worthless.

Now, you can say, "well, why not just prevent the solver from picking zero exactly?" But it's not just exact zero that is the problem: any $(a,b)$ with sufficiently small norm will cause the inequalities to be trivially satisfied.
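A quick sanity check of this failure mode, using made-up data and a hypothetical solver tolerance $\epsilon$:

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

eps = 1e-8  # hypothetical solver tolerance
xs = [(2.0, 1.0), (3.0, 0.5)]
ys = [(-1.0, 0.0), (-2.0, 1.0)]

a, b = (0.0, 0.0), 0.0  # the useless zero "solution"

# The zero hyperplane passes the eps-relaxed version of (1)...
passes_relaxed_1 = (all(dot(a, x) + b > -eps for x in xs) and
                    all(dot(a, y) + b < eps for y in ys))
# ...even though it separates nothing: passes_relaxed_1 is True.
```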

In contrast, consider model (2). When you account for numerical error, what the solver is really solving is this: $$a^Tx_i+b\geq 1-\epsilon ~~ \forall i=1,\dots, N, \quad a^Ty_i+b\leq -1+\epsilon~~\forall i=1,\dots, M$$ Now, as long as $\epsilon<1$, any solution the numerical solver finds is still a valid solution to (1)! It may not satisfy $\geq 1$ and $\leq -1$ exactly, but it will satisfy $>0$ and $<0$, which is what you really wanted in the first place.
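To make this concrete, here is a made-up one-dimensional example: an exact solution to (2) is perturbed by a hypothetical solver error of size $\epsilon < 1$, and the strict inequalities of (1) still hold.

```python
eps = 1e-3  # hypothetical solver error, well below 1
xs = [3.0, 5.0]
ys = [-2.0, -4.0]
a, b = 0.5, -0.5  # exact: a*x + b >= 1 on xs, a*y + b <= -1 on ys

# Simulate solver error that shrinks the margins by roughly eps.
a_noisy, b_noisy = a - eps, b + eps

# The noisy solution may violate the >= 1 / <= -1 bounds slightly,
# but it still strictly separates: strict_1_holds is True.
strict_1_holds = (all(a_noisy * x + b_noisy > 0 for x in xs) and
                  all(a_noisy * y + b_noisy < 0 for y in ys))
```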

The bottom line: the normalization step that converts your model from (1) to (2) is not arbitrary at all; it has real numerical consequences.

And yes, you don't have to pick 1. You can pick 2, 0.5, whatever. You just have to pick something nonzero (ideally larger than $\epsilon$ and smaller than $1/\epsilon$, of course!).