I am learning support vector regression but cannot fully understand the rationale behind the slack-variable trick in its formulation. The original optimization problem for SVR is as follows:
$\mathrm{min}\left\{C\sum_{i=1}^NL_\epsilon\left(y_i,w_0+\mathbf{w}^T\mathbf{x}_i\right)+\frac{1}{2}||\mathbf{w}||^2\right\}$
where $L_\epsilon\left(y_i,w_0+\mathbf{w}^T\mathbf{x}_i\right)=\mathrm{max}\left\{0,\big|y_i-\left(w_0+\mathbf{w}^T\mathbf{x}_i\right)\big|-\epsilon\right\}$ is the $\epsilon$-insensitive error function. Then all the papers and textbooks I read say to introduce two slack variables $\xi_i^+$ and $\xi_i^-$ such that the above problem transforms to:
$\mathrm{min}\left\{C\sum_{i=1}^N\left(\xi_i^++\xi_i^-\right)+\frac{1}{2}||\mathbf{w}||^2\right\}$ s.t. $\xi_i^+\geq0,\xi_i^-\geq0,\xi_i^++\epsilon\geq y_i-\left(w_0+\mathbf{w}^T\mathbf{x}_i\right)\geq-\xi_i^--\epsilon$
However, I just don't see the necessity to introduce two slack variables instead of one. In fact, if we simply let $\xi_i=L_\epsilon\left(y_i,w_0+\mathbf{w}^T\mathbf{x}_i\right)$, the original problem can be written as:
$\mathrm{min}\left\{C\sum_{i=1}^N\xi_i+\frac{1}{2}||\mathbf{w}||^2\right\}$ s.t. $\xi_i=\mathrm{max}\left\{0,\big|y_i-\left(w_0+\mathbf{w}^T\mathbf{x}_i\right)\big|-\epsilon\right\}$
The above problem is equivalent to
$\mathrm{min}\left\{C\sum_{i=1}^N\xi_i+\frac{1}{2}||\mathbf{w}||^2\right\}$ s.t. $\xi_i\geq0,\xi_i+\epsilon\geq y_i-\left(w_0+\mathbf{w}^T\mathbf{x}_i\right)\geq-\xi_i-\epsilon$
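To spell out why I believe the two are equivalent (assuming $C>0$), write $r_i=y_i-\left(w_0+\mathbf{w}^T\mathbf{x}_i\right)$ for the residual. The double inequality then says
$-\xi_i-\epsilon\leq r_i\leq\xi_i+\epsilon\iff\xi_i\geq r_i-\epsilon\ \text{and}\ \xi_i\geq-r_i-\epsilon\iff\xi_i\geq|r_i|-\epsilon$
which together with $\xi_i\geq0$ gives $\xi_i\geq\mathrm{max}\left\{0,|r_i|-\epsilon\right\}$. Since the objective is increasing in each $\xi_i$, the minimum is attained with equality, i.e. $\xi_i=L_\epsilon\left(y_i,w_0+\mathbf{w}^T\mathbf{x}_i\right)$.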
That is, we can use just one slack variable and still write this in standard quadratic programming form. Am I wrong? If not, why go to the trouble of introducing two slack variables? Does it give any computational advantage, or is it just an aid to interpreting the concept?
I ran into the same question while studying SVR, and even though this post is 2 years old, an answer may still help others.
The slack variables in SVR are defined as follows:
-> $\xi_i^+$ is 0 if the training point is below the upper bound of the $\epsilon$-tube, and positive if it is above
-> $\xi_i^-$ is 0 if the training point is above the lower bound of the $\epsilon$-tube, and positive if it is below
So the two definitions cover opposite sides of the tube, and neither can stand in for the other. If we used only one of them, say $\xi_i^+$, then for a point far below the lower bound its value would still be 0, and that violation would go unpenalized. Look at the image below to convince yourself.
[Figure: illustration of $\xi^+$ and $\xi^-$]
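The same point can be checked numerically. Below is a minimal sketch, not taken from any SVR library: the helper names `slacks` and `eps_insensitive_loss`, the fixed prediction of 0 and the sample targets are all made up for illustration. It computes the smallest values of the two slack variables that satisfy the constraints for a given prediction, and the ε-insensitive loss alongside them.

```python
def slacks(y, y_pred, eps):
    """Smallest slack values satisfying the SVR constraints for one point.

    Hypothetical helper for illustration only.
    """
    residual = y - y_pred
    xi_plus = max(0.0, residual - eps)    # positive only above the upper bound
    xi_minus = max(0.0, -residual - eps)  # positive only below the lower bound
    return xi_plus, xi_minus


def eps_insensitive_loss(y, y_pred, eps):
    """Epsilon-insensitive loss max{0, |y - y_pred| - eps}."""
    return max(0.0, abs(y - y_pred) - eps)


eps = 0.5     # assumed tube half-width
y_pred = 0.0  # assumed model prediction, the same for every sample point

rows = []
for y in [3.0, 1.0, 0.2, -2.0]:
    xi_plus, xi_minus = slacks(y, y_pred, eps)
    loss = eps_insensitive_loss(y, y_pred, eps)
    rows.append((y, xi_plus, xi_minus, loss))
    print(f"y={y:5.1f}  xi+={xi_plus:.1f}  xi-={xi_minus:.1f}  loss={loss:.1f}")
```

For the point at y = -2.0, which lies below the lower bound, ξ+ stays at 0 and only ξ- picks up the violation. In every row at most one of the two slacks is nonzero, and their sum equals the ε-insensitive loss.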