I'm learning soft margin support vector machines from this book. It says that in soft margin SVMs we allow minor classification errors, so that noisy or non-linearly separable datasets, or datasets with outliers, can still be classified correctly. To do this, the following constraint is introduced:
$$y_i({\bf w}\cdot {\bf x}_i + b) \geq 1 - \zeta_i, \quad \zeta_i \geq 0$$
Since each $\zeta_i$ could otherwise be made arbitrarily large, we also add a penalty to the optimization objective to restrict the values of $\zeta_i$. Doing this leads to the largest possible margin with the minimum possible error (misclassifications). After adding the penalty, the original SVM optimization function becomes:
$$\min_{{\bf w}, b, {\bf \zeta}} \left(\frac{1}{2} {||{\bf w}||}^2 + C\sum_{i=1}^{m} \zeta_i \right)$$
Here $C$ is added to control the "softness" of the SVM. What I don't understand is how different values of $C$ control this so-called "softness". In the book mentioned above and in this question, it's written that higher values of $C$ make the SVM behave nearly the same as a hard margin SVM, while lower values of $C$ make the SVM "softer" (allow more errors).
How can this conclusion be seen intuitively from the above equation? To me, choosing $C$ near $0$ seems to make the above function more like a hard margin SVM. So why does the soft margin SVM become a hard margin SVM as $C \to +\infty$?
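For what it's worth, the effect can be checked empirically. The sketch below (my own illustration, assuming scikit-learn and NumPy are available; the dataset and the specific $C$ values are made up for the demo) fits a linear SVM on two overlapping Gaussian blobs and reports the total slack $\sum_i \zeta_i$ and the margin width $2/\|{\bf w}\|$ for a few values of $C$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not perfectly separable.
X = np.vstack([rng.normal(-1, 1.2, size=(50, 2)),
               rng.normal(+1, 1.2, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

totals = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Slack of each point: zeta_i = max(0, 1 - y_i * f(x_i))
    slack = np.maximum(0.0, 1 - y * clf.decision_function(X))
    totals[C] = slack.sum()
    width = 2 / np.linalg.norm(clf.coef_)
    print(f"C={C:>6}: total slack = {totals[C]:.2f}, margin width = {width:.2f}")
```

On data like this, the total slack shrinks and the margin narrows as $C$ grows, which is the "large $C$ approaches hard margin" behaviour in action.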
EDIT
Here is the same question, but I don't understand the answer.
With perfect separation, you require that $$ y_i({\bf w}\cdot {\bf x}_i + b) \geq 1 $$ So the $\xi_i$ are the deviations you allow from the above inequality. When $C$ is large, minimizing $\|w\|^2 + C \sum_{i=1}^n \xi_i$ forces the $\xi_i$ to be small, since their sum carries a large weight. When $C$ is small, their sum carries a small weight, so at the minimum the $\xi_i$ may be larger, allowing more deviation from the above inequality.
When $C$ is extremely large, the only way to minimize the objective is to make the deviations extremely small, bringing the result close to hard margin SVM.
Elaboration
I see that there is some confusion between the optimal value and the optimal solution. The optimal value is the minimal value of the objective function. The optimal solution is the set of actual variables attaining it (in your case $\bf w$ and $\bf \xi$). The optimal value may become large when $C$ goes to infinity, but you did not ask about the optimal value at all!
Now, let us go a bit abstract. Assume you are solving an optimization problem of the form $$ \min_{{\bf x}, {\bf y}} ~ \alpha f({\bf x}) + \beta g({\bf y}) \quad \text{s.t.} \quad ({\bf x}, {\bf y}) \in D, $$ where $\alpha, \beta > 0$ are some constants. To make the objective as small as possible, we need to somehow balance $f$ and $g$: choosing $\bf x$ such that $f$ is small might constrain us to choose $\bf y$ such that $g$ becomes larger, and vice versa.
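This balancing act can be seen in a toy instance of the abstract problem (my own example, not from the answer): take scalar $f(x) = x^2$, which wants $x = 0$, and $g(x) = (x-1)^2$, which wants $x = 1$. Setting the derivative of $\alpha x^2 + \beta (x-1)^2$ to zero gives the minimizer in closed form, and the weights decide which function wins:

```python
def argmin(alpha, beta):
    # d/dx [alpha*x^2 + beta*(x-1)^2] = 0  =>  x = beta / (alpha + beta)
    return beta / (alpha + beta)

for beta in (0.01, 1.0, 100.0):
    x = argmin(alpha=1.0, beta=beta)
    print(f"beta={beta:>6}: x* = {x:.3f}, f = {x**2:.3f}, g = {(x - 1)**2:.3f}")
```

As $\beta/\alpha$ grows, the minimizer moves toward $x = 1$ and $g$ is driven to zero at the expense of $f$, exactly the trade-off described above with $\beta$ playing the role of $C$.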
If $\alpha$ is much larger than $\beta$, then it is 'more beneficial' to make $f$ small, at the expense of making $g$ a bit larger. The same holds the other way around.
In your case you have two functions, $\|{\bf w}\|^2$ and $\sum_{i=1}^n \xi_i$, with $\alpha = 1$ and $\beta = C$. If $C$ is much smaller than $1$, it is 'beneficial' to make the norm of $\bf w$ small. If $C$ is much larger than $1$, it is the other way around.
Since $\xi_i \geq 0$, the sum $\sum_{i=1}^n \xi_i$ is exactly $\|{\bf \xi}\|_1$, so a large $C$ pushes the entries $\xi_i$ toward small values. Moreover, it is well known that minimizing the $\ell_1$ norm promotes sparsity (just Google it), meaning that as $C$ increases, more and more entries of $\xi$ become exactly zero.
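The sparsity claim can also be probed numerically. This rough check (my own sketch, assuming scikit-learn and NumPy; the data and threshold are illustrative) counts how many slacks $\xi_i$ are numerically nonzero as $C$ grows:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two overlapping Gaussian blobs, so some slack is unavoidable.
X = np.vstack([rng.normal(-1, 1.0, size=(60, 2)),
               rng.normal(+1, 1.0, size=(60, 2))])
y = np.array([-1] * 60 + [+1] * 60)

nonzero = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    slack = np.maximum(0.0, 1 - y * clf.decision_function(X))
    # Count slacks above a small numerical tolerance.
    nonzero[C] = int((slack > 1e-6).sum())
    print(f"C={C:>6}: nonzero slacks = {nonzero[C]} / {len(y)}")
```

On data like this, the number of points with nonzero slack drops sharply as $C$ increases: most $\xi_i$ are driven exactly to zero, and only the genuinely overlapping points keep a positive slack.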