I am reading a book about support vector machines, and I don't understand some of the math in it.
Consider the training sample $\{(x_{i}, d_{i})\}^{N}_{i=1}$, where $x_{i}$ is the input pattern for the $i$th example and $d_{i}$ is the corresponding desired response.
[...]
Let $w_{0}$ and $b_{0}$ denote the optimum values of the weight vector and bias, respectively. Correspondingly, the optimal hyperplane, representing a multidimensional linear decision surface in the input space, is defined by $$w^{T}_{0} x + b_{0} = 0 $$
The discriminant function $$g(x) = w^{T}_{0} x + b_{0}$$ gives an algebraic measure of the distance from $x$ to the optimal hyperplane.
We can express $x$ as $$x = x_{p} + r \frac{w_{0}}{||w_{0}||}$$ where $x_{p}$ is the normal projection of $x$ onto the optimal hyperplane and $r$ is the desired algebraic distance.
Since, by definition, $g(x_{p}) = 0$, it follows that $$g(x) = w^{T}_{0} x + b_{0} = r||w_{0}||$$
From: Neural Networks and Learning Machines (3rd Edition) p 270
Why can we express x as $x = x_{p} + r \frac{w_{0}}{||w_{0}||}$ ?
Why does $g(x) = r||w_{0}||$ ?
I wonder how I can represent this hyperplane in two dimensions.
At first I thought that the equation $w^{T}_{0} x + b_{0} = 0$ would be equivalent to a linear function $ax + b$, but I am not quite sure, since in a linear function $a$ is a scalar, whereas in my case $a$ would be $w_{0}$, which is a vector.
$$g(x) = g(x_p + \frac{w_0 r}{||w_0||}) = w_0^T (x_p + \frac{ w_0 r }{||w_0||}) + b_0 = w_0^T x_p + w_0^T \frac{ w_0 r }{||w_0||} + b_0$$
Now observe that $w_0^T x_p + b_0 = g(x_p) = 0$ by construction of $x_p$.
And $$w_0^T w_0 = \langle w_0, w_0 \rangle = ||w_0||^2$$
Hence:
$$g(x) = w_0^T x_p + b_0 + \frac{||w_0||^2r}{||w_0||} = 0 + r||w_0||=r||w_0||$$
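If it helps to see this numerically, here is a minimal NumPy check of the identity $g(x) = r||w_0||$; the specific vectors `w0`, `b0`, and `x` below are made up purely for illustration, not taken from the book:

```python
import numpy as np

# Arbitrary illustrative hyperplane parameters (not from the book)
w0 = np.array([2.0, -1.0, 0.5])
b0 = 0.7
x = np.array([1.0, 3.0, -2.0])

def g(v):
    """Discriminant function g(v) = w0^T v + b0."""
    return w0 @ v + b0

r = g(x) / np.linalg.norm(w0)            # algebraic distance from x to the hyperplane
x_p = x - r * w0 / np.linalg.norm(w0)    # normal projection of x onto the hyperplane

print(np.isclose(g(x_p), 0.0))                    # True: x_p lies on the hyperplane
print(np.isclose(g(x), r * np.linalg.norm(w0)))   # True: g(x) = r * ||w0||
```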
About hyperplanes: a hyperplane in an $n$-dimensional space is just an (affine) subspace of dimension $n-1$. So it's indeed a line in $\mathbb{R}^2$, an ordinary "plane" in $\mathbb{R}^3$, and so on...
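To connect this with the linear function you mention: in $\mathbb{R}^2$, write $w_0 = (w_1, w_2)^T$ and $x = (x_1, x_2)^T$. Then $w_0^T x + b_0 = 0$ becomes $$w_1 x_1 + w_2 x_2 + b_0 = 0,$$ and if $w_2 \neq 0$ you can solve for $x_2$: $$x_2 = -\frac{w_1}{w_2} x_1 - \frac{b_0}{w_2},$$ which is exactly a line of the form you had in mind, with scalar slope $-w_1/w_2$ and intercept $-b_0/w_2$. The vector $w_0$ itself is not the slope; it is the normal to the line.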
Apart from the calculus, it just expresses that if you decompose your $x$ into a component that lies in $P$ (the hyperplane from which $g$ measures the distance) and another component along the direction orthogonal to $P$, then the distance is given entirely by this orthogonal component.
Why can you always do this decomposition? In a simple example (in $\mathbb{R}^2$ with $b_0 = 0$), you can project any vector $x$ onto a line with unit direction vector $u$ just using the scalar product: $x_p = \langle x, u \rangle u$. Then you can check that $x = x_p + (x - x_p)$ and that $\langle x - x_p, u \rangle = 0$ indeed. For an affine subspace this is essentially the same; you just need to handle the constant. You can see it quite easily in a picture (here a projection onto a plane), and a small numeric check of the same decomposition is sketched below the link:
http://www.math4all.in/public_html/linear%20algebra/images/recta8.1.jpg
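Here is that check as a short sketch (again with made-up vectors, and with $u$ normalized so that $x_p = \langle x, u \rangle u$ really is the orthogonal projection):

```python
import numpy as np

# Arbitrary illustrative vectors (not from the book)
u = np.array([3.0, 4.0])
u = u / np.linalg.norm(u)    # unit direction vector of the line

x = np.array([2.0, -1.0])

x_p = (x @ u) * u            # projection of x onto the line spanned by u
residual = x - x_p           # the part of x orthogonal to the line

print(np.isclose(residual @ u, 0.0))    # True: the residual is orthogonal to u
print(np.allclose(x, x_p + residual))   # True: x = x_p + (x - x_p)
```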