I'm currently studying machine learning with the book Pattern Recognition and Machine Learning (Bishop, 2006) and had a question regarding finding the distance between the origin and a linear discriminant function. For anyone curious, this is from Chapter 4.1: Discriminant Functions.
The book starts by giving a linear discriminant function in the typical form of:
$$y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$$
with $\mathbf{x}$ being the input vector, $\mathbf{w}$ the weight vector, and $w_0$ the bias term.
The particular portion of the book I'm having trouble understanding is:
... if $\mathbf{x}$ is a point on the decision surface, then $y(\mathbf{x}) = 0$, and so the normal distance from the origin to the decision surface is given by
$$\frac{\mathbf{w}^T\mathbf{x}}{\Vert \mathbf{w} \Vert} = -\frac{w_0}{\Vert \mathbf{w} \Vert}$$
We therefore see that the bias parameter $w_0$ determines the location of the decision surface.
Perhaps my trouble understanding this comes from a gap in fundamental algebra, but my recollection of the formula for the distance between a line $ax + by + c = 0$ and a point $(x_0, y_0)$ is:
$$d = \frac{| ax_0 + by_0 + c |}{\sqrt{a^2 + b^2}}$$
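For concreteness, this 2D formula is easy to sanity-check numerically (a quick sketch; the line and point below are just made-up example values):

```python
import math

# Line: 3x + 4y - 10 = 0, point: (1, 1) -- arbitrary example values
a, b, c = 3.0, 4.0, -10.0
x0, y0 = 1.0, 1.0

# d = |a*x0 + b*y0 + c| / sqrt(a^2 + b^2)
d = abs(a * x0 + b * y0 + c) / math.sqrt(a**2 + b**2)
print(d)  # |3 + 4 - 10| / 5 = 0.6
```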
and so plugging in the values appropriately would give us:
$$d = \frac{|\mathbf{w}x_0 - y_0 + w_0 |}{\sqrt{\mathbf{w}^T\mathbf{w} + 1}}$$
assuming that $\mathbf{x} = (x_0, y_0)$.
Judging from the equation in the highlighted block, it seems that the distance from the origin to the line is $\mathbf{w}^T \mathbf{x}$, and the "normalized" distance is obtained by dividing by $\Vert \mathbf{w} \Vert$. It is also unclear to me why we would normalize by the weight vector.
I suppose my question could be summarized into:
- How was the distance equation derived? Am I thinking too one-dimensionally with the distance equation I used?
- Why did we choose to normalize by the weight vector?
Any tips or feedback are appreciated. Thanks in advance.
We should compute the distance in the feature space only; the prediction value $y(\mathbf{x})$ is not an extra coordinate.
Let's compute the distance between the origin and the hyperplane $w^Tx + w_0 = 0$. The distance from a point $x_0$ to this hyperplane is $$\frac{|w^Tx_0 + w_0|}{\|w\|},$$ which is the direct generalization of the 2D formula you quoted, with $w = (a, b)$ and $w_0 = c$. Plugging in the origin $x_0 = \mathbf{0}$ gives $$\frac{|w^T\mathbf{0} + w_0|}{\|w\|} = \frac{|w_0|}{\|w\|}.$$
We divide by $\|w\|$ because $w$ is normal to the decision surface, and normalizing it to unit length makes the projection $w^Tx/\|w\|$ a true Euclidean distance.
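A quick numerical check of this (a sketch with arbitrary example values for $w$ and $w_0$): pick a point on the hyperplane, project it onto the unit normal $w/\|w\|$, and compare with Bishop's expression $-w_0/\|w\|$.

```python
import numpy as np

# Arbitrary example weights (assumed values for illustration)
w = np.array([3.0, 4.0])
w0 = -10.0

# Any point x on the decision surface satisfies w @ x + w0 = 0.
# One such point is the foot of the perpendicular from the origin:
x = -w0 * w / (w @ w)
assert np.isclose(w @ x + w0, 0.0)

# Signed distance of x from the origin along the unit normal w / ||w||
signed_dist = (w @ x) / np.linalg.norm(w)
print(signed_dist)               # 2.0
print(-w0 / np.linalg.norm(w))   # 2.0, matching -w0 / ||w||
```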