I'm learning about "linear discriminant analysis" from "Statistical Pattern Recognition" by A.R. Webb and K.D. Copsey (chapter 5 of the 3rd edition).
The general idea is introduced as follows: suppose we have a set of training patterns (vectors) $x_1, ..., x_n$, each of which is assigned to one of two classes, $\omega_1$ or $\omega_2$.
We seek a weight vector $w$ and a threshold $w_0$ such that:
$w^Tx + w_0 > 0 \Rightarrow x \in \omega_1$
$w^Tx + w_0 < 0 \Rightarrow x \in \omega_2$
The decision surface (the boundary separating the region of $\omega_1$ from the region of $\omega_2$) is the hyperplane represented by the equation
$g(x) = w^Tx + w_0 = 0$
So far the introduction is clear.
Next, the authors go on to say that this hyperplane
has unit normal in the direction of $w$, and a perpendicular distance $|w_0|/|w|$ from the origin.
The distance of a pattern $x$ to the decision hyperplane is given by $|r|$, where
$r = g(x)/|w| = (w^Tx+w_0)/|w|$
with the sign of $r$ indicating on which side of the decision hyperplane the pattern lies.
No explanation is given for these results, so I'm wondering why the hyperplane normal is parallel to $w$, why the distance from the origin is $|w_0|/|w|$, and why the distance of a pattern from the hyperplane is $|g(x)|/|w|$.
Could you please give me some insights on how to get to these results myself?
Thanks,
Domenico
Do you know the scalar product? The scalar product of vectors $a=(a_1,\dots,a_n)^T$ and $b=(b_1,\dots,b_n)^T$ is defined by $$\langle a,b\rangle:=a_1b_1+\dots+a_nb_n$$ i.e., $\langle a,b\rangle=a^Tb$ expressed by matrix multiplication.
A basic property of the scalar product is that $\ \langle a,b\rangle=0 \iff a\perp b$.
Let $H:=\{x\mid g(x)=0\}$. If $h,k\in H$ then we have $$w^Th+w_0=w^Tk+w_0=0 \ \implies\ w^T(h-k)=0$$ so that $w\perp(h-k)$, proving that $w\perp H$.
Now measure the distance of $H$ from the origin: the line through the origin perpendicular to $H$ is $\{\lambda w\mid\lambda\in\Bbb R\}$, so let's find its intersection with $H$: $\lambda w\in H$ iff $$\begin{aligned} w^T(\lambda w)+w_0 &=0 \\ \lambda|w|^2 &=-w_0\\ \lambda &=-\frac{w_0}{|w|^2} \end{aligned}$$ where we used $w^Tw=\langle w,w\rangle=|w|^2$. So, with this $\lambda$, the point $\lambda w$ lies in $H$ and is the point of $H$ closest to the origin; its length is $|\lambda|\cdot|w|=\displaystyle\frac{|w_0|}{|w|}$.
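As a quick sanity check with numbers of my own choosing (they are not from the book), take $w=(3,4)^T$ and $w_0=-10$ in the plane, so that $|w|=5$. Then $$\lambda=-\frac{w_0}{|w|^2}=\frac{10}{25}=0.4,\qquad \lambda w=(1.2,\ 1.6)^T,\qquad w^T(\lambda w)+w_0=3.6+6.4-10=0,$$ so $\lambda w$ really lies on the hyperplane, and its length is $0.4\cdot 5=2=|w_0|/|w|$.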
If we fix any point $h_0$ in $H$ (say, $h_0:=\lambda w$ with the previously found $\lambda$), then $w^Th_0=-w_0$ and the function $g(x)=w^Tx+w_0$ can be rewritten as $$g(x)=w^T(x-h_0)$$ which is exactly the scalar product $\langle w,\ (x-h_0)\rangle$. If $x-h_0$ is decomposed into parts orthogonal and parallel to $w$, $\ x-h_0=u+\mu w$ with $u\perp w$, then $|\mu w|=|\mu|\cdot|w|$ measures the distance of $x$ from $H$. Multiply by $w^T$ from the left to find $\mu$: $$g(x)=w^T(x-h_0)=w^Tu+\mu w^Tw=\mu|w|^2$$ so $\mu=\displaystyle\frac{g(x)}{|w|^2}$, and the distance is $|\mu|\cdot|w|=\displaystyle\frac{|g(x)|}{|w|}=|r|$. The sign of $r=g(x)/|w|$ is the sign of $\mu$, so it tells on which side of $H$ (in the direction of $w$ or against it) the point $x$ lies.
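If it helps to see the formulas in action, here is a minimal numerical sketch in plain Python. The vector `w`, the threshold `w0`, the test point `x`, and the helper names (`dot`, `norm`, `signed_distance`, `project_onto_hyperplane`) are all my own illustrative choices, not anything from the book. It projects `x` onto the hyperplane by removing the component along $w$ and compares the resulting distance with $|g(x)|/|w|$.

```python
import math

def dot(a, b):
    """Scalar product <a, b> = a^T b."""
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    """Euclidean length |a| = sqrt(<a, a>)."""
    return math.sqrt(dot(a, a))

def g(x, w, w0):
    """Linear discriminant g(x) = w^T x + w0."""
    return dot(w, x) + w0

def signed_distance(x, w, w0):
    """r = g(x) / |w|; its sign says on which side of the hyperplane x lies."""
    return g(x, w, w0) / norm(w)

def project_onto_hyperplane(x, w, w0):
    """Foot of the perpendicular from x: subtract the component along w."""
    coeff = g(x, w, w0) / dot(w, w)  # g(x) / |w|^2
    return [xi - coeff * wi for xi, wi in zip(x, w)]

# Illustrative values (assumptions for this example only).
w, w0 = [3.0, 4.0], -10.0
x = [3.0, 4.0]
origin = [0.0, 0.0]

r = signed_distance(x, w, w0)
foot = project_onto_hyperplane(x, w, w0)
gap = norm([xi - fi for xi, fi in zip(x, foot)])  # geometric distance from x to H

print("g(foot)            =", g(foot, w, w0))      # approximately 0
print("|x - foot|         =", gap)                 # agrees with |r|
print("|r|                =", abs(r))
print("distance of origin =", abs(signed_distance(origin, w, w0)))
```

Note that the distance of the origin is just the special case $x=0$, where $g(0)=w_0$, which gives $|w_0|/|w|$ again.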