Linear Classification - Decision Boundary


Currently, I'm trying to adapt my linear classification implementation to be able to classify multiple classes. The objective is to minimize $$ \arg\min\limits_{\mathbf{\Theta}} \frac{1}{2}\|\mathbf{X}\mathbf{\Theta} - \mathbf{Y}\|_2^2 + \frac{\lambda}{2}\|\mathbf{\Theta}\|_2^2 $$ where $\mathbf{X} = [\mathbf{x}_1,\ldots,\mathbf{x}_n]^T\in\mathbb{R}^{n \times d}$ is the feature matrix, $\mathbf{\Theta} = [\mathbf{\theta}_1,\ldots,\mathbf{\theta}_k]\in\mathbb{R}^{d\times k}$ is the matrix of parameter vectors that we want to estimate, and $\mathbf{Y} = [\mathbf{y}_1,\ldots,\mathbf{y}_n]^T\in\{0,1\}^{n \times k}$ is the matrix of one-hot encoded class labels. Note: $n$ is the number of examples, $d$ is the feature space dimension, i.e. $\mathbf{x}_i,\mathbf{\theta}_j\in\mathbb{R}^d$, and $k$ is the number of classes.

So far so good. We can minimize this objective either via the closed-form solution or via iterative methods such as (stochastic) gradient descent, all of which work just fine.
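For reference, a minimal sketch of the closed-form solution $\mathbf{\Theta} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{Y}$ in NumPy (the data and the value of $\lambda$ below are made up purely for illustration):

```python
import numpy as np

def fit_ridge(X, Y, lam):
    """Closed-form ridge solution: Theta = (X^T X + lam*I)^{-1} X^T Y."""
    d = X.shape[1]
    # Solve the regularized normal equations instead of forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Tiny illustrative example: n=4 points in d=2, k=2 one-hot labels.
X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
Theta = fit_ridge(X, Y, lam=0.1)
print(Theta.shape)  # (d, k) = (2, 2)
```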

Suppose the class labels are $\{1,\ldots,k\}$, then the classification of an example $\mathbf{x}$ is done by choosing the corresponding class to the largest value in $\mathbf{\Theta}^T \mathbf{x}$, i.e. $\arg\max\limits_i [\mathbf{\Theta}^T \mathbf{x}]_i$
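As a quick illustration, the $\arg\max$ prediction rule above can be sketched as follows (the parameter values here are hypothetical, not from a real fit):

```python
import numpy as np

# Hypothetical trained parameters for d=2 features and k=3 classes.
Theta = np.array([[ 1.0, -0.5, 0.2],
                  [-0.3,  0.8, 0.1]])  # shape (d, k)

def predict(Theta, x):
    """Return the class index with the largest score in Theta^T x."""
    return int(np.argmax(Theta.T @ x))

x = np.array([0.2, 1.0])
print(predict(Theta, x))  # → 1 (scores are [-0.1, 0.7, 0.14])
```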

Edit

In order to understand my problem, consider the following example:
In the binary classification case, i.e. $k=2$, where the matrix $\mathbf{Y}$ could be written as a vector $\mathbf{y}\in\{0,1\}^n$ and $\mathbf{\Theta}$ could be written as a vector $\mathbf{\theta}\in\mathbb{R}^d$ as well, we have the objective $$ \arg\min\limits_{\mathbf{\theta}} \frac{1}{2}\|\mathbf{X}\mathbf{\theta} - \mathbf{y}\|_2^2 + \frac{\lambda}{2}\|\mathbf{\theta}\|_2^2 $$ and we can compute the decision boundary as:
$$ \mathbf{\theta}^T\mathbf{x} = 0, \quad \mathbf{x}\in\mathbb{R}^d $$ Therefore, with $d=2$ the decision boundary is $$ \mathbf{\theta}^T\mathbf{x} = 0 \quad \Leftrightarrow \quad x_2 = -\frac{\theta_1}{\theta_2}x_1 $$ If we use the feature map $\mathbf{\tilde{x}} = \phi(\mathbf{x}) = [1, \mathbf{x}^T]^T$ and $\mathbf{\theta}=[\theta_0, \theta_1, \theta_2]^T\in\mathbb{R}^3$, we are able to separate the classes with an affine decision boundary, i.e. $x_2 = ax_1 + b$ instead of $x_2 = ax_1$: $$ \mathbf{\theta}^T\mathbf{\tilde{x}} = \theta_0 + \theta_1x_1 + \theta_2x_2 = 0 \quad \Leftrightarrow \quad x_2 = -\frac{\theta_0 + \theta_1x_1}{\theta_2} $$ This works just fine.
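For instance, the affine boundary line can be evaluated directly from a fitted $\mathbf{\theta} = [\theta_0, \theta_1, \theta_2]^T$ (the values below are hypothetical):

```python
import numpy as np

# Hypothetical fitted parameters [theta_0, theta_1, theta_2] with a bias feature.
theta = np.array([0.5, -1.0, 2.0])

def boundary_x2(theta, x1):
    """Solve theta_0 + theta_1*x1 + theta_2*x2 = 0 for x2."""
    return -(theta[0] + theta[1] * x1) / theta[2]

print(boundary_x2(theta, 1.0))  # → 0.25, i.e. -(0.5 - 1.0) / 2.0
```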

Here is the problem I have:
When I'm using one-hot encoding with $k=2$, i.e. $\mathbf{Y}\in\{0,1\}^{n \times 2}$, and the feature map $\mathbf{\tilde{x}} = \phi(\mathbf{x}) = [1, \mathbf{x}^T]^T\in\mathbb{R}^3$, my objective function is the one at the top of this post.
Then I have two boundaries: $$ \mathbf{\theta}_1^T\mathbf{\tilde{x}} = 0 \quad \mathbf{\theta}_2^T\mathbf{\tilde{x}} = 0 $$ since the vector $\mathbf{\theta}$ becomes a matrix $\mathbf{\Theta} = [\mathbf{\theta}_1, \mathbf{\theta}_2]\in\mathbb{R}^{3 \times 2}$. My understanding is that both should lie on top of each other. But they don't, as you can see in this example plot: calculated decision boundary

Here is a visualization of how the classifier labels points drawn uniformly at random, colored according to their predicted class: visualization of classifications. This looks better, but still not quite right, I think.

So my question is:
How can I compute the decision boundary with one-hot encoding? What am I doing wrong?

I'd appreciate any help/hint! (:

Additional Edit: When I use the encoding $\mathbf{y} \in \{-1,1\}^n$ instead, I can easily calculate the decision boundary as specified above. Here you can see the same plots as before with that encoding instead of one-hot encoding: calculated decision boundary and visualization of classifications.

So, what's wrong with one-hot-encoding?