Extension of binary classification to multi-class classification


Multi-class classification is a generalization of binary classification, the setting handled by logistic regression. In the binary case, an input should be mapped to either $0$ or $1$. Logistic regression therefore converts the output of a neural network, $\hat{y}$, into a probability of the positive class using the sigmoid function $$\sigma(\hat{y})=\frac{e^{\hat{y}}}{1+ e^{\hat{y}}}\tag{1}$$ where $\hat{y} \in \mathbb{R}$ is the output of the network.

On the other hand, multi-class classification decides among $n$ classes using the softmax function, defined componentwise as

$$\text{softmax}(\hat{\textbf{y}})_i=\frac{e^{\hat{\textbf{y}}_i}}{\sum_{j=1}^{n}e^{\hat{\textbf{y}}_j}}\tag{2}$$ where $\hat{\textbf{y}} \in \mathbb{R}^n$ is the output of the network.
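To make the two definitions concrete, here is a minimal numerical sketch of $(1)$ and $(2)$ (the NumPy implementation and the helper names `sigmoid`/`softmax` are just for illustration):

```python
import numpy as np

def sigmoid(y_hat):
    # Equation (1): maps a single real number to a value in (0, 1).
    return np.exp(y_hat) / (1.0 + np.exp(y_hat))

def softmax(y_hat):
    # Equation (2): maps a vector in R^n to a probability vector that sums to 1.
    exp_y = np.exp(y_hat)
    return exp_y / exp_y.sum()

print(sigmoid(1.3))                        # ~0.786
print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659, 0.242, 0.099]
```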

Question: How can one algebraically manipulate $(1)$ to obtain $(2)$, or vice versa? Starting from $(1)$, how does one get rid of the $1$ in the denominator? Conversely, starting from $(2)$ with $n=2$, how does the $1$ in the denominator arise?

Accepted answer:

softmax is a function from $\mathbb{R}^n \to \mathbb{R}^n$ and gives "probabilities" for each of $n$ alternatives.

$\sigma$ is a function from $\mathbb{R} \to \mathbb{R}$ and gives "probability" for one alternative. But of course, it is really choosing between two alternatives. It's just that, if there are only two alternatives, giving only one value is sufficient because probabilities must sum to $1$. I.e., if $\sigma(y)$ is the probability for one choice, then the other choice must have probability $1 - \sigma(y) = \sigma(-y)$.
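The identity $1-\sigma(y)=\sigma(-y)$ used here follows directly from $(1)$:

$$1-\sigma(y)=1-\frac{e^{y}}{1+e^{y}}=\frac{1}{1+e^{y}}=\frac{e^{-y}}{e^{-y}+1}=\sigma(-y).$$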

Once you see that, here is the "equivalence" (square brackets $[~]$ denote an array or vector):

$$\begin{aligned} \text{softmax}([\,y/2,\ -y/2\,]) &= \left[\frac{e^{y/2}}{e^{y/2}+e^{-y/2}},\ \frac{e^{-y/2}}{e^{y/2}+e^{-y/2}}\right] \\ &= \left[\frac{e^{y}}{e^{y}+1},\ \frac{e^{-y}}{e^{-y}+1}\right] \\ &= [\,\sigma(y),\ \sigma(-y)\,] \end{aligned}$$
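Incidentally, the split into $[y/2, -y/2]$ is just one choice: softmax does not change when the same constant is subtracted from every entry, so starting from $[y, 0]$ instead produces the $1$ in the denominator of $(1)$ directly:

$$\text{softmax}([\,y,\ 0\,])=\left[\frac{e^{y}}{e^{y}+1},\ \frac{1}{e^{y}+1}\right]=[\,\sigma(y),\ 1-\sigma(y)\,].$$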

So for $n=2$, softmax really is the same as $\sigma$, except with $y$ rescaled.
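A quick numerical check of this equivalence (a sketch; the `sigmoid`/`softmax` helpers are the illustrative ones defined above):

```python
import numpy as np

def sigmoid(y_hat):
    return np.exp(y_hat) / (1.0 + np.exp(y_hat))

def softmax(y_hat):
    exp_y = np.exp(y_hat)
    return exp_y / exp_y.sum()

y = 1.7
lhs = softmax(np.array([y / 2, -y / 2]))   # softmax([y/2, -y/2])
rhs = np.array([sigmoid(y), sigmoid(-y)])  # [sigma(y), sigma(-y)]
print(lhs, rhs)                            # both ~[0.8455, 0.1545]
print(np.allclose(lhs, rhs))               # True
```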