Multiclass Classification: Why do we exponentiate the softmax function?


In the context of neural networks, we use the softmax output in multiclass classification models. First, recall that in the binary case $P(y) = \sigma\left((2y-1)z\right)$, which comes from the definition of sigmoid units.

We define $\mathbf z=\mathbf W^\intercal \mathbf h+\mathbf b$, a linear layer predicting unnormalized log probabilities, with $z_i = \log \tilde P(y=i\mid x)$. The softmax of $z_i$ would then be

$$\text{softmax}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_j\exp(z_j)}=\frac{e^{\log \tilde P(y=i\mid x)}}{\sum_j e^{\log \tilde P(y=j\mid x)}}$$

I understand the general idea behind the softmax output unit. For example, in a feedforward neural network, our data $\mathbb{X}$ would be transformed through the hidden layers into $\mathbf h$, and a final linear layer would compute $\mathbf z=\mathbf W^\intercal \mathbf h+\mathbf b$, which is then passed to the output softmax layer.

This output layer would have $k$ units, one for each class, and would compute the unnormalized log probability $z_i$ for each of them, to be evaluated in the softmax function.
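To make the setup concrete, here is a minimal NumPy sketch (the logits are made-up values standing in for the output of a $k=3$ unit linear layer):

```python
import numpy as np

def softmax(z):
    """Naive softmax: exponentiate, then normalize."""
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits from a k = 3 unit output layer.
z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)        # three positive values
print(p.sum())  # sums to 1
```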

I know

  • $a)$ $\exp(z_i) > 0$ for every $z_i$. In other words, the numerator of the softmax function is always positive, and the denominator is a sum of positive terms, so after normalization the outputs are valid probabilities (they add up to one).
  • $b)$ If $z_i > z_j$, then $\operatorname{softmax}(\mathbf z)_i > \operatorname{softmax}(\mathbf z)_j$. Therefore, if class $i$ has a greater dichotomized probability ($i$ versus all the rest of the classes) than class $j$, the softmax preserves this ordering.

What I don't understand is why we need to exponentiate at all. Both of the previous conditions (normalization, preserved order) would seem to hold if the softmax function were simply $\frac{z_i}{\sum_j z_j}$. Now, my intuition tells me it has something to do with the fact that the cost function of the softmax is, as usual, a negative log probability. Perhaps the exponential makes calculations in the cost function easier?
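(For what it's worth, one can check numerically that the linear alternative already breaks down once components are negative — the values the question uses here are made up for illustration:)

```python
import numpy as np

z = np.array([1.0, -2.0, 0.5])

# Linear "normalization": components can be negative, so the results
# are not valid probabilities (and the denominator can even be 0).
linear = z / z.sum()
print(linear)  # contains values outside [0, 1]

# Exponentiating first guarantees positivity.
soft = np.exp(z) / np.exp(z).sum()
print(soft)    # all in (0, 1), sums to 1
```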

Thanks in advance.


There are 2 answers below.

BEST ANSWER

The softmax activation function has the nice property that it is translation invariant. The only thing that matters is the distances between the components in $\mathbf z$, not their particular values. For example, $\operatorname{softmax}(1,2)=\operatorname{softmax}(-1,0)$.

However, the softmax activation function is not scale invariant. This means you can't normalize $\mathbf z$ by scaling it. For example, $\operatorname{softmax}(1,2)\ne\operatorname{softmax}\left(\frac13,\frac23\right)$.
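Both invariance claims are easy to verify numerically; a quick NumPy sketch (using the answer's own example vectors):

```python
import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

a = softmax(np.array([1.0, 2.0]))
b = softmax(np.array([-1.0, 0.0]))  # same vector shifted by -2
c = softmax(np.array([1/3, 2/3]))   # same vector scaled by 1/3

print(np.allclose(a, b))  # True: translation invariant
print(np.allclose(a, c))  # False: not scale invariant
```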

In contrast, $z_i/\sum_j z_j$ is scale invariant but not translation invariant. As such, using $z_i/\sum_j z_j$ would force everything leading up to $\mathbf z$ to produce values with greatly differing magnitudes. In the case of a linear layer, this would likely cause your parameters to go crazy, meaning you won't be able to train a reasonable classifier in this manner.

As far as computing the softmax activation function goes, the usual approach is to subtract an amount from every component of $\mathbf z$, typically $\max(\mathbf z)$. This ensures that both overflow and underflow (where possible) are avoided during the calculation.
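A small sketch of why the max-subtraction trick matters (the large logits here are deliberately extreme, purely to trigger overflow):

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])

# Naive: exp(1000) overflows to inf, and inf/inf gives nan.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(z) / np.exp(z).sum()
print(naive)   # [nan nan nan]

# Stable: subtract max(z) first; softmax is unchanged by translation.
shifted = z - z.max()
stable = np.exp(shifted) / np.exp(shifted).sum()
print(stable)  # finite values summing to 1
```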

Furthermore, one can simplify the softmax activation function as follows:

$$\operatorname{softmax}(\mathbf z)_i=\exp\left[z_i-\ln\left(\sum_je^{z_j}\right)\right]$$

The logarithm term is usually implemented as a "log-sum-exp" function, and it is the same constant for every component $z_i$. Subtracting this amount from every component of $\mathbf z$ (call the result $\hat{\mathbf z}$), the softmax activation function simplifies even further:

$$\operatorname{softmax}(\mathbf z)_i=\operatorname{softmax}(\hat{\mathbf z})_i=\exp(\hat z_i)$$

Finally, it is worth noting two more points. $\operatorname{softmax}(\mathbf z)_i$ may actually underflow if a component is very small (i.e. you get something like $10^{-400}$, which rounds to $0$). Additionally, when computing the gradient of your model, the logarithm of the softmax activation function must be computed. Combining these two facts, you may run into a $\ln(0)$ case which should never occur mathematically. The true value of the logarithm of the softmax activation function is actually quite simple, based on the previous formula:

$$\ln(\operatorname{softmax}(\mathbf z)_i)=\hat z_i=z_i-\ln\left(\sum_je^{z_j}\right)$$

This formula is exact and has no issues with floating point precision. As such, the log-softmax activation function is typically used for the actual calculations, and it is only when results need to be interpreted as probabilities that it should be exponentiated. And since the exponential is monotonic, the largest component of the softmax activation function corresponds to the largest component of the log-softmax activation function.
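The underflow problem and the exactness of the log-softmax formula can both be demonstrated with a short NumPy sketch (the extreme logit $-1000$ is chosen just to force underflow):

```python
import numpy as np

def log_softmax(z):
    zmax = z.max()
    # log-sum-exp computed stably via the max-subtraction trick
    lse = zmax + np.log(np.exp(z - zmax).sum())
    return z - lse

z = np.array([0.0, -1000.0])

# Direct route underflows: softmax(z)[1] rounds to 0, so its log is -inf.
with np.errstate(divide='ignore'):
    direct = np.log(np.exp(z) / np.exp(z).sum())
print(direct)          # second entry is -inf

print(log_softmax(z))  # approximately [0, -1000]: exact and finite
```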

SECOND ANSWER

You can also justify the use of softmax by connecting it with multinomial logistic regression.

Indeed, assume you have $q+1$ classes, $k=0,\dots,q$ (the number of outputs of the final fully connected layer), and $p$ features (these can be the entries feeding your final fully connected layer). Performing a multinomial logistic regression then means doing a linear regression on the log odds (which range from $-\infty$ to $+\infty$), using category $0$ as the reference category:

$$ \begin{aligned} \ln \frac{p_{1, i}}{p_{0, i}} &=\beta_{1,0}+\beta_{1,1} x_{i 1}+\ldots+\beta_{1, p} x_{i p}=\mathbf{x}_{i} \boldsymbol{\beta}_{1} \\ & \vdots \\ \ln \frac{p_{q, i}}{p_{0, i}} &=\beta_{q, 0}+\beta_{q, 1} x_{i 1}+\ldots+\beta_{q, p} x_{i p}=\mathbf{x}_{i} \boldsymbol{\beta}_{q} \end{aligned} $$

Rewriting gives the probabilities $$ \begin{aligned} p_{0, i} &=\frac{1}{1+\sum_{k=1}^{q} e^{\mathbf{x}_{i} \boldsymbol{\beta}_{k}}} \\ p_{1, i} &=\frac{e^{\mathbf{x}_{i} \boldsymbol{\beta}_{1}}}{1+\sum_{k=1}^{q} e^{\mathbf{x}_{i} \boldsymbol{\beta}_{k}}} \\ \vdots & \\ p_{q, i} &=\frac{e^{\mathbf{x}_{i} \boldsymbol{\beta}_{q}}}{1+\sum_{k=1}^{q} e^{\mathbf{x}_{i} \boldsymbol{\beta}_{k}}} \end{aligned} $$ And here is your softmax function!

To connect this with your neural network: each vector $\boldsymbol{\beta_k}$ corresponds to the weights of one neuron in your final fully connected layer.
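The equivalence can be checked numerically: the probabilities from the log-odds formulas above coincide with a softmax over logits in which the reference class's logit is fixed at $0$. A sketch with randomly generated weights and features (all values here are arbitrary, just to exercise the formulas):

```python
import numpy as np

rng = np.random.default_rng(0)
q, p = 3, 4                   # q non-reference classes, p features
B = rng.normal(size=(q, p))   # beta_1 ... beta_q (beta_0 = 0 implicitly)
x = rng.normal(size=p)

# Probabilities from the multinomial-logistic formulas
eta = B @ x                   # x_i beta_k for k = 1..q
denom = 1.0 + np.exp(eta).sum()
probs_logit = np.concatenate(([1.0], np.exp(eta))) / denom

# Same thing as a softmax with the reference logit fixed at 0
logits = np.concatenate(([0.0], eta))
probs_softmax = np.exp(logits) / np.exp(logits).sum()

print(np.allclose(probs_logit, probs_softmax))  # True
```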