Underflow while evaluating softmax, despite using exp-normalize

I have a very simple linear classifier that uses softmax (with exp-normalize trick) for the output:

h = self.A.dot(x)     # hidden layer
z = self.B.dot(h)     # logits
nz = z - np.max(z)    # exp-normalize shift
e = np.exp(nz)
s = np.sum(e)
p = e / s             # softmax probabilities
loss = -np.log(p[k])  # cross-entropy loss for target class k

I understand that softmax is given by: $\frac{e^{z_i}}{\sum_{j=0}^{n} e^{z_j}}$

For numerical stability, the exp-normalize trick is used:

$\frac{e^{z_i - \max(z)}}{\sum_{j=0}^{n} e^{z_j - \max(z)}}$

I believe that I understand the motivation for this, and how it eliminates underflow/overflow for the exponentiation.
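For concreteness, here is a minimal self-contained sketch of the trick as described (the logit values are made up for illustration):

```python
import numpy as np

def softmax(z):
    # exp-normalize: shift inside the exponent so the largest term is exp(0) = 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])
p = softmax(z)   # naive np.exp(z) would overflow to inf here
```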

The problem is that in some cases some elements of the numerator are extremely small. Dividing such an element by a denominator greater than 1.0 triggers underflow, and I am not sure how to handle this.

For example, this is the specific case I am debugging (I've truncated most of the numbers to two decimal places for readability):

z  = [10869.44, 10837.85, 10851.28, 10136.48]
nz = [0., -31.58, -18.15, -732.95]
e  = [1.0, 1.91e-014, 1.30e-008, 4.80e-319]
s  = 1.000000013039

Dividing the last element of e (4.80e-319) by s results in underflow.
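One way to see what is going on: a value like 4.80e-319 is already below the smallest *normal* double, so it is stored as a subnormal with reduced precision, and further operations on it can raise underflow warnings. A small sketch (IEEE float64 assumed):

```python
import numpy as np

tiny = np.exp(-732.95)                       # ~4.8e-319
smallest_normal = np.finfo(np.float64).tiny  # ~2.2e-308

# tiny is nonzero but *subnormal* (below the smallest normal double),
# so it carries fewer significant bits, and dividing it by s > 1
# pushes it further into the subnormal range.
is_subnormal = 0.0 < tiny < smallest_normal  # True
```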

I've looked at many questions on SE and lots of blogs that talk about the exp-normalize trick, but none of them seem to address how to achieve numerical stability during this division.

1 Answer


The goal is to compute the target value $T$ given by $$ T = \log\left( \frac{\exp(z_i)}{\sum_{j=1}^n \exp(z_j)} \right)$$ without suffering a catastrophic floating-point failure. This can be accomplished as follows. For any value $Z$, we have \begin{align} T &= z_i - \log\left( \sum_{j=1}^n \exp(z_j) \right) \\&= z_i - \log \left( \exp(Z) \sum_{j=1}^n \exp(z_j-Z)\right) \\&= z_i - Z - \log\left(\sum_{j=1}^n \exp(z_j-Z)\right) \end{align}

The choice $Z = \max_j z_j$ ensures that every term $\exp(z_j-Z)$ is bounded by 1. Individual terms can certainly underflow to zero, but this is almost always irrelevant, because $$S = \sum_{j=1}^n \exp(z_j-Z) \ge 1.$$

There are at least two corner cases where this approach can still fail. If all $z_j \approx Z$ and $n$ is large enough to overflow the floating-point range, then the sum $S \approx n$ is likely to overflow as well. And if the underflow threshold is large (as in quarter or half precision) and most of the $z_j$ are such that $\exp(z_j - Z)$ underflows, the accumulated underflows can cause a large error. If low precision must be used, more aggressive steps are needed, but I think this is another question.
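The derivation above translates directly into code. A minimal sketch of the resulting stable log-softmax, applied to the logits from the question (the function name is my own):

```python
import numpy as np

def log_softmax(z, i):
    # T = z_i - Z - log(sum_j exp(z_j - Z)) with Z = max(z):
    # every exponent is <= 0, and the sum S is >= 1, so log(S) is safe
    z = np.asarray(z, dtype=np.float64)
    Z = np.max(z)
    return z[i] - Z - np.log(np.sum(np.exp(z - Z)))

z = [10869.44, 10837.85, 10851.28, 10136.48]
loss = -log_softmax(z, 3)   # finite, ~732.96; p[3] is never formed explicitly
```

The key point is that the loss is computed as a difference of logits rather than as the log of a (possibly subnormal) probability.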