Underflow while evaluating softmax, despite using exp-normalize

I have a very simple linear classifier that uses softmax (with exp-normalize trick) for the output:

h = self.A.dot(x)     # hidden layer
z = self.B.dot(h)     # logits
nz = z - np.max(z)    # exp-normalize shift
e = np.exp(nz)
s = np.sum(e)
p = e / s             # softmax probabilities
loss = -np.log(p[k])  # cross-entropy loss for target class k

I understand that softmax is given by: $\frac{e^{z_i}}{\sum_{j=0}^{n} e^{z_j}}$

For numerical stability, the exp-normalize trick is used:

$\frac{e^{z_i - \max(z)}}{\sum_{j=0}^{n} e^{z_j - \max(z)}}$

I believe that I understand the motivation for this, and how it eliminates underflow/overflow for the exponentiation.
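For concreteness, here is a minimal self-contained sketch of the trick as described (the logit values are made up for illustration):

```python
import numpy as np

def softmax(z):
    # exp-normalize: shift inside the exponent so the largest term is exp(0) = 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])
p = softmax(z)   # naive np.exp(z) would overflow to inf here
```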

The problem is that in some cases some elements of the numerator are extremely small. Dividing such an element by a denominator greater than 1.0 triggers underflow, and I am not sure how to handle this.

For example, this is the specific case I am debugging (I've truncated most of the numbers to two decimal places for readability):

z  = [10869.44, 10837.85, 10851.28, 10136.48]
nz = [0., -31.58, -18.15, -732.95]
e  = [1.0, 1.91e-014, 1.30e-008, 4.80e-319]
s  = 1.000000013039

Dividing the last element of e (4.80e-319) by s results in underflow.
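One way to see what is going on: a value like 4.80e-319 is already below the smallest *normal* double, so it is stored as a subnormal with reduced precision, and further operations on it can raise underflow warnings. A small sketch (IEEE float64 assumed):

```python
import numpy as np

tiny = np.exp(-732.95)                       # ~4.8e-319
smallest_normal = np.finfo(np.float64).tiny  # ~2.2e-308

# tiny is nonzero but *subnormal* (below the smallest normal double),
# so it carries fewer significant bits, and dividing it by s > 1
# pushes it further into the subnormal range.
is_subnormal = 0.0 < tiny < smallest_normal  # True
```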

I've looked at many questions on SE and lots of blogs that talk about the exp-normalize trick, but none of them seem to address how to achieve numerical stability during this division.

1 Answer


The goal is to compute the target value $T$ given by $$ T = \log\left( \frac{\exp(z_i)}{\sum_{j=1}^n \exp(z_j)} \right)$$ without suffering a catastrophic floating-point failure. This can be accomplished as follows. For any value $Z$, we have \begin{align} T &= z_i - \log\left( \sum_{j=1}^n \exp(z_j) \right) \\&= z_i - \log \left( \exp(Z) \sum_{j=1}^n \exp(z_j-Z)\right) \\&= z_i - Z - \log\left(\sum_{j=1}^n \exp(z_j-Z)\right) \end{align}

The choice $Z = \max_j z_j$ ensures that every term $\exp(z_j-Z)$ is bounded by 1. Individual terms can certainly underflow to zero, but this is almost always irrelevant, because $$S = \sum_{j=1}^n \exp(z_j-Z) \ge 1.$$

There are at least two corner cases where this approach can still fail. If all $z_j \approx Z$ and $n$ is large enough to overflow the floating-point range, then the sum $S \approx n$ is likely to overflow as well. And if the underflow threshold is large (as in quarter or half precision) and most of the $z_j$ are such that $\exp(z_j - Z)$ underflows, the accumulated underflows can cause a large error. If low precision must be used, more aggressive steps are needed, but I think this is another question.
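The derivation above translates directly into code. A minimal sketch of the resulting stable log-softmax, applied to the logits from the question (the function name is my own):

```python
import numpy as np

def log_softmax(z, i):
    # T = z_i - Z - log(sum_j exp(z_j - Z)) with Z = max(z):
    # every exponent is <= 0, and the sum S is >= 1, so log(S) is safe
    z = np.asarray(z, dtype=np.float64)
    Z = np.max(z)
    return z[i] - Z - np.log(np.sum(np.exp(z - Z)))

z = [10869.44, 10837.85, 10851.28, 10136.48]
loss = -log_softmax(z, 3)   # finite, ~732.96; p[3] is never formed explicitly
```

The key point is that the loss is computed as a difference of logits rather than as the log of a (possibly subnormal) probability.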