Proving that softmax converges to argmax as we scale x


For a vector $x \in \mathbb{R}^d$ and a scalar $c$, the softmax function $S:\mathbb{R}^d\times \mathbb{R}\rightarrow \mathbb{R}^d$ is defined as

$$ S(x;c)_i = \frac{e^{c\cdot x_i}}{\sum_{k=1}^{d} e^{c\cdot x_k}} $$

Consider what happens as we increase the scaling constant $c$.

Now since $e^t$ is an increasing and unbounded function, as $c$ grows, $S(x;c)$ will put more and more weight on the largest coordinate. As $c \rightarrow \infty$, $S(x;c)$ should output a one-hot vector with a 1 at the position of the maximum element (assuming the maximum is unique). This is my intuition, but how do I prove it?
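
Here is a quick numerical sanity check of this intuition (just a sketch in NumPy, with an arbitrary example vector):

```python
import numpy as np

def softmax(x, c=1.0):
    # Subtract the max before exponentiating; this leaves the value
    # unchanged and avoids overflow for large c.
    z = c * (x - np.max(x))
    e = np.exp(z)
    return e / e.sum()

x = np.array([1.0, 3.0, 2.0, 0.5])
for c in [1, 10, 100]:
    print(c, softmax(x, c))
# As c grows, the output approaches [0, 1, 0, 0],
# i.e. the one-hot vector at argmax(x).
```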

Best Answer

Write $n$ for the dimension of $x$, so $x \in \mathbb{R}^n$. Unscaled, the softmax is

$$ S(x)_i = \frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}} $$

Scaling the input by the constant $c$ gives $$ S(x;c)_i = \frac{e^{c\,x_i}}{\sum_{k=1}^{n} e^{c\,x_k}} $$

Let $\hat x = \max_i x_i$, and assume the maximum is attained at exactly one index. Multiply $S(x;c)_i$ by $\frac{e^{-\hat x c}}{e^{-\hat x c}} = 1$:

$$\lim_{c \to \infty} S(x;c)_i = \lim_{c \to \infty} \frac{e^{-c(\hat x - x_i)}}{\sum_{k=1}^{n} e^{-c(\hat x - x_k)}}$$
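
As a sanity check that this manipulation changes nothing, the rewritten expression agrees with the original softmax numerically (a small NumPy sketch with an arbitrary test vector):

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 0.5])  # arbitrary test vector
c = 7.0
xhat = np.max(x)

direct = np.exp(c * x) / np.exp(c * x).sum()
rewritten = np.exp(-c * (xhat - x)) / np.exp(-c * (xhat - x)).sum()

# Multiplying top and bottom by e^{-c*xhat} does not change the value.
assert np.allclose(direct, rewritten)
```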

Notice that $\Delta_i = \hat x - x_i > 0$ if $x_i \neq \hat x$, and $\Delta_i = 0$ if $x_i = \hat x$, which by uniqueness of the maximum holds for exactly one index. Splitting the denominator into the off-maximum terms and the single term with $x_k = \hat x$ (which contributes $e^0 = 1$),

$$\implies \lim_{c \to \infty} S(x;c)_i = \begin{cases} \lim_{c \to \infty} \frac{e^{-c\Delta_i}}{\left(\sum_{k:\,x_k \neq \hat x} e^{-c\Delta_k}\right) + 1}, & \text{if } x_i \neq \hat x\\ \lim_{c \to \infty} \frac{1}{\left(\sum_{k:\,x_k \neq \hat x} e^{-c\Delta_k}\right) + 1}, & \text{if } x_i = \hat x \end{cases} $$

$$\implies S(x;c)_i \to \begin{cases} \frac{0}{1} = 0, & \text{if } x_i \neq \hat x\\ \frac{1}{1} = 1, & \text{if } x_i = \hat x \end{cases}$$

as $c \to \infty$.
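
Numerically, the case split is visible directly: the off-maximum terms $e^{-c\Delta_k}$ vanish while the $+1$ from the maximizer survives (again a NumPy sketch with an arbitrary vector):

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 0.5])
xhat = np.max(x)
delta = xhat - x                 # Delta_i >= 0, zero exactly at the argmax

for c in [1, 10, 100]:
    num = np.exp(-c * delta)               # -> 0 where Delta_i > 0, = 1 at the max
    denom = num[delta > 0].sum() + 1.0     # vanishing terms plus the surviving "+1"
    print(c, num / denom)                  # converges to the one-hot vector
```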

Therefore, $S(x;c)$ converges to the one-hot indicator of the argmax of $x$ as the scaling $c \to \infty$; in this sense, softmax $\to$ argmax.