Two questions about the derivative of the softmax function


I have some problems with the derivative of softmax:

$$y_k = \frac{e^{a_k}}{\sum_{i=0}^K e^{a_i}}$$

The first thing I want to know is why $\frac{\partial (\sum_{i=0}^K e^{a_i}) }{\partial a_k} = e^{a_k}$. Why does the index of $e^{a}$ change?

The second question is why the derivative has two answers. I know how to get the first answer, but the second is a little confusing to me.

I would appreciate it if you could point me to a lecture, or to some property that I am missing in my lectures.

Thanks.


There is 1 best solution below


Note that the softmax function takes a vector and produces a vector of equal size. Therefore its "derivative" will be a Jacobian matrix containing its partial derivatives. If the vector softmax operates on has $n$ elements, then the Jacobian will be of size $n \times n$ and contain $n^2$ partial derivatives.

The easier way (I think) to understand what happens is to work on vectors of size two and generalize from that. So let softmax be

$$ S([x, y]) = [S_x(x), S_y(y)]= \left[\frac{e^x}{e^x+e^y}, \frac{e^y}{e^x+e^y}\right]. $$

The Jacobian for $S$ will contain 4 partial derivatives arranged in the following fashion:

$$ JS([x,y]) = \begin{bmatrix} \frac{\partial S_x}{\partial x} & \frac{\partial S_x}{\partial y}\\ \frac{\partial S_y}{\partial x} & \frac{\partial S_y}{\partial y} \end{bmatrix}. $$

Calculating gives

$$ \frac{\partial S_x}{\partial x} = \frac{\partial}{\partial x}\frac{e^x}{e^x+e^y} = \frac{e^x(e^x+e^y) - e^{2x}}{(e^x+e^y)^2} = \frac{e^x}{e^x+e^y}\frac{e^y}{e^x+e^y} = S_x(x)S_y(y). $$

Note that $S_y(y) = 1 - S_x(x)$, so it is more general to write the derivative as $S_x(x)(1 - S_x(x))$, because that form also works for vectors with more than two components. We calculate another derivative:

$$ \frac{\partial S_x}{\partial y} = \frac{\partial}{\partial y}\frac{e^x}{e^x+e^y} = \frac{-e^xe^y}{(e^x+e^y)^2} = -S_x(x)S_y(y). $$

As you can imagine, the remaining partial derivatives follow by symmetry, so we can fill in the full Jacobian:

$$ JS([x,y]) = \begin{bmatrix} S_x(x)(1-S_x(x)) & -S_x(x)S_y(y)\\ -S_y(y)S_x(x) & S_y(y)(1-S_y(y)) \end{bmatrix}. $$

There are two different "types" of elements depending on whether they are on the diagonal or not, which is presumably why your formula has two answers.

For your first question, just note that

$$ \frac{\partial}{\partial x}(e^x + e^y) = e^x. $$

Only the term that contains the variable you differentiate with respect to survives; every other term in the sum is a constant.
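If it helps to check the two cases numerically, here is a minimal NumPy sketch. It assumes the usual generalization of the entries above to $n$ components, $\frac{\partial y_k}{\partial a_j} = y_k(\delta_{kj} - y_j)$, where $\delta_{kj}$ is 1 on the diagonal and 0 elsewhere. The function names `softmax`, `softmax_jacobian` and `numerical_jacobian` are my own choices for illustration, not anything from a library.

```python
import numpy as np

def softmax(a):
    """Softmax of a 1-D array (shifted by the max for numerical stability)."""
    z = np.exp(a - np.max(a))
    return z / z.sum()

def softmax_jacobian(a):
    """Analytic Jacobian, assuming J[k, j] = y_k * (delta_kj - y_j)."""
    y = softmax(a)
    # diag(y) supplies the y_k term on the diagonal; outer(y, y) is y_k * y_j.
    return np.diag(y) - np.outer(y, y)

def numerical_jacobian(f, a, eps=1e-6):
    """Central finite-difference estimate of J[k, j] = d f_k / d a_j."""
    a = np.asarray(a, dtype=float)
    n = a.size
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (f(a + step) - f(a - step)) / (2 * eps)
    return J

# Two-component case from the answer: [x, y] = [0.5, -1.0] (arbitrary values).
a = np.array([0.5, -1.0])
y = softmax(a)
J = softmax_jacobian(a)

print("diagonal entry matches  S_x * S_y:", np.isclose(J[0, 0], y[0] * y[1]))
print("off-diagonal matches   -S_x * S_y:", np.isclose(J[0, 1], -y[0] * y[1]))
print("analytic matches finite differences:",
      np.allclose(J, numerical_jacobian(softmax, a), atol=1e-7))
```

The same two functions work unchanged for longer vectors, for example `softmax_jacobian(np.array([1.0, 2.0, 3.0]))`, which gives a $3 \times 3$ matrix with the same diagonal and off-diagonal pattern.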