Values in Softmax Derivative


I am trying to correctly understand the derivative of the softmax function so that I can implement it correctly. I already know that the derived formula looks like this:

$\frac{\partial p_i}{\partial a_j} = p_i(1-p_i)$ if $i=j$

and $-p_j p_i$ otherwise.

What I don't get is: What exactly are i and j and how do I get them from my input-vector?

Could anyone please explain this?

The Softmax function maps an $n$-dimensional ($n \ge 2$) vector of reals, $\mathbf{z}$, $$\bbox{ z_i \in \mathbb{R} , \quad i = 1 .. n }$$ to another $n$-dimensional real vector $\mathbf{p}$ with all components between $0$ and $1$, $$\bbox{ 0 \le p_i \le 1, \quad i = 1 .. n }$$ and components summing to $1$, $$\bbox{ \sum_{k=1}^n p_k = 1 }$$

The Softmax function itself is defined component-wise, $$\bbox{ p_i = \frac{e^{z_i}}{\sum_{k=1}^n e^{z_k}}, \quad i = 1 .. n }$$

Its derivative turns out to be simple. The partial derivative of the $i$'th component of $\mathbf{p}$, with respect to the $j$'th component of $\mathbf{z}$, is $$\bbox{ \frac{\partial p_i}{\partial z_j} = \begin{cases} p_i ( 1 - p_i ), & i = j \\ - p_i p_j, & i \ne j \\ \end{cases} }$$

In other words, $i$ and $j$ specify dimensions, and identify the components in the vectors. In numerical computation, they are simply indexes into the vectors.
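To make "indexes into the vectors" concrete, here is a small Python sketch (my illustration, not part of the original answer; the function name `softmax` and the example vector are mine):

```python
import math

def softmax(z):
    """Map a list of reals to a probability vector (components sum to 1)."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]
p = softmax(z)
# i and j in the derivative formula are nothing more than positions in
# these vectors: p[0] is p_1, p[1] is p_2, and so on (0-based in Python,
# 1-based in the math above).
```

There is no search involved in finding $i$ and $j$: they range over all positions, and each pair $(i, j)$ yields one partial derivative.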

It may be useful to look at the Jacobian matrix of the Softmax function on $\mathbf{z}$. Each row corresponds to a component of $\mathbf{p}$, and each column corresponds to the component of $\mathbf{z}$ the partial derivative is taken with respect to. In other words, $$\bbox{\mathbf{J}_{i j} = \frac{ \partial p_i }{\partial z_j}, \quad i, j = 1 .. n }$$ $$\bbox{\mathbf{J} = \left [ \begin{matrix} p_1 ( 1 - p_1 ) & - p_1 p_2 & \dots & - p_1 p_{n-1} & -p_1 p_n \\ -p_2 p_1 & p_2 ( 1 - p_2 ) & \dots & -p_2 p_{n-1} & -p_2 p_n \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ -p_{n-1} p_1 & -p_{n-1} p_2 & \dots & p_{n-1} ( 1 - p_{n-1} ) & -p_{n-1} p_n \\ -p_n p_1 & -p_n p_2 & \dots & -p_n p_{n-1} & p_n ( 1 - p_n ) \\ \end{matrix} \right ]}$$ Because the off-diagonal entries $-p_i p_j$ are unchanged when $i$ and $j$ are swapped, $\mathbf{J}$ is symmetric, $$\bbox{ \mathbf{J}_{i j} = \mathbf{J}_{j i}, \quad i , j = 1 .. n }$$
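The symmetry can be checked numerically with a short Python sketch (mine, not from the answer above). It also verifies a further property that follows from $\sum_k p_k = 1$: every row of $\mathbf{J}$ sums to zero, since $p_i(1-p_i) - \sum_{j \ne i} p_i p_j = p_i(1 - \sum_j p_j) = 0$.

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def jacobian(p):
    n = len(p)
    # J[i][j] = p_i (1 - p_i) on the diagonal, -p_i p_j off it
    return [[p[i] * (1.0 - p[i]) if i == j else -p[i] * p[j]
             for j in range(n)]
            for i in range(n)]

p = softmax([0.5, -1.0, 2.0])
J = jacobian(p)
# J[i][j] == J[j][i] (symmetry), and each row sums to 0
# because the components of p sum to 1.
```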


Programmers are often more comfortable with pseudocode examples:

Function Softmax(z[], n):
    Let  p[]  be an array of n reals
    Let  d = 0.0

    # Calculate exponents; unscaled components
    For i = 1 to n:
        p[i] = Exp(z[i])
        d = d + p[i]
    End For

    # Normalize components
    For i = 1 to n:
        p[i] = p[i] / d
    End For

    Return p
End Function
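A direct Python translation of that pseudocode (my sketch; the max subtraction is a common numerical-stability safeguard that is not in the pseudocode above, and it does not change the result because the shift cancels between numerator and denominator):

```python
import math

def softmax(z):
    # Shift by the maximum so exp() cannot overflow for large inputs;
    # the shift cancels in the ratio, so the result is unchanged.
    m = max(z)
    p = [math.exp(v - m) for v in z]   # unscaled components
    d = sum(p)
    return [v / d for v in p]          # normalize so the sum is 1
```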

We can calculate the entire Jacobian matrix of the Softmax at $\mathbf{z}$ using

Function Jacobian_matrix_of_Softmax(z[], n):
    Let p = Softmax(z, n)
    Let J[][] = n by n matrix of reals
    For i = 1 to n:
        J[i][i] = p[i] * (1.0 - p[i])
        For k = 1 to i-1:
            J[k][i] = -p[i]*p[k]
            J[i][k] = J[k][i] 
        End For
    End For
    Discard p
    Return J
End Function

or, using $\mathbf{p} = \text{Softmax}(\mathbf{z})$ if that is already calculated,

Function Jacobian_matrix_of_Softmaxed(p[], n):
    Let J[][] = n by n matrix of reals
    For i = 1 to n:
        J[i][i] = p[i] * (1.0 - p[i])
        For k = 1 to i-1:
            J[k][i] = -p[i]*p[k]
            J[i][k] = J[k][i] 
        End For
    End For
    Return J
End Function
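If NumPy is available, the whole symmetric Jacobian collapses into one expression (my sketch, equivalent to the loops above): the diagonal entries are $p_i - p_i^2$ and the off-diagonal entries are $-p_i p_j$, so the matrix is $\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^{\mathsf T}$.

```python
import numpy as np

def jacobian_of_softmaxed(p):
    """Jacobian of Softmax given p = Softmax(z), as diag(p) - p p^T."""
    p = np.asarray(p, dtype=float)
    # np.diag builds the diagonal p_i; np.outer builds the matrix p_i p_j
    return np.diag(p) - np.outer(p, p)
```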

Individual partial derivatives are calculated using

Function Partial_of_Softmaxed(p[], i, j):
    If i == j:
        Return p[i] * (1.0 - p[i])
    Else:
        Return -p[i] * p[j] 
    End If
End Function

but note that the matrix functions above exploit the symmetry: they calculate the diagonal values separately, and each off-diagonal value together with its mirror (to the left of, and directly above, each diagonal value) in an inner subloop, which roughly halves the work.

Whether calculating the full Jacobian matrix is useful or not depends on what you need the partial derivatives for. If you only need the diagonal entries $\partial p_i / \partial z_i$ (called the gradient below), use

Function Gradient_of_Softmaxed(p[], n):
    Let  g[] be an array of n reals
    For i = 1 to n:
        g[i] = p[i] * (1.0 - p[i])
    End For
    Return g
End Function

Function Gradient_of_Softmax(z[], n):
    Let g = Softmax(z, n)
    For i = 1 to n:
        g[i] = g[i] * (1.0 - g[i])
    End For
    Return g
End Function

instead of Jacobian_matrix_of_Softmaxed(). Note how the latter version, when you don't otherwise need $\text{Softmax}(\mathbf{z})$, reuses its storage for the gradient.
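In Python, that diagonal-only computation might look like this (my sketch, mirroring Gradient_of_Softmax above, including the storage reuse):

```python
import math

def gradient_of_softmax(z):
    # Compute Softmax(z) into g, then overwrite each component with
    # p_i * (1 - p_i), reusing the same storage for the gradient.
    g = [math.exp(v) for v in z]
    d = sum(g)
    for i in range(len(g)):
        g[i] /= d               # g[i] is now p_i
        g[i] *= 1.0 - g[i]      # g[i] is now p_i * (1 - p_i)
    return g
```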