I am trying to correctly understand the derivative of the softmax function so that I can implement it correctly. I already know that the derived formula looks like this:
$\frac{\partial p_i}{\partial a_j} = p_i(1-p_j)$ if $i=j$,
and $-p_j p_i$ otherwise.
What I don't get is: what exactly are $i$ and $j$, and how do I get them from my input vector?
Could anyone please explain this?
The Softmax function maps an $n$-dimensional ($n \ge 2$) vector of reals, $\mathbf{z}$, $$\bbox{ z_i \in \mathbb{R} , \quad i = 1 .. n }$$ to another $n$-dimensional real vector $\mathbf{p}$ with all components between $0$ and $1$, $$\bbox{ 0 \le p_i \le 1, \quad i = 1 .. n }$$ and with the components summing to $1$, $$\bbox{ \sum_{k=1}^n p_k = 1 }$$ The Softmax function itself is defined component-wise, $$\bbox{ p_i = \frac{e^{z_i}}{\sum_{k=1}^n e^{z_k}}, \quad i = 1 .. n }$$ Its derivative turns out to be simple. The partial derivative of the $i$'th component of $\mathbf{p}$, with respect to the $j$'th component of $\mathbf{z}$, is $$\bbox{ \frac{\partial p_i}{\partial z_j} = \begin{cases} p_i ( 1 - p_i ), & i = j \\ - p_i p_j, & i \ne j \\ \end{cases} }$$ In other words, $i$ and $j$ specify dimensions and identify the components of the vectors. In numerical computation, they are simply indexes into the vectors.
It may be useful to look at the Jacobian matrix of the Softmax function at $\mathbf{z}$. Each row corresponds to a component of the output $\mathbf{p}$, and each column corresponds to the dimension of $\mathbf{z}$ the partial derivative is taken with respect to. In other words, $$\bbox{\mathbf{J}_{i j} = \frac{ \partial p_i }{\partial z_j}, \quad i, j = 1 .. n }$$ $$\bbox{\mathbf{J} = \left [ \begin{matrix} p_1 ( 1 - p_1 ) & - p_1 p_2 & \dots & - p_1 p_{n-1} & -p_1 p_n \\ -p_2 p_1 & p_2 ( 1 - p_2 ) & \dots & -p_2 p_{n-1} & -p_2 p_n \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ -p_{n-1} p_1 & -p_{n-1} p_2 & \dots & p_{n-1} ( 1 - p_{n-1} ) & -p_{n-1} p_n \\ -p_n p_1 & -p_n p_2 & \dots & -p_n p_{n-1} & p_n ( 1 - p_n ) \\ \end{matrix} \right ]}$$ Because the off-diagonal entries satisfy $-p_i p_j = -p_j p_i$, $\mathbf{J}$ is symmetric, $$\bbox{ \mathbf{J}_{i j} = \mathbf{J}_{j i}, \quad i , j = 1 .. n }$$
Programmers are often more comfortable with pseudocode examples:
We can calculate the entire Jacobian matrix of the Softmax function at $\mathbf{z}$ using
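a sketch like the following in Python/NumPy (the function name and the max-shift for numerical stability are my own choices):

```python
import numpy as np

def softmax_jacobian(z):
    """Jacobian of Softmax evaluated at z; J[i, j] = dp_i/dz_j."""
    e = np.exp(z - np.max(z))          # shift by max(z) for numerical stability
    p = e / e.sum()                    # p = Softmax(z)
    # J[i, j] = p_i * (delta_ij - p_j): p_i(1 - p_i) on the diagonal, -p_i p_j off it
    return np.diag(p) - np.outer(p, p)
```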
or, using $\mathbf{p} = \text{Softmax}(\mathbf{z})$ if that is already calculated,
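a sketch along these lines (again, the name is illustrative):

```python
import numpy as np

def softmax_jacobian_from_p(p):
    """Jacobian of Softmax, built directly from p = Softmax(z)."""
    p = np.asarray(p)
    # J[i, j] = p_i * (delta_ij - p_j)
    return np.diag(p) - np.outer(p, p)
```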
Individual partial derivatives are calculated using
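a minimal single-entry routine, assuming $\mathbf{p}$ is given (the name `softmax_partial` is mine):

```python
def softmax_partial(p, i, j):
    """Single entry dp_i/dz_j of the Softmax Jacobian, given p = Softmax(z)."""
    if i == j:
        return p[i] * (1.0 - p[i])
    return -p[i] * p[j]
```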
but note that, by using the matrix symmetry, we can do the calculations much more efficiently: calculate each diagonal value once per row, and in an inner subloop calculate only the off-diagonal values to the left of (and directly above) each diagonal value, mirroring each product into its symmetric position.
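A sketch of that symmetric scheme, assuming $\mathbf{p}$ is given (names are mine):

```python
import numpy as np

def softmax_jacobian_symmetric(p):
    """Jacobian from p = Softmax(z), exploiting symmetry: each off-diagonal
    product -p_i*p_j is computed once and written to both mirror positions."""
    p = np.asarray(p)
    n = len(p)
    J = np.empty((n, n))
    for i in range(n):
        J[i, i] = p[i] * (1.0 - p[i])          # diagonal value
        for j in range(i):                      # entries left of / above the diagonal
            J[i, j] = J[j, i] = -p[i] * p[j]    # fill both symmetric positions
    return J
```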
Whether calculating the full Jacobian matrix is useful depends on what you need the partial derivatives for. If you only need the elementwise gradient of $\mathbf{p}$ ($\nabla\mathbf{p}$), i.e. the diagonal entries $p_i ( 1 - p_i )$, a dedicated gradient routine is cheaper than
Jacobian_matrix_of_Softmaxed(). Note that such a routine, if you don't otherwise need $\text{Softmax}(\mathbf{z})$, can reuse the Softmax output's storage for the gradient.
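One way such a gradient routine might look (a sketch; the name is my own): it computes $\text{Softmax}(\mathbf{z})$ into a buffer and then overwrites that same buffer with the diagonal derivatives $p_i ( 1 - p_i )$, so no second array is allocated.

```python
import numpy as np

def softmax_gradient(z):
    """Elementwise derivatives dp_i/dz_i = p_i*(1 - p_i), i.e. the Jacobian
    diagonal. The buffer holding Softmax(z) is reused for the gradient."""
    p = np.exp(z - np.max(z))   # shift by max(z) for numerical stability
    p /= p.sum()                # p now holds Softmax(z)
    p *= 1.0 - p                # overwrite in place with p_i * (1 - p_i)
    return p
```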