There are several resources that show how to find the derivatives of the softmax + cross_entropy loss together. However, I want to derive the derivatives separately.
For the purposes of this question, I will use a fixed input vector containing 4 values.
Input vector
$$\left [ x_{0}, \quad x_{1}, \quad x_{2}, \quad x_{3}\right ]$$
Softmax Function and Derivative
My softmax function is defined as:
$$\left [ \frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{1}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{2}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{3}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}\right ] $$
Since each element in the vector depends on all the values of the input vector, it makes sense that the gradients for each output element will contain some expression that contains all the input values.
My jacobian is this:
$$ \left[\begin{matrix}\frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} - \frac{e^{2 x_{0}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{0}} e^{x_{1}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{0}} e^{x_{2}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{0}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}}\\- \frac{e^{x_{0}} e^{x_{1}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & \frac{e^{x_{1}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} - \frac{e^{2 x_{1}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{1}} e^{x_{2}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{1}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}}\\- \frac{e^{x_{0}} e^{x_{2}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{1}} e^{x_{2}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & \frac{e^{x_{2}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} - \frac{e^{2 x_{2}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{2}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}}\\- \frac{e^{x_{0}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{1}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{2}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & \frac{e^{x_{3}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} - \frac{e^{2 x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}}\end{matrix}\right] $$
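As a sanity check, this Jacobian can be constructed directly in numpy. This is just a sketch, with function names of my own choosing; it uses the identity $J_{ij} = s_i(\delta_{ij} - s_j)$, which is exactly the matrix written out above.

```python
import numpy as np

def softmax(v):
    exps = np.exp(v - np.max(v))  # shifting by the max is the usual stability trick
    return exps / np.sum(exps)

def softmax_jacobian(v):
    # J[i, j] = ds_i/dx_j = s_i * (delta_ij - s_j):
    # the diagonal holds s_i - s_i^2, the off-diagonals hold -s_i * s_j
    s = softmax(v)
    return np.diag(s) - np.outer(s, s)

x = np.array([1.0, 2.0, 3.0, 4.0])  # an arbitrary 4-element input
J = softmax_jacobian(x)
```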
Each row contains the partial derivatives of one output element with respect to every input. To calculate the 'final' derivative for each node, I sum up all the elements in each row, giving a vector which is the same size as my input vector.
Summing these values directly is numerically unstable, but it is quite easy to reduce the sum of each row to a simpler expression.
Notice that, apart from the first term in each row (the only positive term), summing all the negative terms is equivalent to computing:
$$\sum_{i} softmax_{x_0} \cdot softmax_{x_i} $$
and the first term is just $$ softmax_{x_0} $$
Which means the derivative of softmax is :
$$softmax - softmax^2$$
or
$$softmax(1-softmax)$$
This seems correct, and Geoff Hinton's video (at time 4:07) has this same solution. This answer also seems to get to the same equation as me.
Cross Entropy Loss and its derivative
The cross entropy takes in as input the softmax vector and a 'target' probability distribution.
$$\left [ t_{0}, \quad t_{1}, \quad t_{2}, \quad t_{3}\right ]$$
Let the softmax output at index $i$ be denoted $s_i$, so the full softmax vector is:
$$\left [ s_{0}, \quad s_{1}, \quad s_{2}, \quad s_{3}\right ]$$
Cross entropy function
$$ - \sum_{i}^{\text{classes}} t_i \log(s_i) $$
For our case it is
$$ - t_{0} \log{\left (\frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} \right )} - t_{1} \log{\left (\frac{e^{x_{1}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} \right )} - t_{2} \log{\left (\frac{e^{x_{2}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} \right )} - t_{3} \log{\left (\frac{e^{x_{3}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} \right )} $$
Derivative of cross entropy
Using the constant multiple rule along with the log rule, the derivative of cross entropy with respect to $s_i$ is:
$$ -\frac{t_i}{s_i} $$
Using chain rule to get derivative of softmax with cross entropy
We can just multiply the cross entropy derivative (the loss with respect to the softmax output) by the softmax derivative (the softmax output with respect to the input) to get:
$$ -\frac{t_i}{s_i} * s_i(1-s_i) $$
Simplifying, this gives
$$ -t_i *(1-s_i) $$
Analytically computing derivative of softmax with cross entropy
This document derives the derivative of softmax with cross entropy and it gets:
$$ s_i - t_i $$
Which is different from the one derived using chain rule.
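A central finite-difference approximation of the combined loss gives an assumption-free way to see which of the two expressions matches the true gradient. This is only a sketch and all the names are mine:

```python
import numpy as np

def softmax(v):
    exps = np.exp(v - np.max(v))  # shift by max for numerical stability
    return exps / np.sum(exps)

def loss(v, targets):
    # cross entropy applied to the softmax of v
    return -np.sum(targets * np.log(softmax(v)))

x = np.array([-1.0, -1.0, 1.0])
t = np.array([0.0, 1.0, 0.0])

eps = 1e-6
# perturb each input up and down by eps and take the symmetric difference
numeric_grad = np.array([
    (loss(x + eps * np.eye(3)[i], t) - loss(x - eps * np.eye(3)[i], t)) / (2 * eps)
    for i in range(3)
])
```

Comparing `numeric_grad` against both candidate expressions shows which derivation matches the true gradient.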
Implementation using numpy
I thought perhaps both derivatives would evaluate to the same result, and that I had missed some simplification that could be applied using assumptions (e.g. that probability distributions sum to 1).
This is the code to evaluate:
    import numpy as np

    x = np.array([-1.0, -1.0, 1.0])  # unscaled logits, my x vector
    t = np.array([0.0, 1.0, 0.0])    # target probability distribution

    ## Function definitions
    def softmax(v):
        exps = np.exp(v)
        return exps / np.sum(exps)

    def cross_entropy(inps, targets):
        return np.sum(-targets * np.log(inps))

    def cross_entropy_derivatives(inps, targets):
        return -targets / inps

    def softmax_derivatives(softmax):
        return softmax * (1 - softmax)

    soft = softmax(x)  # [0.10650698, 0.10650698, 0.78698604]
    cross_entropy(soft, t)  # 2.2395447662218846
    cross_der = cross_entropy_derivatives(soft, t)  # [-0. , -9.3890561, -0. ]
    soft_der = softmax_derivatives(soft)  # [0.09516324, 0.09516324, 0.16763901]

    ## Derivative using chain rule
    cross_der * soft_der  # [-0. , -0.89349302, -0. ]

    ## Derivative using analytical derivation
    soft - t  # [ 0.10650698, -0.89349302,  0.78698604]
Notice the difference in values.
My question, to clarify, is: what is the mistake I am making? These two results should be essentially identical.
There are two very obvious and glaring errors in the derivation, which somewhat void the entire question. However, there are still key things that I learned while realising my mistakes that I would like to explain.
Mistakes
1. Softmax Function and its derivative
I incorrectly stated that summing up the columns of the jacobian
is equivalent to doing
$$ \color{red}{softmax(1-softmax)} $$
The sum of the Jacobian column for $s_0$ actually goes like this:
$$ s_0 - \sum_{i} s_0 \cdot s_i $$
Taking $s_0$ common:
$$ s_0 - s_0 \sum_{i} s_i $$
The summation of all $s_i$ terms equals 1 (since the softmax outputs sum to 1).
Therefore we get:
$$ s_0 - s_0 \cdot 1 $$
which is $0$.
So, if the partials are summed up, we get 0. I will get back to why this makes sense later.
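This cancellation is easy to confirm numerically. A sketch, where the Jacobian is built from the identity $J_{ij} = s_i(\delta_{ij} - s_j)$:

```python
import numpy as np

def softmax(v):
    exps = np.exp(v - np.max(v))  # shift by max for numerical stability
    return exps / np.sum(exps)

s = softmax(np.array([-1.0, -1.0, 1.0]))
J = np.diag(s) - np.outer(s, s)  # the softmax Jacobian

# summing the partials for each input collapses to 0
column_sums = J.sum(axis=0)
```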
2. Jacobians shouldn't be summed
The Jacobian matrix should not be summed and then element-wise multiplied by the derivative from the previous layer. Instead, the gradient from the previous layer should be matrix-multiplied with the Jacobian.
This means that the equation
$$ \color{red}{-\frac{t_i}{s_i} * s_i(1-s_i)} $$
which calculates the derivative using the chain rule, is INCORRECT.
It should actually be :
$$ -\frac{\mathbf{t}}{\mathbf{s}} \times Softmax\_Jacobian $$
where $\mathbf{t}$ and $\mathbf{s}$ are vectors, the fraction bar denotes element-wise division between them, and the $\times$ symbol denotes matrix multiplication.
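This matrix product can be checked numerically; a sketch, with variable names of my own:

```python
import numpy as np

def softmax(v):
    exps = np.exp(v - np.max(v))  # shift by max for numerical stability
    return exps / np.sum(exps)

x = np.array([-1.0, -1.0, 1.0])
t = np.array([0.0, 1.0, 0.0])

s = softmax(x)
J = np.diag(s) - np.outer(s, s)  # softmax Jacobian
grad = (-t / s) @ J              # element-wise division, then matrix product
```

The product collapses to $s - t$, matching the analytical derivation.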
Why summing up the partials results in 0
To understand that, we need to first understand what the jacobian matrix signifies.
For element $0,0$, it reads: the amount $s_0$ changes when $x_0$ is changed.
For element $1,0$, it reads: the amount $s_1$ changes when $x_0$ is changed.
For element $2,0$, it reads: the amount $s_2$ changes when $x_0$ is changed.
To get the total amount of change on $x_0$, the above elements can be summed up (meaning we do a sum across the rows).
The same can be said about $x_1$, $x_2$ and $x_3$.
Just summing the columns up is equivalent to doing a matrix multiply between a vector of $1$s and the softmax Jacobian.
This means the Jacobian tells you how much the softmax output would change if you changed all of the input values (i.e. all $x_i$) by the same amount. Since softmax is a normalising function, changing all the inputs by the same amount is equivalent to doing nothing!
In fact, the common "normalising trick" used to stabilise softmax adds a constant to every $x_i$ without changing the softmax output at all.
Since the change is 0, the gradient is 0.
In the case of a Jacobian matrix multiply with the previous layer, different 'weights' are assigned to each element of the Jacobian, which results in them not cancelling out.
Implementation in numpy
Now the derivative using the chain rule and the analytical derivative agree (well within the margin of floating point error).
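The original listing, with the element-wise softmax derivative replaced by the full Jacobian, might look like this sketch:

```python
import numpy as np

x = np.array([-1.0, -1.0, 1.0])  # unscaled logits
t = np.array([0.0, 1.0, 0.0])    # target probability distribution

def softmax(v):
    exps = np.exp(v - np.max(v))  # shift by max for numerical stability
    return exps / np.sum(exps)

def softmax_jacobian(s):
    # full Jacobian, not just the diagonal s * (1 - s)
    return np.diag(s) - np.outer(s, s)

def cross_entropy_derivatives(inps, targets):
    return -targets / inps

s = softmax(x)

## Derivative using chain rule (a vector-Jacobian matrix product)
chain_grad = cross_entropy_derivatives(s, t) @ softmax_jacobian(s)

## Derivative using analytical derivation
analytic_grad = s - t
```

Both gradients now agree, and both match the `[0.10650698, -0.89349302, 0.78698604]` result from the analytical derivation in the question.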