There are several resources that show how to find the derivatives of the softmax + cross_entropy loss together. However, I want to derive the derivatives separately.
For the purposes of this question, I will use a fixed input vector containing 4 values.
Input vector
$$\left [ x_{0}, \quad x_{1}, \quad x_{2}, \quad x_{3}\right ]$$
Softmax Function and Derivative
My softmax function is defined as:
$$\left [ \frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{1}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{2}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}, \quad \frac{e^{x_{3}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}}\right ] $$
Since each element in the vector depends on all the values of the input vector, it makes sense that the gradients for each output element will contain some expression that contains all the input values.
My jacobian is this:
$$ \left[\begin{matrix}\frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} - \frac{e^{2 x_{0}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{0}} e^{x_{1}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{0}} e^{x_{2}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{0}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}}\\- \frac{e^{x_{0}} e^{x_{1}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & \frac{e^{x_{1}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} - \frac{e^{2 x_{1}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{1}} e^{x_{2}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{1}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}}\\- \frac{e^{x_{0}} e^{x_{2}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{1}} e^{x_{2}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & \frac{e^{x_{2}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} - \frac{e^{2 x_{2}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{2}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}}\\- \frac{e^{x_{0}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{1}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{2}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & \frac{e^{x_{3}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} - \frac{e^{2 x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}}\end{matrix}\right] $$
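As a sanity check, this Jacobian can be constructed directly in numpy. This is just a sketch, with function names of my own choosing; it uses the identity $J_{ij} = s_i(\delta_{ij} - s_j)$, which is exactly the matrix written out above.

```python
import numpy as np

def softmax(v):
    exps = np.exp(v - np.max(v))  # shifting by the max is the usual stability trick
    return exps / np.sum(exps)

def softmax_jacobian(v):
    # J[i, j] = ds_i/dx_j = s_i * (delta_ij - s_j):
    # the diagonal holds s_i - s_i^2, the off-diagonals hold -s_i * s_j
    s = softmax(v)
    return np.diag(s) - np.outer(s, s)

x = np.array([1.0, 2.0, 3.0, 4.0])  # an arbitrary 4-element input
J = softmax_jacobian(x)
```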
Each row contains the partial derivatives of one output element with respect to every input. To calculate the 'final' derivative for each node, I sum up all the elements in each row, giving a vector which is the same size as my input vector.
Summing these values directly is numerically unstable, but it is quite easy to reduce the sum of each row to a simpler expression.
Notice that, apart from the first term in each row (the only positive term), summing all the negative terms is equivalent to computing:
$$\sum_{i} softmax_{x_0} \cdot softmax_{x_i} $$
and the first term is just $$ softmax_{x_0} $$
Which means the derivative of softmax is :
$$softmax - softmax^2$$
or
$$softmax(1-softmax)$$
This seems correct, and Geoff Hinton's video (at time 4:07) has this same solution. This answer also seems to get to the same equation as me.
Cross Entropy Loss and its derivative
The cross entropy takes in as input the softmax vector and a 'target' probability distribution.
$$\left [ t_{0}, \quad t_{1}, \quad t_{2}, \quad t_{3}\right ]$$
Let the softmax output at index $i$ be denoted $s_i$, so the full softmax vector is:
$$\left [ s_{0}, \quad s_{1}, \quad s_{2}, \quad s_{3}\right ]$$
Cross entropy function
$$ - \sum_{i}^{\text{classes}} t_i \log(s_i) $$
For our case it is
$$ - t_{0} \log{\left (\frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} \right )} - t_{1} \log{\left (\frac{e^{x_{1}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} \right )} - t_{2} \log{\left (\frac{e^{x_{2}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} \right )} - t_{3} \log{\left (\frac{e^{x_{3}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} \right )} $$
Derivative of cross entropy
Using the constant multiple rule along with the log rule, the derivative of cross entropy with respect to $s_i$ is:
$$ -\frac{t_i}{s_i} $$
Using chain rule to get derivative of softmax with cross entropy
We can just multiply the cross entropy derivative (the loss with respect to the softmax output) by the softmax derivative (the softmax output with respect to the input) to get:
$$ -\frac{t_i}{s_i} * s_i(1-s_i) $$
Simplifying, this gives
$$ -t_i *(1-s_i) $$
Analytically computing derivative of softmax with cross entropy
This document derives the derivative of softmax with cross entropy and it gets:
$$ s_i - t_i $$
Which is different from the one derived using chain rule.
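A central finite-difference approximation of the combined loss gives an assumption-free way to see which of the two expressions matches the true gradient. This is only a sketch and all the names are mine:

```python
import numpy as np

def softmax(v):
    exps = np.exp(v - np.max(v))  # shift by max for numerical stability
    return exps / np.sum(exps)

def loss(v, targets):
    # cross entropy applied to the softmax of v
    return -np.sum(targets * np.log(softmax(v)))

x = np.array([-1.0, -1.0, 1.0])
t = np.array([0.0, 1.0, 0.0])

eps = 1e-6
# perturb each input up and down by eps and take the symmetric difference
numeric_grad = np.array([
    (loss(x + eps * np.eye(3)[i], t) - loss(x - eps * np.eye(3)[i], t)) / (2 * eps)
    for i in range(3)
])
```

Comparing `numeric_grad` against both candidate expressions shows which derivation matches the true gradient.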
Implementation using numpy
I thought perhaps both derivatives would evaluate to the same result, and that I had missed some simplification that could be applied using assumptions (e.g. that probability distributions sum to 1).
This is the code to evaluate:
    import numpy as np

    x = np.array([-1.0, -1.0, 1.0])  # unscaled logits, my x vector
    t = np.array([0.0, 1.0, 0.0])    # target probability distribution

    ## Function definitions
    def softmax(v):
        exps = np.exp(v)
        return exps / np.sum(exps)

    def cross_entropy(inps, targets):
        return np.sum(-targets * np.log(inps))

    def cross_entropy_derivatives(inps, targets):
        return -targets / inps

    def softmax_derivatives(softmax):
        return softmax * (1 - softmax)

    soft = softmax(x)  # [0.10650698, 0.10650698, 0.78698604]
    cross_entropy(soft, t)  # 2.2395447662218846
    cross_der = cross_entropy_derivatives(soft, t)  # [-0. , -9.3890561, -0. ]
    soft_der = softmax_derivatives(soft)  # [0.09516324, 0.09516324, 0.16763901]

    ## Derivative using chain rule
    cross_der * soft_der  # [-0. , -0.89349302, -0. ]

    ## Derivative using analytical derivation
    soft - t  # [ 0.10650698, -0.89349302,  0.78698604]
Notice the difference in values.
My question, to clarify, is: what is the mistake I am making? These two results should be essentially identical.
There are two very obvious and glaring errors in the derivation, which somewhat void the entire question. However, there are still key things that I learned while realising my mistakes that I would like to explain.
Mistakes
1. Softmax Function and its derivative
I incorrectly stated that summing up the columns of the jacobian
is equivalent to doing
$$ \color{red}{softmax(1-softmax)} $$
The sum of the Jacobian column for $s_0$ actually goes like this:
$$ s_0 - \sum_{i} s_0 \cdot s_i $$
Taking $s_0$ common:
$$ s_0 - s_0 \sum_{i} s_i $$
The summation of all $s_i$ terms equals 1 (since the softmax outputs sum to 1).
Therefore we get:
$$ s_0 - s_0 \cdot 1 $$
which is $0$.
So, if the partials are summed up, we get 0. I will get back to why this makes sense later.
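This cancellation is easy to confirm numerically. A sketch, where the Jacobian is built from the identity $J_{ij} = s_i(\delta_{ij} - s_j)$:

```python
import numpy as np

def softmax(v):
    exps = np.exp(v - np.max(v))  # shift by max for numerical stability
    return exps / np.sum(exps)

s = softmax(np.array([-1.0, -1.0, 1.0]))
J = np.diag(s) - np.outer(s, s)  # the softmax Jacobian

# summing the partials for each input collapses to 0
column_sums = J.sum(axis=0)
```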
2. Jacobians shouldn't be summed
The Jacobian matrix should not be summed and then element-wise multiplied by the derivative from the previous layer. Instead, the gradient from the previous layer should be matrix-multiplied with the Jacobian.
This means that the equation
$$ \color{red}{-\frac{t_i}{s_i} * s_i(1-s_i)} $$
which calculates the derivative using the chain rule, is INCORRECT.
It should actually be :
$$ -\frac{\mathbf{t}}{\mathbf{s}} \times Softmax\_Jacobian $$
where $\mathbf{t}$ and $\mathbf{s}$ are vectors, the fraction bar denotes element-wise division between them, and the $\times$ symbol denotes matrix multiplication.
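This matrix product can be checked numerically; a sketch, with variable names of my own:

```python
import numpy as np

def softmax(v):
    exps = np.exp(v - np.max(v))  # shift by max for numerical stability
    return exps / np.sum(exps)

x = np.array([-1.0, -1.0, 1.0])
t = np.array([0.0, 1.0, 0.0])

s = softmax(x)
J = np.diag(s) - np.outer(s, s)  # softmax Jacobian
grad = (-t / s) @ J              # element-wise division, then matrix product
```

The product collapses to $s - t$, matching the analytical derivation.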
Why summing up the partials results in 0
To understand that, we need to first understand what the jacobian matrix signifies.
For element $0,0$, it reads: the amount $s_0$ changes when $x_0$ is changed.
For element $1,0$, it reads: the amount $s_1$ changes when $x_0$ is changed.
For element $2,0$, it reads: the amount $s_2$ changes when $x_0$ is changed.
To get the total amount of change on $x_0$, the above elements can be summed up (meaning we do a sum across the rows).
The same can be said about $x_1$, $x_2$ and $x_3$.
Just summing the columns up is equivalent to doing a matrix multiply between a vector of $1$s and the softmax Jacobian.
This means the Jacobian tells you how much the softmax output would change if you changed all of the input values (i.e. all $x_i$) by the same amount. Since softmax is a normalising function, changing all the inputs by the same amount is equivalent to doing nothing!
In fact, the common "normalising trick" used to stabilise softmax adds a constant to every $x_i$ without changing the softmax output at all.
Since the change is 0, the gradient is 0.
In the case of a Jacobian matrix multiply with the previous layer, different 'weights' are assigned to each element of the Jacobian, which results in them not cancelling out.
Implementation in numpy
Now the derivative using the chain rule and the analytical derivative agree (well within the margin of floating point error).
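The original listing, with the element-wise softmax derivative replaced by the full Jacobian, might look like this sketch:

```python
import numpy as np

x = np.array([-1.0, -1.0, 1.0])  # unscaled logits
t = np.array([0.0, 1.0, 0.0])    # target probability distribution

def softmax(v):
    exps = np.exp(v - np.max(v))  # shift by max for numerical stability
    return exps / np.sum(exps)

def softmax_jacobian(s):
    # full Jacobian, not just the diagonal s * (1 - s)
    return np.diag(s) - np.outer(s, s)

def cross_entropy_derivatives(inps, targets):
    return -targets / inps

s = softmax(x)

## Derivative using chain rule (a vector-Jacobian matrix product)
chain_grad = cross_entropy_derivatives(s, t) @ softmax_jacobian(s)

## Derivative using analytical derivation
analytic_grad = s - t
```

Both gradients now agree, and both match the `[0.10650698, -0.89349302, 0.78698604]` result from the analytical derivation in the question.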