I am trying to calculate the softmax gradient: $$p_j=[f(\vec{x})]_j = \frac{e^{W_jx+b_j}}{\sum_k e^{W_kx+b_k}}$$ With the cross-entropy error: $$L = -\sum_j y_j \log p_j$$ Using this question I get that $$\frac{\partial L}{\partial o_i} = p_i - y_i$$ Where $o_i=W_ix+b_i$
So, by applying the chain rule I get to: $$\frac{\partial L}{\partial b_i}=\frac{\partial L}{\partial o_i}\frac{\partial o_i}{\partial b_i} = (p_i - y_i)\cdot 1=p_i - y_i$$ Which makes sense (dimensionality-wise) $$\frac{\partial L}{\partial W_i}=\frac{\partial L}{\partial o_i}\frac{\partial o_i}{\partial W_i} = (p_i - y_i)\vec{x}$$ Which has a dimensionality mismatch
(for example if dimensions are $W_{3\times 4},\vec{b}_3,\vec{x}_4$, so that $W_i$ is a $1\times 4$ row but $\vec x$ is a $4\times 1$ column)
What am I doing wrong, and what is the correct gradient?
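For reference, the identity $\frac{\partial L}{\partial o_i} = p_i - y_i$ can be sanity-checked with a finite-difference test. This is just a sketch with made-up dimensions and random data, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
o = rng.normal(size=k)            # logits o = Wx + b
y = np.zeros(k)
y[2] = 1.0                        # one-hot target

def loss(o):
    # cross-entropy of softmax(o); shift by max for numerical stability
    p = np.exp(o - o.max())
    p /= p.sum()
    return -np.sum(y * np.log(p))

# analytic gradient: p - y
p = np.exp(o - o.max())
p /= p.sum()
analytic = p - y

# central finite differences in each coordinate
eps = 1e-6
numeric = np.array([
    (loss(o + eps * np.eye(k)[i]) - loss(o - eps * np.eye(k)[i])) / (2 * eps)
    for i in range(k)
])
print(np.max(np.abs(analytic - numeric)))  # prints a tiny finite-difference error
```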
The dimension mismatch appears when you apply the chain rule. When taking the derivative with respect to $W_i$ (which denotes the $i$-th row of $W$, right?), we have the maps $$ W_i \in \mathbf R^{1 \times k} \mapsto o_i = W_ix+b_i \in \mathbf R \mapsto L \in \mathbf R, $$ hence a function $\mathbf R^{1 \times k} \to \mathbf R$. Therefore the derivative is a map $$ \mathbf R^{1 \times k} \to L(\mathbf R^{1 \times k}, \mathbf R)$$ which assigns to each point $W_i \in \mathbf R^{1 \times k}$ a linear map $\mathbf R^{1 \times k} \to \mathbf R$.

The chain rule tells us that for $h \in \mathbf R^{1 \times k}$ we have $$ \def\pd#1#2{\frac{\partial #1}{\partial #2}}\pd{L}{W_i}h = \pd{L}{o_i}\cdot \pd{o_i}{W_i}h $$ Now, as $W_i \mapsto o_i$ is affine, its derivative at any point equals its linear part, that is $$ \pd{o_i}{W_i}h = hx, \qquad h \in \mathbf R^{1 \times k} $$ Therefore $$ \pd L{W_i}h = (p_i - y_i)hx $$ that is, $\pd{L}{W_i}$ is the linear map $$ \mathbf R^{1 \times k} \ni h \mapsto (p_i - y_i)hx $$

If you identify this linear map with the row vector representing it, you get $\pd{L}{W_i} = (p_i - y_i)x^\top \in \mathbf R^{1 \times k}$, and stacking these rows gives the full gradient $\pd{L}{W} = (p - y)x^\top$, which has the same shape as $W$.
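The resulting formula $\frac{\partial L}{\partial W} = (p - y)x^\top$ can be verified numerically. A minimal sketch, with arbitrarily chosen shapes ($W$ is $3\times 4$, so $x \in \mathbf R^4$, $b \in \mathbf R^3$) and random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 3, 4                       # W is n x k, x in R^k, b in R^n
W = rng.normal(size=(n, k))
b = rng.normal(size=n)
x = rng.normal(size=k)
y = np.zeros(n)
y[1] = 1.0                        # one-hot target

def loss(W):
    # cross-entropy of softmax(Wx + b)
    o = W @ x + b
    p = np.exp(o - o.max())
    p /= p.sum()
    return -np.sum(y * np.log(p))

# analytic gradient: (p - y) x^T, an outer product of shape n x k
o = W @ x + b
p = np.exp(o - o.max())
p /= p.sum()
analytic = np.outer(p - y, x)

# central finite differences, entry by entry
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(k):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # prints a tiny finite-difference error
```

Note that `np.outer(p - y, x)` is exactly the matrix whose $i$-th row is $(p_i - y_i)x^\top$, matching the row-by-row derivation above.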