I am trying to calculate the softmax gradient: $$p_j=[f(\vec{x})]_j = \frac{e^{W_jx+b_j}}{\sum_k e^{W_kx+b_k}}$$ With the cross-entropy error: $$L = -\sum_j y_j \log p_j$$ Using this question I get that $$\frac{\partial L}{\partial o_i} = p_i - y_i$$ Where $o_i=W_ix+b_i$
So, by applying the chain rule I get to: $$\frac{\partial L}{\partial b_i}=\frac{\partial L}{\partial o_i}\frac{\partial o_i}{\partial b_i} = (p_i - y_i)\cdot 1=p_i - y_i$$ Which makes sense (dimensionality-wise) $$\frac{\partial L}{\partial W_i}=\frac{\partial L}{\partial o_i}\frac{\partial o_i}{\partial W_i} = (p_i - y_i)\vec{x}$$ Which has a dimensionality mismatch
(for example if dimensions are $W_{3\times 4},\vec{b}_3,\vec{x}_4$, so that $W_i$ is a $1\times 4$ row but $\vec x$ is a $4\times 1$ column)
What am I doing wrong, and what is the correct gradient?
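For reference, the identity $\frac{\partial L}{\partial o_i} = p_i - y_i$ can be sanity-checked with a finite-difference test. This is just a sketch with made-up dimensions and random data, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
o = rng.normal(size=k)            # logits o = Wx + b
y = np.zeros(k)
y[2] = 1.0                        # one-hot target

def loss(o):
    # cross-entropy of softmax(o); shift by max for numerical stability
    p = np.exp(o - o.max())
    p /= p.sum()
    return -np.sum(y * np.log(p))

# analytic gradient: p - y
p = np.exp(o - o.max())
p /= p.sum()
analytic = p - y

# central finite differences in each coordinate
eps = 1e-6
numeric = np.array([
    (loss(o + eps * np.eye(k)[i]) - loss(o - eps * np.eye(k)[i])) / (2 * eps)
    for i in range(k)
])
print(np.max(np.abs(analytic - numeric)))  # prints a tiny finite-difference error
```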
The dimension mismatch appears when you apply the chain rule. When taking the derivative with respect to $W_i$ (which denotes the $i$-th row of $W$, right?), we have the maps $$ W_i \in \mathbf R^{1 \times k} \mapsto o_i = W_ix+b_i \in \mathbf R \mapsto L \in \mathbf R, $$ hence a function $\mathbf R^{1 \times k} \to \mathbf R$. Therefore the derivative is a map $$ \mathbf R^{1 \times k} \to L(\mathbf R^{1 \times k}, \mathbf R)$$ which assigns to each point $W_i \in \mathbf R^{1 \times k}$ a linear map $\mathbf R^{1 \times k} \to \mathbf R$.

The chain rule tells us that for $h \in \mathbf R^{1 \times k}$ we have $$ \def\pd#1#2{\frac{\partial #1}{\partial #2}}\pd{L}{W_i}h = \pd{L}{o_i}\cdot \pd{o_i}{W_i}h $$ Now, as $W_i \mapsto o_i$ is affine, its derivative at any point equals its linear part, that is $$ \pd{o_i}{W_i}h = hx, \qquad h \in \mathbf R^{1 \times k} $$ Therefore $$ \pd L{W_i}h = (p_i - y_i)hx $$ that is, $\pd{L}{W_i}$ is the linear map $$ \mathbf R^{1 \times k} \ni h \mapsto (p_i - y_i)hx $$

If you identify this linear map with the row vector representing it, you get $\pd{L}{W_i} = (p_i - y_i)x^\top \in \mathbf R^{1 \times k}$, and stacking these rows gives the full gradient $\pd{L}{W} = (p - y)x^\top$, which has the same shape as $W$.
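The resulting formula $\frac{\partial L}{\partial W} = (p - y)x^\top$ can be verified numerically. A minimal sketch, with arbitrarily chosen shapes ($W$ is $3\times 4$, so $x \in \mathbf R^4$, $b \in \mathbf R^3$) and random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 3, 4                       # W is n x k, x in R^k, b in R^n
W = rng.normal(size=(n, k))
b = rng.normal(size=n)
x = rng.normal(size=k)
y = np.zeros(n)
y[1] = 1.0                        # one-hot target

def loss(W):
    # cross-entropy of softmax(Wx + b)
    o = W @ x + b
    p = np.exp(o - o.max())
    p /= p.sum()
    return -np.sum(y * np.log(p))

# analytic gradient: (p - y) x^T, an outer product of shape n x k
o = W @ x + b
p = np.exp(o - o.max())
p /= p.sum()
analytic = np.outer(p - y, x)

# central finite differences, entry by entry
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(k):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (loss(W + E) - loss(W - E)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # prints a tiny finite-difference error
```

Note that `np.outer(p - y, x)` is exactly the matrix whose $i$-th row is $(p_i - y_i)x^\top$, matching the row-by-row derivation above.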