I'm new to deep learning and am attempting to calculate the derivative of the following function with respect to the matrix $w$:
$$p(a) = \frac{e^{w_a^Tx}}{\Sigma_{d} e^{w_d^Tx}}$$
Using the quotient rule, I get: $$\frac{\partial p(a)}{\partial w} = \frac{xe^{w_a^Tx}\Sigma_{d} e^{w_d^Tx} - e^{w_a^Tx}\Sigma_{d} xe^{w_d^Tx}}{[\Sigma_{d} e^{w_d^Tx}]^2} = 0$$
I believe I'm doing something wrong, since the softmax function is commonly used as an activation function in deep learning (and thus cannot have a derivative that is identically 0). I've gone over similar questions, but they seem to gloss over this part of the calculation.
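For what it's worth, a quick finite-difference check (just a rough sketch, with arbitrary sizes and NumPy names of my own choosing) shows the derivative is clearly not zero, so I assume the mistake is somewhere in my algebra above:

```python
import numpy as np

def softmax(W, x):
    z = W.T @ x                    # scores z_d = w_d^T x
    e = np.exp(z - z.max())        # shift by the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))        # arbitrary: 4 features, 3 classes
x = rng.normal(size=4)

a, i, j = 0, 1, 2                  # perturb a single entry W[i, j], watch p(a)
h = 1e-6
W2 = W.copy()
W2[i, j] += h
print((softmax(W2, x)[a] - softmax(W, x)[a]) / h)   # nonzero in general
```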
I'd appreciate any pointers in the right direction.
Denote the elementwise (Hadamard) product by $A\odot B$, the inner (Frobenius) product by $A:B$, and the regular matrix product by $AB$.
Let $u$ be the vector of all ones, and define some additional vectors
$$\eqalign{
z &= W^Tx, &\,\,dz = dW^Tx \cr
e &= \exp(z), &\,\,de = e\odot dz \cr
}$$

Now find the differential of the $p$-vector
$$\eqalign{
p &= \frac{e}{u:e} \cr
dp &= \frac{de}{u:e} - \frac{e\,(u:de)}{(u:e)^2} \cr
&= \frac{e\odot dz}{u:e} - \frac{p\,\big(u:(e\odot dz)\big)}{u:e} \cr
&= p\odot dz - p\,(p:dz) \cr
&= \Big({\rm Diag}(p) - pp^T\Big)\,dz \cr
&= (P - pp^T)\,dz \cr
&= (P - pp^T)\,dW^T\,x \cr
}$$

Note that the gradient of a vector with respect to a matrix is a 3rd-order tensor. Continuing,
$$\eqalign{
dp &= (P-pp^T)\,{\mathcal E}\,x^T:dW^T \cr
&= (P-pp^T)\,{\mathcal E}\,x^T:{\mathcal B}:dW \cr
\frac{\partial p}{\partial W} &= (P-pp^T)\,{\mathcal E}\,x^T:{\mathcal B} \cr
}$$
where $({\mathcal E}, {\mathcal B})$ are 4th-order isotropic tensors whose components are
$$\eqalign{
{\mathcal E}_{ijkl} &= \delta_{ik}\,\delta_{jl} \cr
{\mathcal B}_{ijkl} &= \delta_{il}\,\delta_{jk} \cr
}$$