I am going through Data Mining by Ian Witten and I am having trouble with a certain development of an equation (p. 427).
Here it is:
$$\frac{\partial}{\partial a_j}(L) = \frac{\partial}{\partial a_j} \Biggl[-\sum_{k=1}^K {y_k\biggl[ a_k - log\biggl[ \sum_{c=1}^{K}exp(a_c) \biggl] \biggl]}$$ $$= -\biggl[ y_{k=j} - \frac{exp(a_{k=j})}{\sum_{c=1}^{K}exp(a_c)} \biggl] $$
Whereas i would write something like
$$ - \sum_{k=1}^K {y_k\biggl[ \Bbb 1_{k=j} - \frac{exp(a_{j})}{\sum_{c=1}^{K}exp(a_c)} \biggl]} $$ where $\Bbb 1_{k=j} = 1 $ if $k = j$ and 0 otherwise.
Here $L$ and $a_k$ and $y_k$ are scalars with $j$ and $k$ in $1...K$ for some $K$.
I went through this calculation many times and i still can't see how the author got that result.
Let's use the convention where uppercase latin letters are matrices, lowercase are vectors, and greek letters are scalars.
Define some auxiliary variables and their differentials $$\eqalign{ p &= \exp(a) &\implies dp = p\circ da \cr \theta &= 1:p &\implies d\theta = 1:dp \cr \beta &= \log(\theta) &\implies d\beta = \theta^{-1}d\theta \cr }$$ where functions are applied element-wise, $1$ is a vector with all elements equal to unity, and the symbols $(\circ)/(:)$ denote the Hadamard/Frobenius products respectively, e.g. $$\eqalign{ &c = a\circ b &\implies c_k = a_kb_k \cr &\lambda = a:b &= \sum_k a_kb_k \cr &x:(y\circ z) &= (x\circ y):z \cr\cr }$$
Now take the differential of the cost function and back-substitute until everything is in terms of $da$. $$\eqalign{ L &= y:(\beta 1-a) \cr\cr dL &= (y:1)d\beta-y:da \cr &= d\beta - y:da \cr &= \theta^{-1}d\theta - y:da \cr &= \theta^{-1}1:dp - y:da \cr &= \theta^{-1}1:(p\circ da) - y:da \cr &= \theta^{-1}(p\circ 1):da - y:da \cr &= (\theta^{-1}p - y):da \cr\cr \frac{\partial L}{\partial a} &= \theta^{-1}p - y \cr &= \frac{\exp(a)}{1:\exp(a)} - y \cr\cr }$$ If you are only interested in the $j$-th component of the gradient, then multiply this result by the corresponding basis vector $\{e_j\}$ $$\eqalign{ e_j^T\,\bigg(\frac{\partial L}{\partial a}\bigg) &= \frac{\partial L}{\partial a_j} \cr &= \theta^{-1}p_j - y_j \cr }$$