I need your help in understanding the following problem:
Given equations (80) and (78), one needs to derive equation (81) using the chain rule from calculus. However, where does the $y_j$ come from? This term does not appear in either equation (80) or (78). These equations are stated in Neural Networks and Deep Learning.
Please advise.
Thanks in advance.
Let the $j$th softmax activation be $$a_j=\frac{\exp(z_j)}{S},\qquad S=\sum_{t\in\mathcal{O}} \exp(z_t)$$ over the output units $\mathcal{O}$, with weighted input $$ z_k = \sum_{i\in\mathcal{I}} w_{ki}\tilde{a}_i + b_k $$ over the inputs $\mathcal{I}$. Writing $y$ for the index of the correct class, the log-likelihood cost is $$ C = -\ln(a_y)= -\left[ \sum_{i\in\mathcal{I}} w_{yi}\tilde{a}_i + b_y \right] + \ln(S), $$ with derivative (using $\partial z_t/\partial b_j = \delta_{tj}$) \begin{align} \frac{\partial C}{\partial b_j} &= -\delta_{yj} + \frac{1}{S}\frac{\partial S}{\partial b_j}\\ &= -{y_j} + \frac{1}{S}\sum_{t\in\mathcal{O}}\exp(z_t) \frac{\partial z_t}{\partial b_j} \\ &= -{y_j} + \frac{\exp(z_j)}{S} \\ &= a_j - y_j. \end{align} The Kronecker delta $\delta_{yj}$ is exactly the $j$th component of the one-hot target vector $y$: $y_j = 1$ if $j=y$ and $y_j = 0$ otherwise. That is where the $y_j$ in equation (81) comes from.
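You can sanity-check the result $\partial C/\partial b_j = a_j - y_j$ numerically. The sketch below (names `z_pre`, `b`, `y` are my own, not from the book) compares the analytic gradient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n_out = 5
z_pre = rng.normal(size=n_out)  # the fixed w.a~ part of z
b = rng.normal(size=n_out)      # biases we differentiate with respect to
y = 2                           # index of the correct class

def cost(b):
    """Log-likelihood cost C = -ln(a_y) with softmax activations."""
    z = z_pre + b
    a = np.exp(z) / np.exp(z).sum()
    return -np.log(a[y])

# Analytic gradient: a_j - y_j, with y one-hot.
z = z_pre + b
a = np.exp(z) / np.exp(z).sum()
y_onehot = np.eye(n_out)[y]
grad_analytic = a - y_onehot

# Central finite differences for each bias component.
eps = 1e-6
grad_numeric = np.array([
    (cost(b + eps * np.eye(n_out)[j]) - cost(b - eps * np.eye(n_out)[j])) / (2 * eps)
    for j in range(n_out)
])

print(np.max(np.abs(grad_analytic - grad_numeric)))  # ~0 up to round-off
```

The two gradients agree to within finite-difference round-off, confirming that the $y_j$ term really is the one-hot indicator $\delta_{yj}$.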
The author remarks in a (cryptic) sidenote that $y$ is the vector of zeros everywhere except for a 1 at the $y$th position (the correct class), so its $j$th component is $y_j=\delta_{yj}$.