I have a very simple function (a neural network, actually) whose derivative I want to determine.
Let $\mathbf{x}\in\mathbb{R}^{1\times n}$, $\mathbf{W}=[W_1 \cdots W_i \cdots W_m]\in\mathbb{R}^{n\times m}$, and let $\sigma$ be the sigmoid function ($\sigma(x)=\frac{1}{1+\exp(-x)}$).
Let $\mathbf{z}=\sigma(\mathbf{x}\cdot \mathbf{W})$ (applied element-wise), and let $\mathbf{y}=\text{softmax}(\mathbf{z})$, so that $y_i=\frac{\exp(z_i)}{\sum_k \exp(z_k)}$ for $i=1,\ldots,m$. What I am trying to derive is $\frac{\partial E}{\partial W_{ik}}$ (where $W_{ik}$ denotes the $k$-th component of the column $W_i$), where
$$E=-\sum_{j=1}^m t_j\log y_j$$
and $t_1,\ldots,t_m$ is a probability distribution.
My result is that
$$\frac{\partial E}{\partial W_{ik}}=\frac{\partial E}{\partial z_i}\frac{\partial z_i}{\partial{W_{ik}}}$$
The first term $\frac{\partial E}{\partial z_i}$ works out to $y_i-t_i$ (this should be correct, since $\sum_j t_j=1$), while the second term should clearly be $\sigma'(\mathbf{x}\cdot W_i)\,x_k=\sigma(\mathbf{x}\cdot W_i)\bigl(1-\sigma(\mathbf{x}\cdot W_i)\bigr)x_k$.
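For reference, here is the chain-rule step behind the first term, using $\frac{\partial y_j}{\partial z_i}=y_j(\delta_{ij}-y_i)$ and the fact that $\sum_{j} t_j=1$:

$$\frac{\partial E}{\partial z_i}=-\sum_{j=1}^m \frac{t_j}{y_j}\frac{\partial y_j}{\partial z_i}=-\sum_{j=1}^m t_j(\delta_{ij}-y_i)=y_i\sum_{j=1}^m t_j-t_i=y_i-t_i.$$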
However, when I compare my analytic derivative against a numerical gradient check, my result is always off. Does anyone see a mistake?
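For concreteness, here is a minimal NumPy sketch of the gradient check I have in mind (the random $\mathbf{x}$, $\mathbf{W}$, $\mathbf{t}$ and the shapes are just placeholders, not my actual code). It computes the analytic gradient $\frac{\partial E}{\partial W_{ik}}=(y_i-t_i)\,\sigma'(\mathbf{x}\cdot W_i)\,x_k$ and compares it entry-wise against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3

x = rng.standard_normal((1, n))   # input row vector, shape (1, n)
W = rng.standard_normal((n, m))   # weight matrix, columns W_1 .. W_m
t = rng.random(m)
t /= t.sum()                      # target probability distribution

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(W):
    """E = -sum_j t_j log y_j with z = sigma(x.W), y = softmax(z)."""
    z = sigmoid(x @ W)                    # shape (1, m)
    y = np.exp(z) / np.exp(z).sum()       # softmax over z
    return -np.sum(t * np.log(y))

# Analytic gradient: dE/dW[k, i] = (y_i - t_i) * sigma'(x.W_i) * x_k
z = sigmoid(x @ W)
y = np.exp(z) / np.exp(z).sum()
grad_analytic = x.T @ ((y - t) * z * (1 - z))   # outer product, shape (n, m)

# Central-difference numerical gradient, one entry at a time
eps = 1e-6
grad_numeric = np.zeros_like(W)
for k in range(n):
    for i in range(m):
        Wp, Wm = W.copy(), W.copy()
        Wp[k, i] += eps
        Wm[k, i] -= eps
        grad_numeric[k, i] = (loss(Wp) - loss(Wm)) / (2 * eps)

# Max entry-wise discrepancy between the two gradients
print(np.max(np.abs(grad_analytic - grad_numeric)))
```

If the analytic formula is right, the printed difference should be near machine precision; a discrepancy concentrated in particular rows or columns would point to an indexing mix-up between the row index $k$ and the column index $i$.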