The following is a lecture slide from a machine learning class:
Cross Entropy
For classification tasks, the target $t$ is either $0$ or $1$, so it is better to use $$E=-t\log(z)-(1-t)\log(1-z).$$ This can be justified mathematically, and works well in practice -- especially when negative examples vastly outweigh positive ones. It also makes the backprop computations simpler: $$\begin{align}\frac{\partial E}{\partial z}&=\frac{z-t}{z(1-z)},\\ \text{and if}\qquad z&=\frac{1}{1+e^{-s}},\\ \frac{\partial E}{\partial s}&=\frac{\partial E}{\partial z}\frac{\partial z}{\partial s}=z-t.\end{align}$$
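For what it's worth, before asking I checked numerically (my own sketch, stdlib Python only, not from the slide) that the slide's final formula is at least correct as a number: a central finite difference of $E(z(s))$ with respect to $s$ does come out to $z - t$.

```python
import math

def numerical_check(s=0.7, t=1.0, h=1e-6):
    """Compare a finite-difference estimate of dE/ds with the slide's z - t."""
    # Sigmoid and cross-entropy exactly as defined on the slide
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    E = lambda x: -t * math.log(sigmoid(x)) - (1 - t) * math.log(1 - sigmoid(x))
    # Central finite difference approximates the derivative of E w.r.t. s
    dE_ds_numeric = (E(s + h) - E(s - h)) / (2 * h)
    dE_ds_formula = sigmoid(s) - t  # the slide's claimed z - t
    return dE_ds_numeric, dE_ds_formula
```

The two values agree to roughly the accuracy of the finite difference (the values of $s$, $t$, and $h$ here are arbitrary test inputs I picked), so my question is purely about the notation, not the result.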
My understanding was that error functions are functions of a single variable. If so, then $E$ in the above slide is a function of $z$ only, and $z$ is a function of $s$ only. Therefore, by the chain rule, shouldn't we have $\dfrac{ dE }{ ds } = \dfrac{ dE }{ dz } \dfrac{ dz }{ ds }$? After all, if what I claimed is correct, then none of these functions are multivariable functions -- rather, they are compositions of functions (nested functions).
But even if we assume that $E$ is a function of both $z$ and $t$, the slide's notation still doesn't make sense to me. Why? Because $z$ is a function of one variable only -- it is a function of $s$. So shouldn't we then have $\dfrac{ \partial{E} }{ \partial{s} } = \dfrac{ \partial{E} }{ \partial{z} } \dfrac{ dz }{ ds }$, with a total derivative $\dfrac{ dz }{ ds }$ rather than the partial $\dfrac{ \partial z }{ \partial s }$?
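To convince myself that this is only a notational issue, I also checked numerically (again my own sketch, with arbitrarily chosen test inputs) that the two factors -- the derivative of $E$ with respect to $z$ holding $t$ fixed, and the derivative of $z$ with respect to $s$ -- multiply out to the same number as differentiating the composition $E(z(s))$ directly, whichever way one writes the symbols:

```python
import math

def chain_rule_check(s=0.4, t=1.0, h=1e-6):
    """Check that (dE/dz) * (dz/ds) matches a direct finite difference of E(z(s))."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    # E as a function of z alone, with t held fixed
    E_of_z = lambda z: -t * math.log(z) - (1 - t) * math.log(1 - z)
    z = sigmoid(s)
    # Factor 1: derivative of E with respect to z
    dE_dz = (E_of_z(z + h) - E_of_z(z - h)) / (2 * h)
    # Factor 2: derivative of z with respect to s
    dz_ds = (sigmoid(s + h) - sigmoid(s - h)) / (2 * h)
    # Direct derivative of the composition E(z(s)) with respect to s
    dE_ds = (E_of_z(sigmoid(s + h)) - E_of_z(sigmoid(s - h))) / (2 * h)
    return dE_dz * dz_ds, dE_ds
```

The product and the direct derivative agree (both equal $z - t$ up to finite-difference error), so the computation itself is not in dispute -- only whether $\partial$ or $d$ is the right symbol.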
I would greatly appreciate it if someone could take the time to clarify this.