Differentiating the Cross-Entropy Error in Bishop's Machine Learning, Chapter on Neural Networks


I have been trying to solve this problem from Bishop's Machine Learning, Chapter 5, for the past few hours, but I am confused about how to show the identity below. I know I have to take a partial derivative with respect to $a_k$, but I don't know how.

Show that the derivative of the error function $$E\left(w\right)=-\sum _{g=1}^G \left(t_g \ln\left(y\left(a_g\right)\right)+\left(1-t_g\right) \ln\left(1-y\left(a_g\right)\right)\right) $$ with respect to $a_g$, for an output with logistic sigmoid activation function $y\left(a_g\right)=\sigma\left(a_g\right)$, satisfies

$$ \frac{\partial E}{\partial a_g}=y\left(a_g\right)-t_{g}, \qquad \text{given that} \quad \frac{\partial \sigma\left(a_g\right)}{\partial a_g}=\sigma\left(a_g\right)\left(1-\sigma\left(a_g\right)\right). $$


First, differentiate the equation for $E$ with respect to $a_k$ (treating $t_g$ as a constant): $$\frac{\partial E}{\partial a_k} = -\frac{\partial }{\partial a_k} \sum_g (t_g \cdot \ln(y(a_g)) +(1-t_g)\cdot \ln(1-y(a_g))) $$

$$ = -\sum_g \left[t_g \frac{y'(a_g)\delta_{gk}}{y(a_g)} + (1-t_g)\frac{-y'(a_g)\delta_{gk}}{1-y(a_g)}\right] $$

where we have used the chain rule and $\frac{\partial a_g}{\partial a_k} = \delta_{gk}$, with $\delta$ the Kronecker delta. Evaluating the sum is now simple:

$$ \frac{\partial E}{\partial a_k} = - t_k\frac{y'(a_k)}{y(a_k)} + (1-t_k)\frac{y'(a_k)}{1-y(a_k)} $$

From here on I will drop the $k$ subscript to reduce clutter. Now we can plug in the property you gave for the sigmoid:

$$y'=y(1-y)$$

$$\implies \frac{\partial E}{\partial a} = -t \frac{y'}{y} + (1-t)\frac{y'}{1-y}$$ $$ = -t \frac{y(1-y)}{y}+(1-t)\frac{y(1-y)}{1-y}$$ $$ = -t(1-y) + (1-t)y = y - t$$

Finally, reintroducing the subscripts, we have

$$\frac{\partial E}{\partial a_k} = y(a_k) - t_k$$
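Though not part of Bishop's exercise, this result is easy to sanity-check numerically. The sketch below (plain Python; the test values for $a$ and $t$ are arbitrary, chosen just for illustration) compares a central finite difference of $E$ against the analytic gradient $y(a_k) - t_k$:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(a_vals, t_vals):
    # E = -sum_g [ t_g ln y(a_g) + (1 - t_g) ln(1 - y(a_g)) ]
    return -sum(t * math.log(sigmoid(a)) + (1 - t) * math.log(1 - sigmoid(a))
                for a, t in zip(a_vals, t_vals))

# Arbitrary test point (hypothetical values, not from the problem).
a_vals = [0.3, -1.2, 2.0]
t_vals = [1.0, 0.0, 1.0]
h = 1e-6

for k in range(len(a_vals)):
    a_plus = list(a_vals);  a_plus[k] += h
    a_minus = list(a_vals); a_minus[k] -= h
    # Central finite difference of E with respect to a_k.
    numeric = (cross_entropy(a_plus, t_vals) - cross_entropy(a_minus, t_vals)) / (2 * h)
    # Analytic gradient from the derivation above.
    analytic = sigmoid(a_vals[k]) - t_vals[k]
    assert abs(numeric - analytic) < 1e-6
```

Each component of the numerical gradient matches $y(a_k) - t_k$ to within the finite-difference error.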


$\def\c#1{\color{red}{#1}}\def\o{{\tt1}}\def\d{{\rm diag}}\def\D{{\rm Diag}}\def\E{{\cal E}}\def\p{{\partial}}\def\grad#1#2{\frac{\p #1}{\p #2}}\def\hess#1#2#3{\frac{\p^2 #1}{\p #2\,\p #3^T}}$Drop the subscripts and apply the logistic function elementwise to the vector argument $a$ to generate the vector $y,\;$ i.e. $$\eqalign{ y &= \sigma(a) \quad&\implies\quad dy = (y-y\odot y)\odot da \\ Y &= \D(y) \quad&\implies\quad dy = (Y-Y^2)\,da \\ }$$ where the Hadamard product $(\odot)$ can be replaced by multiplication by a diagonal matrix.

The same technique can be applied to the elementwise log functions $$\eqalign{ w &= \log(y)\quad&\implies\quad dw = \frac{dy}{y} = Y^{-1}dy \\ x &= \log(\o-y)\quad&\implies\quad dx = \frac{dy}{y-\o} = (Y-I)^{-1}dy \\ }$$

It will be convenient to use a colon to denote the trace/Frobenius product, i.e. $$\eqalign{ A:B &= \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ij} \;=\; {\rm Tr}(AB^T) \\ }$$ since this allows the cost function to be written without an explicit summation symbol.
This purely matrix form is easy to differentiate. $$\eqalign{ \E &= (t-\o):x - t:w \\ d\E &= (t-\o):dx - t:dw \\ &= (t-\o):(Y-I)^{-1}\c{dy} \;-\; t:Y^{-1}\c{dy} \\ &= (\o-t):(I-Y)^{-1}\c{(Y-Y^2)\,da} \;-\; t:Y^{-1}\c{(Y-Y^2)\,da} \\ &= (\o-t):Y\,da \;-\; t:(I-Y)\,da \\ &= Y(\o-t):da \;+\; (Y-I)t:\,da \\ &= (Y\o-It):da \\ &= (y-t):da \\ \grad{\E}{a} &= (y-t) \\ }$$
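This vector result can be checked numerically too. The sketch below (plain Python; the vectors $a$, $t$ and the small perturbation $da$ are made-up test values) verifies that the first-order change in the cost matches the Frobenius product $(y-t):da$:

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cost(a_vec, t_vec):
    # E = (t - 1):x - t:w  with  w = log(y),  x = log(1 - y),  y = sigmoid(a)
    y = [sigmoid(a) for a in a_vec]
    return sum((t - 1.0) * math.log(1.0 - yi) - t * math.log(yi)
               for yi, t in zip(y, t_vec))

# Arbitrary test vectors and a small perturbation da (hypothetical values).
random.seed(0)
a_vec = [random.uniform(-2, 2) for _ in range(5)]
t_vec = [random.choice([0.0, 1.0]) for _ in range(5)]
da = [random.uniform(-1, 1) * 1e-6 for _ in range(5)]

# Actual change in the cost versus the predicted first-order change (y - t):da.
dE_numeric = cost([a + d for a, d in zip(a_vec, da)], t_vec) - cost(a_vec, t_vec)
dE_analytic = sum((sigmoid(a) - t) * d for a, t, d in zip(a_vec, t_vec, da))
assert abs(dE_numeric - dE_analytic) < 1e-9
```

The two quantities agree to within the second-order error in $da$, consistent with $\frac{\partial \E}{\partial a} = y - t$.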