Differentiating the Cross-Entropy Error in Bishop's Machine Learning, Chapter on Neural Networks


I have been trying to solve this problem from Bishop's Machine Learning, Chapter 5, for the past few hours, but I am confused about how to show the identity below. I know I have to take a partial derivative with respect to $a_k$, but I don't know how.

Show that the derivative of the error function $$E\left(w\right)=-\sum _{g=1}^G \left(t_g \ln\left(y\left(a_g\right)\right)+\left(1-t_g\right) \ln\left(1-y\left(a_g\right)\right)\right) $$ with respect to $a_g$, for an output with logistic sigmoid activation function $y\left(a_g\right)=\sigma\left(a_g\right)$, satisfies

$$ \frac{\partial E}{\partial a_g}=y\left(a_g\right)-t_{g}, \qquad \text{given that} \quad \frac{\partial \sigma\left(a_g\right)}{\partial a_g}=\sigma\left(a_g\right)\left(1-\sigma\left(a_g\right)\right). $$


First, differentiate the equation for $E$ with respect to $a_k$ (treating $t_g$ as a constant): $$\frac{\partial E}{\partial a_k} = -\frac{\partial }{\partial a_k} \sum_g (t_g \cdot \ln(y(a_g)) +(1-t_g)\cdot \ln(1-y(a_g))) $$

$$ = -\sum_g \left[t_g \frac{y'(a_g)\delta_{gk}}{y(a_g)} + (1-t_g)\frac{-y'(a_g)\delta_{gk}}{1-y(a_g)}\right] $$

where we have used the chain rule and $\frac{\partial a_g}{\partial a_k} = \delta_{gk}$, with $\delta$ the Kronecker delta. Evaluating the sum is now simple:

$$ \frac{\partial E}{\partial a_k} = - t_k\frac{y'(a_k)}{y(a_k)} + (1-t_k)\frac{y'(a_k)}{1-y(a_k)} $$

From here on I will drop the $k$ subscript to reduce clutter. Now we can plug in the property you gave for the sigmoid:

$$y'=y(1-y)$$

$$\implies \frac{\partial E}{\partial a} = -t \frac{y'}{y} + (1-t)\frac{y'}{1-y}$$ $$ = -t \frac{y(1-y)}{y}+(1-t)\frac{y(1-y)}{1-y}$$ $$ = -t(1-y) + (1-t)y = y - t$$

Finally, reintroducing the subscripts, we have

$$\frac{\partial E}{\partial a_k} = y(a_k) - t_k$$
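Though not part of Bishop's exercise, this result is easy to sanity-check numerically. The sketch below (plain Python; the test values for $a$ and $t$ are arbitrary, chosen just for illustration) compares a central finite difference of $E$ against the analytic gradient $y(a_k) - t_k$:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(a_vals, t_vals):
    # E = -sum_g [ t_g ln y(a_g) + (1 - t_g) ln(1 - y(a_g)) ]
    return -sum(t * math.log(sigmoid(a)) + (1 - t) * math.log(1 - sigmoid(a))
                for a, t in zip(a_vals, t_vals))

# Arbitrary test point (hypothetical values, not from the problem).
a_vals = [0.3, -1.2, 2.0]
t_vals = [1.0, 0.0, 1.0]
h = 1e-6

for k in range(len(a_vals)):
    a_plus = list(a_vals);  a_plus[k] += h
    a_minus = list(a_vals); a_minus[k] -= h
    # Central finite difference of E with respect to a_k.
    numeric = (cross_entropy(a_plus, t_vals) - cross_entropy(a_minus, t_vals)) / (2 * h)
    # Analytic gradient from the derivation above.
    analytic = sigmoid(a_vals[k]) - t_vals[k]
    assert abs(numeric - analytic) < 1e-6
```

Each component of the numerical gradient matches $y(a_k) - t_k$ to within the finite-difference error.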


$\def\c#1{\color{red}{#1}}\def\o{{\tt1}}\def\d{{\rm diag}}\def\D{{\rm Diag}}\def\E{{\cal E}}\def\p{{\partial}}\def\grad#1#2{\frac{\p #1}{\p #2}}\def\hess#1#2#3{\frac{\p^2 #1}{\p #2\,\p #3^T}}$Drop the subscripts and apply the logistic function elementwise to the vector argument $a$ to generate the vector $y,\;$ i.e. $$\eqalign{ y &= \sigma(a) \quad&\implies\quad dy = (y-y\odot y)\odot da \\ Y &= \D(y) \quad&\implies\quad dy = (Y-Y^2)\,da \\ }$$ where the Hadamard product $(\odot)$ can be replaced by multiplication by a diagonal matrix.

The same technique can be applied to the elementwise log functions $$\eqalign{ w &= \log(y)\quad&\implies\quad dw = \frac{dy}{y} = Y^{-1}dy \\ x &= \log(\o-y)\quad&\implies\quad dx = \frac{dy}{y-\o} = (Y-I)^{-1}dy \\ }$$

It will be convenient to use a colon to denote the trace/Frobenius product, i.e. $$\eqalign{ A:B &= \sum_{i=1}^m \sum_{j=1}^n A_{ij} B_{ij} \;=\; {\rm Tr}(AB^T) \\ }$$ since this allows the cost function to be written without an explicit summation symbol.
This purely matrix form is easy to differentiate. $$\eqalign{ \E &= (t-\o):x - t:w \\ d\E &= (t-\o):dx - t:dw \\ &= (t-\o):(Y-I)^{-1}\c{dy} \;-\; t:Y^{-1}\c{dy} \\ &= (\o-t):(I-Y)^{-1}\c{(Y-Y^2)\,da} \;-\; t:Y^{-1}\c{(Y-Y^2)\,da} \\ &= (\o-t):Y\,da \;-\; t:(I-Y)\,da \\ &= Y(\o-t):da \;+\; (Y-I)t:\,da \\ &= (Y\o-It):da \\ &= (y-t):da \\ \grad{\E}{a} &= (y-t) \\ }$$
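This vector result can be checked numerically too. The sketch below (plain Python; the vectors $a$, $t$ and the small perturbation $da$ are made-up test values) verifies that the first-order change in the cost matches the Frobenius product $(y-t):da$:

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cost(a_vec, t_vec):
    # E = (t - 1):x - t:w  with  w = log(y),  x = log(1 - y),  y = sigmoid(a)
    y = [sigmoid(a) for a in a_vec]
    return sum((t - 1.0) * math.log(1.0 - yi) - t * math.log(yi)
               for yi, t in zip(y, t_vec))

# Arbitrary test vectors and a small perturbation da (hypothetical values).
random.seed(0)
a_vec = [random.uniform(-2, 2) for _ in range(5)]
t_vec = [random.choice([0.0, 1.0]) for _ in range(5)]
da = [random.uniform(-1, 1) * 1e-6 for _ in range(5)]

# Actual change in the cost versus the predicted first-order change (y - t):da.
dE_numeric = cost([a + d for a, d in zip(a_vec, da)], t_vec) - cost(a_vec, t_vec)
dE_analytic = sum((sigmoid(a) - t) * d for a, t, d in zip(a_vec, t_vec, da))
assert abs(dE_numeric - dE_analytic) < 1e-9
```

The two quantities agree to within the second-order error in $da$, consistent with $\frac{\partial \E}{\partial a} = y - t$.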