Equation derivation in a book for backpropagation in vector-matrix form (Witten 2016)


I am going through *Data Mining* by Ian Witten and I am having trouble with one step of a derivation (p. 427).

Here it is:

$$\frac{\partial}{\partial a_j}(L) = \frac{\partial}{\partial a_j} \Biggl[-\sum_{k=1}^K y_k\biggl( a_k - \log \sum_{c=1}^{K}\exp(a_c) \biggr)\Biggr]$$ $$= -\biggl( y_{k=j} - \frac{\exp(a_{k=j})}{\sum_{c=1}^{K}\exp(a_c)} \biggr)$$

Whereas I would write something like

$$ - \sum_{k=1}^K y_k\biggl( \Bbb 1_{k=j} - \frac{\exp(a_{j})}{\sum_{c=1}^{K}\exp(a_c)} \biggr) $$ where $\Bbb 1_{k=j} = 1$ if $k = j$ and $0$ otherwise.

Here $L$, $a_k$, and $y_k$ are scalars, with $j$ and $k$ ranging over $1,\dots,K$ for some $K$.

I went through this calculation many times and I still can't see how the author got that result.


Best answer:

Let's use the convention where uppercase Latin letters are matrices, lowercase Latin letters are vectors, and Greek letters are scalars.

Define some auxiliary variables and their differentials $$\eqalign{ p &= \exp(a) &\implies dp = p\circ da \cr \theta &= 1:p &\implies d\theta = 1:dp \cr \beta &= \log(\theta) &\implies d\beta = \theta^{-1}d\theta \cr }$$ where functions are applied element-wise, $1$ is a vector with all elements equal to unity, and the symbols $(\circ)/(:)$ denote the Hadamard/Frobenius products respectively, e.g. $$\eqalign{ &c = a\circ b &\implies c_k = a_kb_k \cr &\lambda = a:b &= \sum_k a_kb_k \cr &x:(y\circ z) &= (x\circ y):z \cr\cr }$$
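These product identities are easy to sanity-check numerically. Here is a small NumPy sketch (my own, not from the book or the answer) verifying $x:(y\circ z) = (x\circ y):z$ on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 5))  # three random 5-vectors

# Frobenius product a:b = sum_k a_k b_k; Hadamard product is elementwise *
def frob(a, b):
    return np.sum(a * b)

lhs = frob(x, y * z)   # x : (y ∘ z)
rhs = frob(x * y, z)   # (x ∘ y) : z
print(np.isclose(lhs, rhs))  # the two Frobenius products agree
```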

Now take the differential of the cost function and back-substitute until everything is in terms of $da$. $$\eqalign{ L &= y:(\beta 1-a) \cr\cr dL &= (y:1)d\beta-y:da \cr &= d\beta - y:da \cr &= \theta^{-1}d\theta - y:da \cr &= \theta^{-1}1:dp - y:da \cr &= \theta^{-1}1:(p\circ da) - y:da \cr &= \theta^{-1}(p\circ 1):da - y:da \cr &= (\theta^{-1}p - y):da \cr\cr \frac{\partial L}{\partial a} &= \theta^{-1}p - y \cr &= \frac{\exp(a)}{1:\exp(a)} - y \cr\cr }$$ The second step uses $y:1 = 1$, i.e. the components of $y$ sum to one. If you are only interested in the $j$-th component of the gradient, then multiply this result by the corresponding basis vector $e_j$ $$\eqalign{ e_j^T\,\bigg(\frac{\partial L}{\partial a}\bigg) &= \frac{\partial L}{\partial a_j} \cr &= \theta^{-1}p_j - y_j \cr }$$
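The final result can also be checked against finite differences. The sketch below (my own variable names, assuming targets $y$ that sum to one) compares the derived gradient $\exp(a)/(1:\exp(a)) - y$ with a numerical gradient of $L$:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
a = rng.normal(size=K)
y = rng.dirichlet(np.ones(K))  # random targets summing to 1 (the key assumption)

def L(a):
    # L = -sum_k y_k * (a_k - log sum_c exp(a_c))
    return -np.sum(y * (a - np.log(np.sum(np.exp(a)))))

softmax = np.exp(a) / np.sum(np.exp(a))
analytic = softmax - y  # the derived gradient: exp(a)/(1:exp(a)) - y

# central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([(L(a + eps * e) - L(a - eps * e)) / (2 * eps)
                    for e in np.eye(K)])
print(np.allclose(analytic, numeric, atol=1e-6))
```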

Second answer (from the asker):

OK, a little more digging would have gotten me to the solution...

Here all the $y_k$'s sum to 1 because they are the elements of a "multinomial" vector (that is what it is called in machine learning, although I don't see how it relates to the multinomial distribution; I'd rather call it "multinoulli", by analogy with Bernoulli).

So $$ - \sum_{k=1}^K y_k\biggl( \Bbb 1_{k=j} - \frac{\exp(a_{j})}{\sum_{c=1}^{K}\exp(a_c)} \biggr) = \Biggl(\sum_{k=1}^K y_k \cdot \frac{\exp(a_{j})}{\sum_{c=1}^{K}\exp(a_c)} \Biggr) - y_j = -\biggl( y_{k=j} - \frac{\exp(a_{k=j})}{\sum_{c=1}^{K}\exp(a_c)} \biggr)$$
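A quick numerical check of this identity (my own sketch; `j` is an arbitrary index and $y$ is drawn so that it sums to one):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5
a = rng.normal(size=K)
y = rng.dirichlet(np.ones(K))        # sums to 1, the fact that resolves the question
p = np.exp(a) / np.sum(np.exp(a))    # softmax probabilities
j = 2                                 # arbitrary component

# the sum-with-indicator form: -sum_k y_k * (1_{k=j} - p_j)
lhs = -np.sum(y * ((np.arange(K) == j).astype(float) - p[j]))
# the book's compact form: -(y_j - p_j)
rhs = -(y[j] - p[j])
print(np.isclose(lhs, rhs))
```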