Gradient Chain Rule: Applying Gradient in the case of a Series of Matrix operations (Neural Net Gradient Calculation)


I have the following situation: I need to calculate, by hand, the gradient of the error of a CNN a few layers deep. The error function is

$$\operatorname{Error}[readoutX] = -\sum_i \sum_j actualX_{ij} \, \operatorname{log}(readoutX_{ij})$$

So, letting $actualX = \mathbf{a}$ and $readoutX = \mathbf{x}$, I need to take the gradient of the Error function.

Written out more fully, the error function is:

$$ \begin{align} \operatorname{Error}[\mathbf{x}] &= -\sum_i \sum_j a_{ij} \, \operatorname{log}(x_{ij})\\ \mathbf{x} &= \operatorname{softmax}(\mathbf{u}\bullet \mathbf{A})\\ \mathbf{u} &= \mathbf{C} \bullet \mathbf{B}\\ \end{align} $$

where the dimensions of $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are $(1024,10)$, $(1568,1024)$, and $(?,1568)$ respectively ($? =$ number of images in the batch), and the softmax is defined as: $$\sigma(\mathbf{x})_j = \frac{e^{x_j}}{\sum_{k=1}^{|\mathbf{x}|} e^{x_k}} \quad\text{for } j = 1,\ldots,|\mathbf{x}| $$

with the partial derivative with respect to element $x_j$:

$$ \frac{\partial{\sigma(x)_i}}{\partial{x_j}} = \sigma(x)_i (\delta_{ij} - \sigma(x)_j) $$

where $\delta_{ij}$ is the Kronecker delta.
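The softmax Jacobian above can be checked numerically against central finite differences; a minimal NumPy sketch (the test vector `z` is an arbitrary stand-in):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = sigma_i * (delta_ij - sigma_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(z)

# Central finite differences should approximate the same Jacobian.
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

assert np.allclose(J, J_num, atol=1e-8)
```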


Next, I start to write the gradient:

$$ \nabla\operatorname{Error}[\mathbf{x}] = -\sum_i \sum_j \frac{\partial{(a_{ij} \, \operatorname{log}(x_{ij}))}}{\partial{x_{ij}}} = \sum_i\sum_j -a_{ij}\frac{\partial{\operatorname{log}(x_{ij})}}{\partial{x_{ij}}} = \sum_i\sum_j -a_{ij}\frac{1}{x_{ij}}\frac{\partial{x_{ij}}}{\partial{u}} $$
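The first part of this (the direct derivative $\partial E/\partial x_{ij} = -a_{ij}/x_{ij}$, before chaining through the softmax) can be sanity-checked numerically; a minimal sketch with random stand-in matrices for $\mathbf{a}$ and $\mathbf{x}$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=(4, 3))  # stand-in for the readout x
a = rng.uniform(0.0, 1.0, size=(4, 3))  # stand-in for the labels a

def error(x):
    return -np.sum(a * np.log(x))

grad = -a / x  # claimed elementwise gradient dE/dx_ij

# Check one entry by central finite differences.
eps = 1e-6
i, j = 1, 2
xp, xm = x.copy(), x.copy()
xp[i, j] += eps
xm[i, j] -= eps
numeric = (error(xp) - error(xm)) / (2 * eps)
assert abs(numeric - grad[i, j]) < 1e-6
```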

And I am stuck at: $\frac{\partial{x_{ij}}}{\partial{u}}$.


I know that $\frac{\partial{x_{ij}}}{\partial{u}}$ should be a gradient, but I don't know how to proceed (I could just write $\nabla$ in place of the partial and call it correct, but I have no idea whether that is right), and even after reading the Wikipedia article, I am not sure which chain rule applies here (or even whether the first part of my gradient is correct). In the past, I've done this using matrix derivatives, but the summation seems to break that approach down.

So, how do I write this first step, and then transition to the next gradient (and then on to $\mathbf{C}$)?


Best Answer

Let
$$\eqalign{ M &= A^TB^T\otimes I_c \cr c &= {\rm vec}(C) \cr z &= Mc \cr X &= {\rm softmax}(CBA) \cr x &= {\rm vec}(X) = {\rm softmax}(z) \cr y &= {\rm vec}(a) \cr }$$

For our purposes, the most convenient form of the differential of the softmax function is
$$dx = ({\rm Diag}(x) - xx^T)\,dz$$

Write the error function in terms of the Frobenius inner product (denoted by a colon, $A:B = \sum_{ij} A_{ij}B_{ij}$), and find its differential:
$$\eqalign{ E &= -y:\log(x) \cr\cr dE &= -y:d\log(x) \cr &= -y:\frac{dx}{x} \cr &= -\frac{y}{x}:dx \cr &= -\frac{y}{x}:({\rm Diag}(x) - xx^T)\,dz \cr &= (xx^T-{\rm Diag}(x))\,\Big(\frac{y}{x}\Big) : dz \cr &= (xx^T-{\rm Diag}(x))\,\Big(\frac{y}{x}\Big) : M\,dc \cr &= M^T(xx^T-{\rm Diag}(x))\,\Big(\frac{y}{x}\Big) : dc \cr }$$
where $\frac{y}{x}$ denotes element-wise (aka Hadamard) division of two vectors, and the sign flip uses the fact that $({\rm Diag}(x)-xx^T)$ is symmetric, so it can be moved across the inner product.

Since $dE=\big(\frac{\partial E}{\partial c}:dc\big),\,$ looking at that last line, the gradient must be $$\eqalign{ \frac{\partial E}{\partial c} &= M^T(xx^T-{\rm Diag}(x))\,\Big(\frac{y}{x}\Big) \cr }$$
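This final formula can be verified end-to-end on small random matrices by comparing against finite differences of the error in $c$; a minimal sketch, assuming column-stacking ${\rm vec}$ and softmax applied over the whole vectorized product as in the derivation above (all matrix sizes are tiny stand-ins for the real ones):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, r = 2, 5, 4, 3                  # small stand-ins for (?,1568), (1568,1024), (1024,10)
A = rng.normal(size=(q, r))
B = rng.normal(size=(p, q))
C = rng.normal(size=(n, p))
Yt = rng.uniform(0.1, 1.0, size=(n, r))  # stand-in for the target matrix a

def vec(X):
    return X.reshape(-1, order="F")      # column-stacking vec

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def error(Cmat):
    x = softmax(vec(Cmat @ B @ A))
    return -vec(Yt) @ np.log(x)

# Gradient from the formula: dE/dc = M^T (x x^T - Diag(x)) (y / x)
M = np.kron((B @ A).T, np.eye(n))        # vec(C(BA)) = M vec(C)
x = softmax(vec(C @ B @ A))
y = vec(Yt)
g = M.T @ ((np.outer(x, x) - np.diag(x)) @ (y / x))

# Compare against central finite differences in each entry of c = vec(C).
eps = 1e-6
g_num = np.zeros(n * p)
for k in range(n * p):
    d = np.zeros(n * p)
    d[k] = eps
    dC = d.reshape(n, p, order="F")
    g_num[k] = (error(C + dC) - error(C - dC)) / (2 * eps)

assert np.allclose(g, g_num, atol=1e-5)
```

Note the `order="F"` reshapes: the identity ${\rm vec}(PXQ) = (Q^T\otimes P)\,{\rm vec}(X)$ assumes column-stacking, which is Fortran order in NumPy.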