Finding partial derivative of a matrix


[Figure: diagram of the recurrent neural network; image not available]

In the image replace s with h

$A$: $n\times{h}$ matrix of hidden layer inputs

$X$: $n\times{d}$ matrix of data inputs of dimensionality $d$

$U$: $d\times{h}$ matrix

$H$: $n\times{h}$ matrix of hidden layer outputs for each batch, at each timestep

$W$: $h\times{h}$ matrix

$V$: $h\times{v}$ matrix

$b$: $1\times{h}$ bias term

In the recurrent neural network, the input to the hidden layer at a given timestep is defined as:

$$ {A}_{n\times h}=X_{n\times{d}}U_{d\times{h}} + H_{n\times{h}}W_{h\times{h}} + \bf b $$

Here is the same equation but with timestep notation for clarity:

$$ {A}^{(t)}=X^{(t)}\,U + H^{(t-1)}W + \bf b $$

The hidden layer output at each timestep is the hyperbolic tangent activation of its input:

$$H^{(t)}=\tanh(A^{(t)})$$
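As a quick sanity check on the shapes, here is a minimal sketch of one recurrent step in plain Python. The helper names `matmul` and `rnn_step` and the toy sizes are assumptions made for illustration, not part of the original setup.

```python
import math

def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def rnn_step(X_t, H_prev, U, W, b):
    """One recurrent step: A = X U + H_prev W + b, then H = tanh(A)."""
    XU = matmul(X_t, U)       # (n x d)(d x h) -> n x h
    HW = matmul(H_prev, W)    # (n x h)(h x h) -> n x h
    # b is a 1 x h row that is added to each of the n rows
    A = [[xu + hw + bj for xu, hw, bj in zip(r1, r2, b[0])]
         for r1, r2 in zip(XU, HW)]
    H = [[math.tanh(a) for a in row] for row in A]
    return A, H

# Toy sizes (assumed): n = 2, d = 3, h = 2
X_t = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]
H_prev = [[0.0, 0.0], [0.0, 0.0]]
U = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [[0.0, 0.1]]

A, H = rnn_step(X_t, H_prev, U, W, b)
print(len(A), len(A[0]))  # rows and columns of A
```

Both `A` and `H` come out as $n\times h$, matching the definitions at the top of the question.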

This output is used to calculate the output $ o^{(t)} $ for each of the hidden states $ h_i^{(t)} $, and it is also passed on to the next hidden state in the time series (this can be pictured as passing to the state on the right in the figure).

$$ o^{(t)} = H^{(t)}V + c $$

$$ \hat{y}_j^{(t)} = \mathrm{softmax}(o^{(t)})_j = \frac{\exp(o_j^{(t)})}{\sum_{k} \exp(o_k^{(t)})} $$
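For reference, softmax exponentiates each score and divides by the sum of the exponentials, so the outputs are positive and sum to 1. A minimal sketch for a single row of scores follows; subtracting the maximum is a common stability trick that I am assuming here, and it does not change the result.

```python
import math

def softmax(o):
    """Softmax of one row of scores; subtracting max(o) avoids overflow."""
    m = max(o)
    exps = [math.exp(x - m) for x in o]
    total = sum(exps)
    return [e / total for e in exps]

y_hat = softmax([2.0, 1.0, 0.1])
print(sum(y_hat))  # close to 1 up to rounding
```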

$L$ is the cross-entropy loss function, $y$ is the correct output vector, and $\hat{y}$ is the predicted one:

$$ L = - \frac{1}{T} \sum_{t=1}^{T}\sum_{j=1}^{v} y_{tj} \log(\hat{y}_{tj}) $$
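In the usual convention for cross-entropy, the logarithm is applied to the predicted probabilities and weighted by the correct labels. A small sketch under that convention, with one-hot targets and made-up predictions (the function name `cross_entropy` and the data are assumptions for illustration):

```python
import math

def cross_entropy(Y, Y_hat):
    """Mean over T timesteps of -sum_j y_tj * log(y_hat_tj)."""
    T = len(Y)
    total = 0.0
    for y_t, y_hat_t in zip(Y, Y_hat):
        # Skip zero targets so log is only evaluated where it contributes
        total += sum(y * math.log(p) for y, p in zip(y_t, y_hat_t) if y > 0)
    return -total / T

Y = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]        # correct outputs, T = 2, v = 3
Y_hat = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]    # predicted probabilities
loss = cross_entropy(Y, Y_hat)
print(loss)
```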

During backpropagation we need to calculate the derivative of the loss w.r.t. the parameters $U$, $W$, $V$, $b$, $c$.

Given $ \frac{\partial{L}}{\partial{A}} $ of size $ n\times{h} $, the derivative of $L$ w.r.t. the parameter $U$ is:

$$ \frac{\partial{L}}{\partial{U}} = X^{T} \frac{\partial{L}}{\partial{A}} $$
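The formula can at least be checked numerically before deriving it. The sketch below uses a toy loss $L = \tfrac12\sum_{ij} A_{ij}^2$ (my assumption, chosen because then $\partial L/\partial A = A$), holds the $HW + b$ term fixed as a constant matrix `K`, and compares $X^T \frac{\partial L}{\partial A}$ with central finite differences. All names and values are hypothetical.

```python
def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(P):
    return [list(col) for col in zip(*P)]

def loss(U, X, K):
    """Toy loss L = 0.5 * sum(A**2) with A = X U + K (K stands in for H W + b)."""
    A = [[a + k for a, k in zip(ra, rk)] for ra, rk in zip(matmul(X, U), K)]
    return 0.5 * sum(a * a for row in A for a in row), A

X = [[1.0, 2.0, -1.0], [0.5, 0.0, 3.0]]       # n x d = 2 x 3
U = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]    # d x h = 3 x 2
K = [[0.3, -0.1], [0.0, 0.2]]                 # n x h, held fixed

_, A = loss(U, X, K)
G = A                                          # dL/dA for this toy loss
analytic = matmul(transpose(X), G)             # X^T dL/dA, a d x h matrix

# Central finite differences, one entry of U at a time
eps = 1e-6
numeric = [[0.0] * len(U[0]) for _ in U]
for i in range(len(U)):
    for j in range(len(U[0])):
        U[i][j] += eps
        L_plus, _ = loss(U, X, K)
        U[i][j] -= 2 * eps
        L_minus, _ = loss(U, X, K)
        U[i][j] += eps                         # restore the original entry
        numeric[i][j] = (L_plus - L_minus) / (2 * eps)

max_err = max(abs(a - n)
              for ra, rn in zip(analytic, numeric)
              for a, n in zip(ra, rn))
print(max_err)
```

The two gradients agree up to rounding error, and the result has the same $d\times h$ shape as $U$.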

How do I derive this result? What rules and operators do I need to know to produce it?


There is 1 solution below


To reduce unnecessary clutter, drop the $t$-superscripts and use lowercase letters for vectors and uppercase letters for matrices. Then write the differential of $L$ in terms of $da$ and perform a change of variables to $dU$. With the other terms held fixed, $a = xU + hW + b$ gives $da = x\,dU$, so

$$\eqalign{ dL &= \frac{\partial L}{\partial a}:da \cr &= \frac{\partial L}{\partial a}:x\,dU \cr &= x^T\Big(\frac{\partial L}{\partial a}\Big):dU \cr \frac{\partial L}{\partial U} &= x^T\Big(\frac{\partial L}{\partial a}\Big) \cr }$$

The third line moves $x$ to the other side of the product using the identity $A:BC = B^TA:C$, which follows from the trace definition of the colon product. Since $(x,a)$ are row vectors, the resulting gradient is a $d\times h$ matrix.

NB: A colon denotes the trace/Frobenius product, i.e. $$A:B = {\rm Tr}(A^TB)$$
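The same steps carry over to the full batch in the question, where $X$ and $A$ are matrices rather than row vectors. Here is a sketch, writing $G = \frac{\partial L}{\partial A}$ (a shorthand introduced only for this note) and treating $H$, $W$ and $b$ as fixed within the timestep:

$$\eqalign{ A &= XU + HW + b \cr dA &= X\,dU \cr dL &= G:dA \cr &= G:X\,dU \cr &= {\rm Tr}(G^TX\,dU) \cr &= {\rm Tr}\big((X^TG)^T\,dU\big) \cr &= X^TG:dU \cr \frac{\partial L}{\partial U} &= X^TG = X^T\frac{\partial L}{\partial A} \cr }$$

The shapes work out as $(d\times n)(n\times h) = d\times h$, the same as $U$. Because $U$ is shared across timesteps, the gradient for the whole sequence is the sum of these per-timestep terms.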