Backpropagation through time - Gradient Calculation


I think I got it right after reading multiple resources:

$W_h$ is the hidden (recurrent) weight matrix

$W_i$ is the input weight matrix

$W_o$ is the output weight matrix

The simplest one should be the one for $W_o$: $$ \frac{ \partial E } { \partial \mathbf{W}_o} = \frac{1}{T} \sum^T_{t=1} \frac{ \partial L(\hat{\mathbf{y}}^{<t>},\mathbf{y}^{<t>}) } { \partial \hat{\mathbf{y}}^{<t>}} \frac{ \partial \hat{\mathbf{y}}^{<t>}} { \partial \mathbf{W}_o} $$

For $W_h$ I got this: $$ \displaylines{ \frac{ \partial E } { \partial \mathbf{W}_h } = \frac{ 1 }{ T } \sum^T_{t=1} \frac{ \partial L(\hat{\mathbf{y}}^{<t>},\mathbf{y}^{<t>}) } { \partial \mathbf{W}_h } \\= \frac{ 1 }{ T } \sum^T_{t=1} \frac{ \partial L(\hat{\mathbf{y}}^{<t>},\mathbf{y}^{<t>}) } { \partial \hat{\mathbf{y}}^{<t>} } \frac{ \partial \hat{\mathbf{y}}^{<t>} } { \partial \mathbf{h}^{<t>} } \frac{ \partial \mathbf{h}^{<t>} } { \partial \mathbf{W}_h } \\= \frac{ 1 }{ T } \sum^T_{t=1} \frac{ \partial L(\hat{\mathbf{y}}^{<t>},\mathbf{y}^{<t>}) } { \partial \hat{\mathbf{y}}^{<t>} } \frac{ \partial \hat{\mathbf{y}}^{<t>} } { \partial \mathbf{h}^{<t>} } \sum_{k=0}^t \left( \left( \prod_{s=k}^{t-1} \frac{ \partial \mathbf{h}^{<s+1>} } { \partial \mathbf{h}^{<s>} } \right) \frac{ \partial \mathbf{h}^{<k>} } { \partial\mathbf{W}_h} \right)}$$
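To convince myself the indices line up, here is a minimal NumPy sketch of BPTT. The concrete architecture is only an assumption for the check (none of it is fixed above): $\mathbf{h}^{<t>} = \tanh(\mathbf{W}_i \mathbf{x}^{<t>} + \mathbf{W}_h \mathbf{h}^{<t-1>})$, $\hat{\mathbf{y}}^{<t>} = \operatorname{softmax}(\mathbf{W}_o \mathbf{h}^{<t>})$, with cross-entropy loss averaged over $T$. The inner loop over `k` accumulates the product of Jacobians $\prod_{s=k}^{t-1} \partial \mathbf{h}^{<s+1>} / \partial \mathbf{h}^{<s>}$ from the formula above (indices are 0-based in code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bptt(xs, ys, W_i, W_h, W_o):
    """BPTT for h_t = tanh(W_i x_t + W_h h_{t-1}), y_hat_t = softmax(W_o h_t),
    with E = (1/T) * sum_t CE(y_hat_t, y_t)."""
    T, H = len(xs), W_h.shape[0]
    hs = {-1: np.zeros(H)}   # h^{<-1>} = 0, so the k = 0 term contributes nothing extra
    y_hats = {}
    for t in range(T):       # forward pass, caching hidden states
        hs[t] = np.tanh(W_i @ xs[t] + W_h @ hs[t - 1])
        y_hats[t] = softmax(W_o @ hs[t])
    dW_i, dW_h, dW_o = (np.zeros_like(W) for W in (W_i, W_h, W_o))
    for t in range(T):       # outer sum over t
        dz = (y_hats[t] - ys[t]) / T          # dE/d(logits_t) for softmax + CE
        dW_o += np.outer(dz, hs[t])           # the W_o gradient above
        dh = W_o.T @ dz                       # dL_t/dh_t
        for k in range(t, -1, -1):            # inner sum over k = t .. 0
            da = (1 - hs[k] ** 2) * dh        # back through tanh at step k
            dW_i += np.outer(da, xs[k])       # immediate dh_k/dW_i term
            dW_h += np.outer(da, hs[k - 1])   # immediate dh_k/dW_h term
            dh = W_h.T @ da                   # one factor dh_k/dh_{k-1}
    return dW_i, dW_h, dW_o
```

The explicit double loop mirrors the double sum over $t$ and $k$ literally (so it is $O(T^2)$); practical implementations instead accumulate `dh` across time steps in a single backward pass.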

For $W_i$ it would be the same as for $W_h$, except that the immediate derivative at step $k$ is taken with respect to $W_i$: $$ \displaylines{ \frac{ \partial E } { \partial \mathbf{W}_i} = \frac{ 1 }{ T } \sum^T_{t=1} \frac{ \partial L(\hat{\mathbf{y}}^{<t>},\mathbf{y}^{<t>}) } { \partial \hat{\mathbf{y}}^{<t>} } \frac{ \partial \hat{\mathbf{y}}^{<t>} } { \partial \mathbf{h}^{<t>} } \sum_{k=0}^t \left( \left( \prod_{s=k}^{t-1} \frac{ \partial \mathbf{h}^{<s+1>} } { \partial \mathbf{h}^{<s>} } \right) \frac{ \partial \mathbf{h}^{<k>} } { \partial \mathbf{W}_i } \right)} $$
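Continuing the sketch above, a finite-difference check on one entry of $\partial E / \partial \mathbf{W}_h$ should confirm the derivation (the toy sizes here are arbitrary assumptions):

```python
rng = np.random.default_rng(0)
D, H, C, T = 3, 4, 2, 5                       # toy input/hidden/output/time sizes
W_i = rng.normal(scale=0.1, size=(H, D))
W_h = rng.normal(scale=0.1, size=(H, H))
W_o = rng.normal(scale=0.1, size=(C, H))
xs = [rng.normal(size=D) for _ in range(T)]
ys = [np.eye(C)[rng.integers(C)] for _ in range(T)]  # one-hot targets

def loss(W_i, W_h, W_o):                      # E, recomputed from scratch
    h, total = np.zeros(H), 0.0
    for t in range(T):
        h = np.tanh(W_i @ xs[t] + W_h @ h)
        total -= ys[t] @ np.log(softmax(W_o @ h))
    return total / T

dW_i, dW_h, dW_o = bptt(xs, ys, W_i, W_h, W_o)
eps = 1e-6                                    # central difference on W_h[0, 1]
Wp, Wm = W_h.copy(), W_h.copy()
Wp[0, 1] += eps
Wm[0, 1] -= eps
numeric = (loss(W_i, Wp, W_o) - loss(W_i, Wm, W_o)) / (2 * eps)
print(numeric, dW_h[0, 1])                    # should agree to several decimal places
```

The same check applied to entries of `dW_i` and `dW_o` exercises the other two formulas.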

Here is a diagram of my reasoning for reference: https://i.stack.imgur.com/D0TKh.png