RNN backpropagation proof


Chapter 10 of the Deep Learning book has

$$ \begin{align} a^{(t)} &= b + Wh^{(t-1)} + Ux^{(t)} \\ h^{(t)} &= \tanh(a^{(t)})\\ o^{(t)} &= c + Vh^{(t)}\\ \hat{y}^{(t)} &= \text{softmax}(o^{(t)})\\ \\ L &= \sum_t L^{(t)}\\ &= -\sum_t \log{p_{\text{model}}(y^{(t)}\ |\ x^{(1)},\dots,x^{(t)})} \end{align} $$ where $p_{\text{model}}(y^{(t)}\ |\ x^{(1)},\dots,x^{(t)})$ is given by reading the entry for $y^{(t)}$ from the model's output vector $\hat{y}^{(t)}$.
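These forward equations can be sketched directly in NumPy. This is a minimal illustration, not code from the book; the dimensions and random parameters are hypothetical, and $y^{(t)}$ is represented as a class index so the loss reads the matching entry of $\hat{y}^{(t)}$.

```python
import numpy as np

# Hypothetical sizes: input n_x, hidden n_h, output n_y classes
n_x, n_h, n_y = 3, 4, 5
rng = np.random.default_rng(0)
U = rng.normal(size=(n_h, n_x))   # input-to-hidden weights
W = rng.normal(size=(n_h, n_h))   # hidden-to-hidden weights
V = rng.normal(size=(n_y, n_h))   # hidden-to-output weights
b = np.zeros(n_h)
c = np.zeros(n_y)

def softmax(o):
    e = np.exp(o - o.max())       # shift for numerical stability
    return e / e.sum()

def forward(xs, ys):
    """One pass over a sequence; returns L = sum_t L^(t)."""
    h = np.zeros(n_h)
    L = 0.0
    for x, y in zip(xs, ys):      # y is the class index y^(t)
        a = b + W @ h + U @ x     # a^(t)
        h = np.tanh(a)            # h^(t)
        o = c + V @ h             # o^(t)
        yhat = softmax(o)         # ŷ^(t)
        L += -np.log(yhat[y])     # L^(t) = -log p_model(y^(t) | x^(1..t))
    return L
```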
...
$$ \frac{\partial L}{\partial L^{(t)}}=1\\ (\nabla_{\pmb{o}^{(t)}}L)_i = \frac{\partial L}{\partial L^{(t)}}\frac{\partial L^{(t)}}{\partial o_i^{(t)}} = \hat{y}_i^{(t)} - \pmb{1}_{i=y^{(t)}} $$

I got $\frac{\partial L^{(t)}}{\partial o_i^{(t)}} = \hat{y}_i^{(t)} - y_i^{(t)}$ as shown here. But how do we get the result in the book?

BEST ANSWER

I think both ways of writing it are identical.

In vector form, you can write $$ \frac{\partial L}{\partial \mathbf{o}} = \mathrm{softmax}(\mathbf{o})-\mathbf{y} $$ where the vector $\mathbf{y}$ is zero everywhere except at the position of the true class. For instance, $y_2=1$ if we are dealing with an example from the second class. With this one-hot encoding, the $i$-th component of $\mathbf{y}$ is exactly the indicator $\pmb{1}_{i=y^{(t)}}$, so $\hat{y}_i^{(t)} - y_i^{(t)}$ and $\hat{y}_i^{(t)} - \pmb{1}_{i=y^{(t)}}$ are the same quantity.
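This identity can also be checked numerically: the analytic gradient $\mathrm{softmax}(\mathbf{o})-\mathbf{y}$ should match a central finite-difference estimate of $\partial L^{(t)}/\partial o_i^{(t)}$. A small sketch with made-up values (the vector `o` and the class index are arbitrary):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

rng = np.random.default_rng(1)
o = rng.normal(size=5)               # arbitrary pre-softmax scores o^(t)
y_class = 2                          # true class index y^(t)
y = np.zeros(5)
y[y_class] = 1.0                     # one-hot vector y^(t)

def loss(o):
    # L^(t) = -log p_model(y^(t) | ...) read from softmax(o)
    return -np.log(softmax(o)[y_class])

# analytic gradient from the book / answer above
analytic = softmax(o) - y

# central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (loss(o + eps * np.eye(5)[i]) - loss(o - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])

print(np.max(np.abs(analytic - numeric)))  # tiny discrepancy, on the order of eps**2
```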