Matrix calculus for RNN equations


In the Deep Learning book (Goodfellow et al., chapter 10), the standard RNN is defined by a set of update equations, and the book then derives various gradients, including the one with respect to the recurrent weight matrix $W$.
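For reference, I believe the forward equations and the $W$ gradient in question are the following (reproduced from memory, so the exact form and numbering in the book may differ slightly):

$$
\begin{aligned}
a^{(t)} &= b + W h^{(t-1)} + U x^{(t)},\\
h^{(t)} &= \tanh\!\left(a^{(t)}\right),\\
\nabla_W L &= \sum_t \operatorname{diag}\!\left(1 - \big(h^{(t)}\big)^2\right)\left(\nabla_{h^{(t)}} L\right) {h^{(t-1)}}^{\top}.
\end{aligned}
$$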

I understand that:

  • $1 - {(h^{(t)})}^2 $ is coming from the derivative of $tanh$
  • $h^{(t-1)}$ is coming from the chain rule

What I don't understand:

  • where is the $diag$ coming from?
  • where is the transpose on $h^{(t-1)}$ coming from?
  • why are the factors in this particular order (apart from the fact that this way the dimensions match)? It feels like the gradient of $L$ somehow ended up between the two parts of the derivative of $h^{(t)}$.
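To make the question concrete, here is a small NumPy sketch of a single time step, under my assumption that $h^{(t)} = \tanh(b + W h^{(t-1)} + U x^{(t)})$ and a toy squared-error loss (the loss and all shapes are made up for illustration). It computes the per-step gradient with the $\operatorname{diag}(1 - h^2)\,(\nabla_h L)\,h_{prev}^{\top}$ formula and checks it against finite differences, which at least confirms the formula and the ordering are correct, even though I don't see where they come from:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                          # hidden size, input size (arbitrary)
W = rng.standard_normal((n, n))      # recurrent weights
U = rng.standard_normal((n, m))      # input weights
b = rng.standard_normal(n)
h_prev = rng.standard_normal(n)      # h^{(t-1)}
x = rng.standard_normal(m)           # x^{(t)}
target = rng.standard_normal(n)

def forward(W):
    """One RNN step plus a toy squared-error loss on h^{(t)}."""
    h = np.tanh(b + W @ h_prev + U @ x)
    return h, 0.5 * np.sum((h - target) ** 2)

h, L = forward(W)
dL_dh = h - target                   # ∇_{h^{(t)}} L for this toy loss

# The formula in question: diag(1 - h^2) (∇_h L) h_prev^T
grad_formula = np.diag(1 - h ** 2) @ np.outer(dL_dh, h_prev)

# Finite-difference check, entry by entry
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(n):
    for j in range(n):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad_numeric[i, j] = (forward(Wp)[1] - forward(Wm)[1]) / (2 * eps)

print(np.max(np.abs(grad_formula - grad_numeric)))  # small, close to zero
```

Writing the formula entrywise, $[\nabla_W L]_{ij} = (\nabla_h L)_i\,(1 - h_i^2)\,(h_{prev})_j$, which is exactly what the `diag`-sandwich-outer-product expression produces, but I'd still like to understand why the derivation naturally lands in that matrix form.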

Thank you