Chain rule extravaganza - how to derive this?


I have a very simple algorithm for which I need to calculate the derivative of an error function, and it gets a bit messy with the chain rule.

My question is whether I'm doing this correctly — or, more precisely, why the derivation below is correct. I'm taking it from an Oxford lecture, so I assume it isn't wrong, but I would have derived it differently.

The equations:

We have equation for states: $$s_t = \theta_s\phi(s_{t-1})+\theta_x x_t$$

where $\theta_s$ is the weight associated with the states $s$, $\theta_x$ is the weight associated with our input $x$, and $\phi$ is some differentiable function.

We also have an equation for the output of the machine $$y_t = \theta_y\phi(s_t)$$

Furthermore, we have an error function for each time step:

$E_t = \frac{1}{2}(y_t-x_t)^2$ and a total error function: $E = \sum_{t=1}^{n}E_t$

So it's fairly simple: we have inputs, and we want our outputs to be similar to them. Our outputs depend on the states, and each state is a function of the previous state and the current input.
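To make the setup concrete, here is a minimal sketch of the forward pass. The choices $\phi = \tanh$ and a scalar initial state $s_0 = 0$ are my own assumptions for illustration; the lecture only requires $\phi$ to be differentiable:

```python
import numpy as np

def forward(theta_s, theta_x, theta_y, x, s0=0.0, phi=np.tanh):
    """Run the recurrence and return the outputs and the total error E."""
    s_prev = s0
    outputs = []
    for x_t in x:
        s_t = theta_s * phi(s_prev) + theta_x * x_t  # s_t = theta_s*phi(s_{t-1}) + theta_x*x_t
        outputs.append(theta_y * phi(s_t))           # y_t = theta_y*phi(s_t)
        s_prev = s_t
    # E = sum_t (1/2)(y_t - x_t)^2
    E = 0.5 * sum((y_t - x_t) ** 2 for y_t, x_t in zip(outputs, x))
    return outputs, E
```

For instance, with $\theta_s = 0$, $\theta_x = \theta_y = 1$ and $\phi$ the identity, each output equals its input and the error is zero, as expected.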

The derivative:

I want to find $$\frac{\partial}{\partial \theta_s}E = \sum_{t=1}^{n}\frac{\partial}{\partial \theta_s}E_t$$

From chain rule (according to the lecture), $$\frac{\partial}{\partial \theta_s}E_t = \frac{\partial E_t}{\partial y_t}\frac{\partial y_t}{\partial s_t}\sum_{k=1}^{t}\frac{\partial s_t}{\partial s_k}\frac{\partial s_k}{\partial \theta_s}$$
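One way to convince myself the summed form is at least consistent is to check it numerically against a central finite-difference approximation of $\partial E/\partial\theta_s$. Below is a sketch under my own assumptions ($\phi = \tanh$, scalar weights, arbitrary small inputs); here $\partial s_k/\partial\theta_s$ is read as the "immediate" partial $\phi(s_{k-1})$, with $s_{k-1}$ held fixed:

```python
import numpy as np

phi = np.tanh
dphi = lambda s: 1.0 - np.tanh(s) ** 2          # phi' = 1 - tanh^2
theta_s, theta_x, theta_y = 0.5, 0.3, 0.8       # arbitrary scalar weights
x = [0.1, -0.4, 0.7, 0.2]

def run(ts):
    # Forward pass; s[0] = s_0 = 0, s[t] = s_t, y[t-1] = y_t.
    s, y = [0.0], []
    for x_t in x:
        s.append(ts * phi(s[-1]) + theta_x * x_t)
        y.append(theta_y * phi(s[-1]))
    return s, y

def total_error(ts):
    _, y = run(ts)
    return 0.5 * sum((y_t - x_t) ** 2 for y_t, x_t in zip(y, x))

# Lecture formula:
# dE_t/dtheta_s = (dE_t/dy_t)(dy_t/ds_t) * sum_{k=1}^t (ds_t/ds_k)(ds_k/dtheta_s)
s, y = run(theta_s)
grad = 0.0
for t in range(1, len(x) + 1):
    outer = (y[t - 1] - x[t - 1]) * theta_y * dphi(s[t])
    for k in range(1, t + 1):
        jac = 1.0                               # ds_t/ds_k = prod_{j=k+1}^t theta_s*phi'(s_{j-1})
        for j in range(k + 1, t + 1):
            jac *= theta_s * dphi(s[j - 1])
        grad += outer * jac * phi(s[k - 1])     # immediate ds_k/dtheta_s = phi(s_{k-1})

eps = 1e-6
fd = (total_error(theta_s + eps) - total_error(theta_s - eps)) / (2 * eps)
print(grad, fd)  # the two agree up to finite-difference error
```

This agreement doesn't prove the formula, of course, but it confirms the summed form computes the right total derivative for this instance.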

My question is why this is true.

Why not $$\frac{\partial}{\partial \theta_s}E_t = \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial s_t} \frac{\partial s_1}{\partial \theta_s} \prod_{k=2}^{t}\frac{\partial s_k}{\partial s_{k-1}}$$

It seems like there are many ways to represent this derivative with the chain rule. Are they all equal? Why is the form the professor chose correct, and not the one I proposed?
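For what it's worth, my proposed product form can be checked numerically too. Using a nonzero initial state $s_0$ (my assumption, so that $\partial s_1/\partial\theta_s = \phi(s_0)$ doesn't vanish trivially) and $\phi = \tanh$, it does not match a finite-difference approximation of $\partial E/\partial\theta_s$:

```python
import numpy as np

phi = np.tanh
dphi = lambda s: 1.0 - np.tanh(s) ** 2
theta_s, theta_x, theta_y, s0 = 0.5, 0.3, 0.8, 0.5  # arbitrary scalars, s0 != 0
x = [0.1, -0.4, 0.7, 0.2]

def run(ts):
    # Forward pass; s[0] = s_0, s[t] = s_t, y[t-1] = y_t.
    s, y = [s0], []
    for x_t in x:
        s.append(ts * phi(s[-1]) + theta_x * x_t)
        y.append(theta_y * phi(s[-1]))
    return s, y

def total_error(ts):
    _, y = run(ts)
    return 0.5 * sum((y_t - x_t) ** 2 for y_t, x_t in zip(y, x))

# My proposed form:
# dE_t/dtheta_s = (dE_t/dy_t)(dy_t/ds_t)(ds_1/dtheta_s) prod_{k=2}^t ds_k/ds_{k-1}
s, y = run(theta_s)
proposed = 0.0
for t in range(1, len(x) + 1):
    term = (y[t - 1] - x[t - 1]) * theta_y * dphi(s[t]) * phi(s0)  # ds_1/dtheta_s = phi(s_0)
    for k in range(2, t + 1):
        term *= theta_s * dphi(s[k - 1])                           # ds_k/ds_{k-1}
    proposed += term

# Central finite difference of the true dE/dtheta_s.
eps = 1e-6
fd = (total_error(theta_s + eps) - total_error(theta_s - eps)) / (2 * eps)
print(proposed, fd)  # these disagree
```

So the two expressions are evidently not equal in general — my product form seems to capture only the dependence flowing through $s_1$, which is what I'd like to understand.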