Why is computing the "gradient" considered to be going "backwards" in time?


Maybe I am overthinking this, but I am reading *Deep Learning* by Goodfellow and learning about recurrent neural networks. I know this is just a detail, but it might be part of a bigger concept that is useful to know. For context, the book contrasts depicting an RNN graphically as a concise loop versus unfolding it into every time step. I take "loss" to mean the deviation between the predicted value and the actual value (i.e., what the loss function measures). What I don't understand is why computing the loss is a forward process while computing the gradient is a backward process.

[Two figures from the book: the RNN drawn as a concise loop, and the same RNN unfolded across time steps.]


Accepted answer:

Having done a bit of research into RNNs, I've found that "time step" is a term specific to RNNs. Equating it to a standard feed-forward network doesn't quite work, because the concept of time doesn't apply there.

The big difference is that an RNN can take an arbitrarily long sequence of inputs that all contribute to one (or many) outputs. A standard feed-forward network cannot do this. Each input in the sequence is one time step.

A very common example is stock prices. An RNN can take the stock price each day and predict the next day's price based on the new input and what it remembers from all previous inputs. Each such prediction is one time step. So as time moves forward, previous data and new data are combined to produce outputs, and comparing each output to the expected value gives the loss.
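As a sketch of that forward process, here is a minimal scalar RNN run on made-up daily prices. The weights, the price values, and the squared-error loss are all illustrative assumptions, not taken from the book:

```python
import numpy as np

# Minimal scalar RNN forward pass (illustrative sketch; weights and
# prices below are made up, not from any real model or dataset).
W_h, W_x, W_y = 0.5, 0.8, 1.2    # recurrent, input, and output weights

prices = [3.0, 3.2, 3.1, 3.5]    # hypothetical daily stock prices
targets = prices[1:]             # at each step, predict the NEXT day's price

h = 0.0          # hidden state: the RNN's memory of all previous inputs
total_loss = 0.0
for t in range(len(targets)):
    h = np.tanh(W_h * h + W_x * prices[t])   # combine memory with new input
    y = W_y * h                              # prediction for day t+1
    total_loss += 0.5 * (y - targets[t])**2  # loss accumulates FORWARD in time
```

The point is that the loss can only be accumulated forward: each hidden state depends on the one before it, so you must process day 1 before day 2.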

When you wish to apply backpropagation, that is when you need to go backwards in time. At whichever time step you are at, you combine the loss gradient at the current step with any gradients backpropagated from the future (at the most recent time step there won't be any future gradients) to compute the relevant parameter updates.
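That backward flow can be sketched for a scalar RNN of the form $h_t = \tanh(W_h h_{t-1} + W_x x_t)$, $y_t = W_y h_t$ with squared-error loss. Everything here (the weights, the inputs, the targets) is an illustrative assumption:

```python
import numpy as np

# Backpropagation through time (BPTT) for a scalar RNN:
#   h_t = tanh(W_h * h_{t-1} + W_x * x_t),  y_t = W_y * h_t
# All values below are made up for illustration.
W_h, W_x, W_y = 0.5, 0.8, 1.2
xs      = [3.0, 3.2, 3.1]
targets = [3.2, 3.1, 3.5]

# Forward pass, storing every hidden state for the backward pass.
hs = [0.0]                       # hs[t+1] is the hidden state at step t
ys = []
for x in xs:
    hs.append(np.tanh(W_h * hs[-1] + W_x * x))
    ys.append(W_y * hs[-1])

# Backward pass: iterate over time steps in REVERSE order.
dW_h = dW_x = dW_y = 0.0
dh_future = 0.0                  # gradient from the future; zero at the last step
for t in reversed(range(len(xs))):
    dy = ys[t] - targets[t]              # current step's loss gradient
    dh = W_y * dy + dh_future            # combine with the future's gradient
    da = dh * (1 - hs[t + 1]**2)         # back through tanh
    dW_y += dy * hs[t + 1]
    dW_h += da * hs[t]
    dW_x += da * xs[t]
    dh_future = da * W_h                 # pass the gradient one step BACK in time
```

Note the symmetry with the forward pass: the loss is built up from the first step to the last, while `dh_future` flows from the last step to the first.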

Goodfellow chose this wording deliberately, to highlight the difference between an RNN and other neural networks.

Second answer:

Any neural network can be seen as a composition of layers. If you view each layer as an independent function with an input tensor and an output tensor, then the network looks like this (where $x$ is the input tensor of the network):

$f_L(f_{L-1}(\dots f_2(f_1(x)) \dots))$

such that each $f_l$ is a layer.

This could be written more concisely as:

$f_L \circ f_{L-1} \circ ... \circ f_2 \circ f_1$

Notice that the last layer is written first, so when you compute the gradient/derivative and apply the chain rule you start from $f_L$ and work backward to $f_1$. This is what "backward in time" means.

I know that your question is about RNNs, but you can view the unfolding of an RNN through time as adding layers to the network.
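To make the reverse order of the chain rule concrete, here is a toy composition $f_3 \circ f_2 \circ f_1$ (the particular functions are chosen arbitrarily for illustration). The forward pass evaluates $f_1$ first, but the gradient is accumulated starting from the last function, exactly as in backpropagation:

```python
import math

# Chain rule on a composition f3(f2(f1(x))): forward evaluation runs
# f1 -> f2 -> f3, but the gradient is accumulated f3' -> f2' -> f1'.
# (Toy functions, chosen purely for illustration.)
fs  = [math.sin, math.exp, lambda a: a**2]      # f1, f2, f3
dfs = [math.cos, math.exp, lambda a: 2 * a]     # their derivatives

x = 0.5
activations = [x]
for f in fs:                       # forward pass: f1, then f2, then f3
    activations.append(f(activations[-1]))

grad = 1.0
for f_prime, a in zip(reversed(dfs), reversed(activations[:-1])):
    grad *= f_prime(a)             # backward pass: f3', then f2', then f1'
```

The backward loop needs the `activations` saved during the forward pass, which is exactly why backpropagation (and BPTT in an unfolded RNN) must run forward first and then sweep backward.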