A proof of the memorisation capacity of long-short-term-memory (LSTM) networks (recurrent neural networks)


I've tried asking for a proof on Data Science Stack Exchange, but I think I might have more luck over here.

The following equations define the LSTM. The input, forget, and output gates are: $$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$ $$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$ $$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$

And the cell state $c_t$ and hidden state $h_t$ are:

$$c_t = f_t \odot c_{t-1} + i_t \odot \text{tanh}(W_c[h_{t-1}, x_t] + b_c) $$ $$h_t = o_t \odot \text{tanh}(c_t)$$

Conventionally, $h_0 = c_0 = {\bf 0}$ (a vector of zeros), but we are not restricted to this particular choice in this question.
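For concreteness, here is a minimal NumPy sketch of one step of the recurrence defined by the equations above (the function and variable names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, b_i, W_f, b_f, W_o, b_o, W_c, b_c):
    # [h_{t-1}, x_t]: concatenation of previous hidden state and current input
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ z + b_i)   # input gate
    f_t = sigmoid(W_f @ z + b_f)   # forget gate
    o_t = sigmoid(W_o @ z + b_o)   # output gate
    # cell state: gated mix of old cell state and candidate update
    c_t = f_t * c_prev + i_t * np.tanh(W_c @ z + b_c)
    # hidden state: gated squashed cell state
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

With $h_t, c_t \in \mathbb{R}^m$ and scalar inputs, each weight matrix has shape $m \times (m + 1)$ and each bias has shape $m$.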

The task I would like to accomplish is the following:

Given an input sequence of $n$ real numbers, I want the network to return the first number in the sequence after reading the whole sequence. That is, given an input sequence $(x_1, x_2, \cdots, x_n)$ with $x_i \in \mathbb{R}$, the system has to return the first input $x_1$ after reading $x_n$. To be a little more specific, we may apply a transformation to $h_n$ or $c_n$ to obtain $x_1$.

We can assume that the real numbers we draw are confined to what can be expressed on a finite-precision machine.

The question is thus: what would be an analytic solution for the weights (and biases) that makes this happen (together with a proof), assuming that $h_t, c_t \in \mathbb{R}^m$?
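I don't have the full construction, but one building block such a solution would presumably rely on is gate saturation: with $W_f = 0$ and a large positive bias $b_f$, the forget gate is numerically $1$, and with a large negative $b_i$ the input gate is numerically $0$, so the cell state is frozen at finite precision regardless of later inputs. A quick numerical check of this (all dimensions, magnitudes, and names are my own choices, and I simply pretend $x_1$ was already written into the cell at $t = 1$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM cell, m = 1, with gates driven only by their (saturating) biases.
b_f, b_i, b_o = 20.0, -20.0, 0.0   # f_t ~ 1, i_t ~ 0, o_t = 1/2
W_c, b_c = np.ones(2), 0.0          # candidate update (irrelevant once i_t ~ 0)

c = 0.75           # pretend x_1 was stored in the cell state at t = 1
c_start = c
h = 0.0
rng = np.random.default_rng(1)
for x in rng.standard_normal(50):   # 50 arbitrary later inputs
    z = np.array([h, x])
    f = sigmoid(b_f)                # ~ 1 - 2e-9
    i = sigmoid(b_i)                # ~ 2e-9
    c = f * c + i * np.tanh(W_c @ z + b_c)
    h = sigmoid(b_o) * np.tanh(c)
print(abs(c - c_start))             # tiny: the cell state is effectively frozen
```

The remaining (harder) part of the question is how to make the input gate open only at $t = 1$ with time-invariant weights, and how to invert the $\tanh$ squashing in the readout.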