Are the eligibility traces for neural network the same as the general algorithm in Reinforcement Learning by Sutton and Barto?


To define the TD($\lambda$) algorithm, the authors keep track of an eligibility trace $E(S)$ for each state $S$ the agent has encountered during an episode, and these traces decay at a rate of $\gamma\lambda$, where $\gamma$ is the rate at which rewards are discounted. When a particular state is visited, its eligibility trace increases by one, then begins to decay as new actions are taken until that state is visited again. After each step of the episode, the value function $V(S)$ is updated in proportion to $E(S)$, so that more recently and more often visited states are updated by a larger factor than states visited long ago. The algorithm is shown below:

[Image: pseudocode for tabular TD($\lambda$)]
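To make the tabular version concrete, here is a minimal Python sketch of TD($\lambda$) with accumulating traces. The `env` interface (`reset`, `actions`, `step`) is a made-up stand-in for illustration, not something from the book:

```python
import random

def tabular_td_lambda(env, num_episodes, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) with accumulating eligibility traces."""
    V = {}              # state -> value estimate
    for _ in range(num_episodes):
        E = {}          # eligibility traces, reset at the start of each episode
        s = env.reset()
        done = False
        while not done:
            a = random.choice(env.actions(s))          # any behaviour policy
            s_next, r, done = env.step(s, a)
            # TD error for this transition
            delta = r + gamma * V.get(s_next, 0.0) * (not done) - V.get(s, 0.0)
            # bump the trace for the state just visited
            E[s] = E.get(s, 0.0) + 1.0
            # update every traced state in proportion to its trace, then decay
            for state in list(E):
                V[state] = V.get(state, 0.0) + alpha * delta * E[state]
                E[state] *= gamma * lam
            s = s_next
    return V
```

Note that a single TD error updates *every* state with a nonzero trace, which is exactly how credit flows back to states visited earlier in the episode.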

Later, though, when the authors talk about using TD($\lambda$) with a neural network, the eligibility traces seem to keep track of which parameters were involved in recent updates instead of which states were most recently visited. At least, that's what it looks like to me. The algorithm is shown below:

[Image: pseudocode for semi-gradient TD($\lambda$) with function approximation]
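For comparison, here is a minimal sketch of the function-approximation version. I use linear features rather than a full neural network to keep the gradient explicit, but the trace logic is the same: the trace `z` lives in weight space, and each step adds the gradient of the current state's value (which, for a linear $v(s,\mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$, is just the feature vector). The `features`/`transitions` interface is my own illustration:

```python
import numpy as np

def semi_gradient_td_lambda(features, transitions, alpha=0.01, gamma=0.99, lam=0.9):
    """Semi-gradient TD(lambda) with a linear value function v(s, w) = w @ x(s)."""
    n = len(features(0))
    w = np.zeros(n)                    # value-function weights
    z = np.zeros(n)                    # eligibility trace over *weights*, not states
    for s, r, s_next, done in transitions:
        x = features(s)
        v = w @ x
        v_next = 0.0 if done else w @ features(s_next)
        delta = r + gamma * v_next - v
        z = gamma * lam * z + x        # grad of v(s, w) w.r.t. w is x for linear v
        w += alpha * delta * z         # every weight moves in proportion to its trace
        if done:
            z = np.zeros(n)            # traces reset between episodes
    return w
```

With one-hot features this reduces exactly to the tabular algorithm, which is one way to see that the second algorithm generalises the first.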

So it seems like these are different algorithms to me. Are they different?

These algorithms are different to a degree: one concerns discrete state spaces, while the other uses a neural network to represent the value function over a continuous state space. However, I think that is the only substantial difference; in other words, the second algorithm is the generalisation of the first to continuous spaces.

Your understanding of the first algorithm seems correct. The eligibility trace keeps track of how eligible each state is for a potential update upon receiving a reward.

When we extend that to a continuous state space, the value function is no longer a separate value stored for each state; instead, it is represented through the weights $\mathbf{w}$. So the algorithm keeps track of which weights are eligible for an update. It does this by measuring how much each weight contributed to the value of the current state, as quantified by the gradient $\nabla \hat{v}(S, \mathbf{w})$.
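To see what "how much each weight contributed" means for an actual network, here is a small numeric illustration with a one-hidden-layer value network (the sizes and initialisation are arbitrary, purely for demonstration). The gradient of $v$ with respect to each weight is computed by hand, and those gradients are what get folded into the per-weight traces:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                  # feature vector for the current state

# a tiny one-hidden-layer value network: v(s, w) = w2 @ tanh(W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
w2, b2 = rng.normal(size=4), 0.0

h = np.tanh(W1 @ x + b1)
v = w2 @ h + b2                         # value estimate v(S, w)

# gradient of v with respect to each weight: the "contribution" of that
# weight to this state's value
grad_w2 = h
grad_b1 = w2 * (1 - h**2)               # tanh'(u) = 1 - tanh(u)^2
grad_W1 = np.outer(grad_b1, x)

# fold the gradients into per-weight eligibility traces
gamma, lam = 0.99, 0.9
z_W1, z_w2 = np.zeros_like(W1), np.zeros_like(w2)
z_W1 = gamma * lam * z_W1 + grad_W1
z_w2 = gamma * lam * z_w2 + grad_w2
```

A weight whose gradient is large for recently visited states accumulates a large trace, and so it receives a large share of the next TD error.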

I think your misunderstanding is in how the trace relates to the weights. You say the eligibility trace keeps track of which weights have most recently been updated, but it is actually the other way around: the eligibility trace determines how the weights are updated. What the trace records is which weights contributed most to the values of recently visited states, in the same way that the discrete eligibility trace recorded the most recently visited states.

Hope that helps somewhat!