How to apply the Bellman equation?


The Bellman equation is the fundamental mathematical equation we learn about in reinforcement learning. I am teaching myself reinforcement learning theory (I am not a quant) and came across this equation:

U(s) <- R(s) + gamma*max(sum(T(s,a,s')*U(s')))

where U is the utility function, R the reward, gamma the discount factor and T the transition function.

How should I use Bellman equation for a simple grid world?

Below is a simple grid world. Name the rows from bottom to top as a and b, and the columns from left to right as 1, 2 and 3; there are four actions: up, down, left and right.

The reward at b1 is -100, the reward at a2 is +10, and the rest of the grid has a reward of 0.

Let's say I start from position b3; how should I calculate the other state values?



Best answer:

So, on the one hand you should clarify what the sum() ranges over, but let's start with the transition function $T(s,a,s')$: it is a function of three arguments, the current state $s$, the action taken $a$ and the future state $s'$. Informally, $T$ records all the combinations of "where we are", "what action we take" and "where we will be next".

We know that both the state space and the action set are discrete (their elements can be labeled with integers) and rather small, so $T$ can be completely enumerated; it can be thought of as a table with $n_s \times n_a \times n_s$ entries.

However, $T$ is also sparse (many, if not most, entries are $0$), since certain state jumps are not possible: in our gridworld example, one can only go from $b_3$ to either $b_2$ or $a_3$ (let's number the states from 1 to 6 from now on, $a_1$ through $a_3$ as 1 to 3 and $b_1$ through $b_3$ as 4 to 6, and the actions U-L-D-R as 1 to 4), so $$T(s = 6,\cdot,\cdot) = \begin{array}{c} U \\L \\ D \\ R \end{array} \left( \begin{array}{c c c c c c} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & \mathbf{1} & 0 \\ 0 & 0 & \mathbf{1} & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ \end{array} \right) $$

The other "slices" of $T$ correspond to the other 5 states.
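As a concrete sketch (my own illustration, not part of the original answer), the whole table can be built programmatically with NumPy. I number the states $a_1$ through $b_3$ as indices 0 to 5 (the answer's 1 to 6 shifted by one) and leave impossible moves as all-zero rows, exactly as in the slice above:

```python
import numpy as np

# States a1, a2, a3, b1, b2, b3 as indices 0..5 (the answer's 1..6);
# row a is the bottom row. Actions U, L, D, R as indices 0..3.
n_states, n_actions = 6, 4
T = np.zeros((n_states, n_actions, n_states))

for s in range(n_states):
    row, col = divmod(s, 3)              # row 0 = a, row 1 = b
    moves = [(row + 1, col),             # U
             (row, col - 1),             # L
             (row - 1, col),             # D
             (row, col + 1)]             # R
    for a, (r, c) in enumerate(moves):
        if 0 <= r <= 1 and 0 <= c <= 2:  # stay inside the 2x3 grid
            T[s, a, 3 * r + c] = 1.0

print(T[5])  # the slice for state 6 (b3): L reaches b2, D reaches a3
```

Printing `T[5]` reproduces the $4 \times 6$ slice shown above (with 0-based columns).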

The equation you wrote above models the value of being in the present state $s$. It should be written as $$U(s) = R(s) + \gamma \max_i \left[ \sum_j T(s,a_i,s_j') U(s_j') \right]$$ so that we "marginalize out" the future state (by summing over $s_j'$) and then seek the action that maximizes that sum.

To emphasize: the value of any current state depends on the immediate reward and on the available choices, through the values of the states each choice leads to.
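In code, one sweep of that update is short. The sketch below is my own (the answer contains no code); it assumes $T$ is stored as an $n_s \times n_a \times n_s$ array as described earlier, and simply lets blocked actions (all-zero rows of $T$) contribute an expected value of 0.

```python
import numpy as np

def bellman_backup(U, R, T, gamma=0.9):
    """One sweep of U(s) <- R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
    Q = T @ U                     # Q[s, a] = sum_s' T(s, a, s') * U(s')
    return R + gamma * Q.max(axis=1)
```

Repeating this sweep until $U$ stops changing is the standard value-iteration procedure.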

The above is only half of the story: the formula simply tells us the utility of every current state $U(s)$, but its definition is recursive: every action carries some consequences, and those in turn must be acted upon in order to maximize the total reward.

How do we actually compute it? We set an initial condition (the initial state $s_0 = 6$, i.e. $b_3$), as well as the known rewards $R(4) = -100$ and $R(2) = 10$, with $0$ elsewhere; we compute $U$ for each possible action, and repeat (i.e. set the current state to $s_1$ and evaluate all possible $s_2$, etc.) until we reach some desired final state $\hat{s}$ having $U(\hat{s}) = \max_s U(s)$.

At this point we backtrack, keeping track of which sequence of actions led to $\hat{s}$. This backward "integration" (if you think of the agent as a dynamical system literally moving through space) is the central principle of the Bellman equation; it is also called backward induction.
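Putting the pieces together, here is a minimal value-iteration sketch for this gridworld (again my own code, with $\gamma = 0.9$ and the stopping threshold chosen arbitrarily). The greedy policy extracted at the end plays the role of the backtracking step: in each state it picks the action whose expected next-state value achieved the max.

```python
import numpy as np

# Value iteration on the 2x3 gridworld: states a1..b3 as indices 0..5,
# actions U, L, D, R as 0..3; moves off the grid are simply impossible.
n_states, n_actions, gamma = 6, 4, 0.9
T = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    row, col = divmod(s, 3)
    for a, (r, c) in enumerate([(row + 1, col), (row, col - 1),
                                (row - 1, col), (row, col + 1)]):
        if 0 <= r <= 1 and 0 <= c <= 2:
            T[s, a, 3 * r + c] = 1.0

R = np.zeros(n_states)
R[3], R[1] = -100.0, 10.0        # R(b1) = -100, R(a2) = +10

# Repeat the Bellman update until U stops changing.
U = np.zeros(n_states)
for _ in range(1000):
    U_new = R + gamma * (T @ U).max(axis=1)
    done = np.abs(U_new - U).max() < 1e-8
    U = U_new
    if done:
        break

# "Backtracking": in each state, take the action whose expected
# next-state value achieved the max.
policy = (T @ U).argmax(axis=1)
print(np.round(U, 2))
print(policy)
```

At convergence, $a_2$ (index 1) has the highest utility, $b_1$ stays strongly negative, and every other state's greedy action points along a shortest path toward $a_2$.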