Bellman Equation in stochastic environment


I am trying to solve a grid problem that assigns values to the cells of a grid using the Bellman equation. Before I show the grid and explain where I am stuck, here are the details of the problem. Discount factor = 0.9, noise = 0.2 (meaning that if we try to go up, there is an 80% chance we actually go up, a 10% chance we slip left, and a 10% chance we slip right, so 10% + 10% = 20% = 0.2; the same applies to the other actions, with the slips perpendicular to the intended move), and there is no living reward. Each grid cell is written as (value, direction), where direction is the current best policy, which is randomly assigned at the beginning. W stands for wall, and if we hit it we bounce back to where we were. Terminal states only have values.

So let's say our grid starts this way:

col1 col2 col3 col4
(0,up) (0,up) (0,right) (1)
(0,up) (W) (0,left) (-1)
(0,up) (0,up) (0,up) (0,down)

So after the first iteration only one value changes: the cell to the left of the +1. The new value is (probability of moving in the direction of the policy * value of the resulting state) * discount factor = (0.8 * 1) * 0.9 = 0.72.
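That first backup can be checked numerically. A minimal sketch in Python, assuming (as the problem states) an 80/10/10 transition split, bounce-back on the grid edge, and all non-terminal values starting at 0:

```python
# First Bellman backup for the cell immediately left of the +1 terminal.
# Assumed model (from the problem statement): 80% the intended move,
# 10% each perpendicular slip; all non-terminal values start at 0.
gamma = 0.9        # discount factor
p_intended = 0.8   # 1 - noise
v_right = 1.0      # the +1 terminal
v_up = 0.0         # slipping up bounces off the edge back to this cell (value 0)
v_down = 0.0       # slipping down lands on a cell still valued 0

new_value = gamma * (p_intended * v_right + 0.1 * v_up + 0.1 * v_down)
print(round(new_value, 2))  # 0.72
```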

So we will have:

col1 col2 col3 col4
(0,up) (0,up) (0.72,right) (1)
(0,up) (W) (0,left) (-1)
(0,up) (0,up) (0,up) (0,down)

Now after the next iteration, the values underneath the 0.72 and to its left are updated. Following the same pattern, the left one becomes 0.52 = (0.72 * 0.8) * 0.9, but why does the one underneath it become 0.43? The directions change as well.

col1 col2 col3 col4
(0,up) (0.52,right) (0.78,right) (1)
(0,up) (W) (0.43,up) (-1)
(0,up) (0,up) (0,up) (0,down)

Then in the next iteration we will also get:

col1 col2 col3 col4
(0.37,right) (0.66,right) (0.83,right) (1)
(0,up) (W) (0.51,up) (-1)
(0,up) (0,up) (0.31,up) (0,down)

If someone can just explain the process of how these are getting updated I would really appreciate it. I have to go over many iterations until the values stop changing, but if someone can explain just the couple of iterations above, I will learn the pattern and will do the rest.


1 Answer

Best Answer

This looks like a standard policy-iteration or value-iteration algorithm to me. We use a corollary of the proof of the Banach contraction principle: if $f$ is a contraction, then the sequence of iterates $(x, f(x), f(f(x)), \dots)$ converges to the fixed point of $f$ regardless of the choice of the starting point $x$. A reminder: a fixed point is a point satisfying $x_0 = f(x_0)$.

Therefore, your question is probably about the form of the mapping $f$ we use for the updates -- please do correct me if I am wrong.

At every iteration we go through each accessible state (which basically means we skip the wall $W$) and update the value function $V$ (the numbers appearing next to the suggested moves) according to the Bellman equation:

$$V_t(s) \leftarrow \max_a \left( \mathbb{E} [R(s,a)] + \gamma \cdot \sum_{s'} P(s,a,s') \cdot V_{t-1}(s') \right)$$

A quick explanation of the symbols below:

  • $V_t(s)$ is the value function at state $s$ at the $t$-th iteration.
  • $R(s,a)$ is the reward for performing action $a$ in state $s$. Because the effect of a move is random, it is a random variable, hence the expected value.
  • $\gamma$ is the discount factor.
  • $P(s,a,s')$ is the probability of reaching state $s'$ from state $s$ by performing action $a$.
  • $V_{t-1}(s')$ is the value function from the previous iteration.

I believe that during the first iteration you missed the step of assigning the value to the cell left of the one with reward equal to $-1$. After each iteration you should reassign the actions so that they maximize expected value function after the move you perform.
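To make the update concrete, here is a minimal value-iteration sketch in Python for the 3x4 grid in the question. The assumptions match the problem statement: bounce-back on walls and edges, terminals held fixed at +1/-1, no living reward, discount 0.9, noise 0.2 split 10/10 over the perpendicular directions. It reproduces the numbers in the question (0.72 after one sweep; 0.78, 0.52, 0.43 after two; and so on):

```python
# Value iteration on the 3x4 gridworld from the question.
# Assumptions: discount 0.9, noise 0.2 (80% intended move, 10% each
# perpendicular slip), no living reward, walls/edges bounce the agent
# back, terminal values (+1, -1) are held fixed.

GAMMA, NOISE = 0.9, 0.2
ROWS, COLS = 3, 4
WALLS = {(1, 1)}
TERMINALS = {(0, 3): 1.0, (1, 3): -1.0}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def move(s, a):
    """Deterministic result of action a from state s (bounce off walls/edges)."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS and (r, c) not in WALLS else s

def q_value(V, s, a):
    """Expected discounted value of taking a in s under the 80/10/10 noise model."""
    side1, side2 = PERP[a]
    return sum(p * GAMMA * V[move(s, d)]
               for p, d in ((1 - NOISE, a), (NOISE / 2, side1), (NOISE / 2, side2)))

def value_iteration(n_iters):
    """Run n_iters synchronous Bellman sweeps starting from all zeros."""
    V = {(r, c): TERMINALS.get((r, c), 0.0)
         for r in range(ROWS) for c in range(COLS) if (r, c) not in WALLS}
    for _ in range(n_iters):
        V = {s: (v if s in TERMINALS else max(q_value(V, s, a) for a in ACTIONS))
             for s, v in V.items()}
    return V

def greedy_policy(V):
    """Reassign each non-terminal state's action to the argmax of its Q-values."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a))
            for s in V if s not in TERMINALS}
```

For example, `round(value_iteration(2)[(1, 2)], 2)` gives 0.43: going up yields 0.8 * 0.9 * 0.72 (intended), plus 0.1 * 0.9 * 0 (slip left into the wall, bounce back), plus 0.1 * 0.9 * (-1) (slip right into the -1 terminal), i.e. 0.5184 - 0.09 = 0.4284. The `greedy_policy` step is the action reassignment described above.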

Please do note, however, that without any penalty for wandering around this labyrinth, your algorithm might have no real reason to head for the exit -- it could just walk around the highly valued states.