I am trying to solve a grid question about assigning values to a grid using the Bellman equation. Before I show the grid and explain where I am stuck, here are a few details about the problem. Discount factor = 0.9, noise = 0.2 (meaning if we want to go up or down, there is an 80% chance we actually do, a 10% chance we slip left, and a 10% chance we slip right, so 10% + 10% = 20% = 0.2; the same goes for right and left, where the slips are 10% up and 10% down), and no living reward. The format of each grid element is (value, direction), where direction is the current best policy, which at the beginning is assigned arbitrarily. W stands for wall, and if we hit it, we bounce back to where we were before. Terminal states only have values.
So let's say our grid starts this way:
| col1 | col2 | col3 | col4 |
|---|---|---|---|
| (0,up) | (0,up) | (0,right) | (1) |
| (0,up) | (W) | (0,left) | (-1) |
| (0,up) | (0,up) | (0,up) | (0,down) |
So after the first iteration we assign only one value, the cell left of the 1. The new value is: (probability of going in the policy's direction * value gained by that action) * discount factor = (0.8 * 1) * 0.9 = 0.72
So we will have:
| col1 | col2 | col3 | col4 |
|---|---|---|---|
| (0,up) | (0,up) | (0.72,right) | (1) |
| (0,up) | (W) | (0,left) | (-1) |
| (0,up) | (0,up) | (0,up) | (0,down) |
Now after the next iteration the values underneath the 0.72 and to its left get updated. Following the same pattern, the left one becomes 0.52 = (0.72 * 0.8) * 0.9, but why does the one underneath it turn into 0.43? The directions change as well.
| col1 | col2 | col3 | col4 |
|---|---|---|---|
| (0,up) | (0.52,right) | (0.78,right) | (1) |
| (0,up) | (W) | (0.43,up) | (-1) |
| (0,up) | (0,up) | (0,up) | (0,down) |
Then in the next iteration we will also get:
| col1 | col2 | col3 | col4 |
|---|---|---|---|
| (0.37,right) | (0.66,right) | (0.83,right) | (1) |
| (0,up) | (W) | (0.51,up) | (-1) |
| (0,up) | (0,up) | (0.31,up) | (0,down) |
If someone can just explain how these values are getting updated, I would really appreciate it. I have to go through many iterations until the values stop changing, but if someone can explain just the couple of iterations above, I will learn the pattern and do the rest.
This looks like a standard value-iteration algorithm to me (with a greedy policy read off after each sweep). It relies on a corollary of the Banach contraction principle: if $f$ is a contraction, then the sequence of iterates $(x, f(x), f(f(x)), \ldots)$ converges to the unique fixed point of $f$ regardless of the choice of the initial starting point $x$. A reminder: a fixed point is a point satisfying $x_0 = f(x_0)$.
Therefore, your question is probably about the form of the mapping $f$ used for the updates; please do correct me if I am wrong.
At every iteration we go through each accessible state (which basically means we skip the wall $W$) and update the value function $V$ (the numbers appearing next to the suggested moves) according to the Bellman equation:
$$V_t(s) \leftarrow \max_a \left[ \mathbb{E}[R(s,a)] + \gamma \cdot \sum_{s'} P(s,a,s') \cdot V_{t-1}(s') \right]$$
A quick explanation of the symbols: $V_t(s)$ is the value of state $s$ at iteration $t$; $a$ ranges over the available actions (up, down, left, right); $\mathbb{E}[R(s,a)]$ is the expected immediate reward, which is $0$ everywhere here since there is no living reward; $\gamma = 0.9$ is the discount factor; and $P(s,a,s')$ is the probability of landing in $s'$ when taking action $a$ in $s$ (0.8 for the intended direction, 0.1 for each perpendicular slip, with moves into a wall leaving you in place). Crucially, the right-hand side uses the values $V_{t-1}$ from the *previous* iteration for every state.
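As a concrete instance, here is the backup behind the $0.43$ you ask about. In the second iteration, the cell below the $0.72$ takes its best action, up: with probability $0.8$ it reaches the cell valued $0.72$ in the previous iteration, with probability $0.1$ it slips right into the $-1$, and with probability $0.1$ it slips left into the wall $W$ and stays in place (own previous value $0$):

$$0.9 \cdot \big(0.8 \cdot 0.72 + 0.1 \cdot (-1) + 0.1 \cdot 0\big) = 0.9 \cdot 0.476 = 0.4284 \approx 0.43$$

Likewise, the cell left of the new $0.78$ is computed from the *old* $0.72$ (not from $0.78$, since updates are synchronous): $0.9 \cdot (0.8 \cdot 0.72 + 0.1 \cdot 0 + 0.1 \cdot 0) = 0.5184 \approx 0.52$. Because *up* is now the best action for the lower cell, its arrow changes along with its value.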
One subtle point about the first iteration: the cell left of the $-1$ also receives a Bellman backup, but its best action (for instance moving left into the wall, which keeps it among zero-valued neighbours) has expected value $0$, so its number does not visibly change. After each iteration you should also reassign the actions so that they maximize the expected value function after the move you perform; that is why the arrows change together with the numbers.
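To make the whole procedure concrete, here is a minimal value-iteration sketch for this exact grid. The dynamics (0.8/0.1/0.1 noise, bouncing off walls and borders, $\gamma = 0.9$, no living reward, terminal values clamped at their rewards) are taken from the question; the coordinate scheme and helper names such as `q_value` are my own choices.

```python
import itertools

GAMMA = 0.9
ROWS, COLS = 3, 4
WALL = (1, 1)
TERMINALS = {(0, 3): 1.0, (1, 3): -1.0}  # exit values stay clamped at their rewards

# intended step per action; with noise 0.2 each perpendicular slip has probability 0.1
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}
STATES = [s for s in itertools.product(range(ROWS), range(COLS)) if s != WALL]

def move(state, action):
    """Deterministic successor; bumping into the wall or the border stays put."""
    r, c = state[0] + ACTIONS[action][0], state[1] + ACTIONS[action][1]
    if not (0 <= r < ROWS and 0 <= c < COLS) or (r, c) == WALL:
        return state
    return (r, c)

def q_value(V, state, action):
    """Expected discounted value of one action (living reward is 0)."""
    return GAMMA * (0.8 * V[move(state, action)]
                    + sum(0.1 * V[move(state, slip)] for slip in SLIPS[action]))

def sweep(V):
    """One synchronous Bellman backup: every state reads last iteration's values."""
    return {s: (V[s] if s in TERMINALS
                else max(q_value(V, s, a) for a in ACTIONS))
            for s in V}

def greedy_policy(V):
    """Reassign each non-terminal state the action with the highest Q-value."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a))
            for s in STATES if s not in TERMINALS}

V = {s: TERMINALS.get(s, 0.0) for s in STATES}
for i in range(3):
    V = sweep(V)
    nonzero = {s: round(V[s], 2) for s in STATES
               if s not in TERMINALS and round(V[s], 2) != 0}
    print(f"after sweep {i + 1}: {nonzero}")
# sweep 2 prints {(0, 1): 0.52, (0, 2): 0.78, (1, 2): 0.43}

print(greedy_policy(V)[(0, 2)])  # the cell left of +1 keeps pointing "right"
```

Running this reproduces your three tables (0.72, then 0.52/0.78/0.43, then 0.37/0.66/0.83/0.51/0.31); extending the loop into a `while` that stops once the largest change drops below a tolerance gives the converged values you mention.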
Please do note, however, that without any penalty for wandering around this labyrinth, the only pressure to head for the exit comes from the discount factor: with a living reward of $0$, and the more so as $\gamma$ approaches $1$, the agent has little reason to reach the exit quickly rather than drift around highly valued states.