I am reading about the "Iterative Policy Evaluation" algorithm in the context of Reinforcement Learning (http://incompleteideas.net/book/ebook/node41.html). If I have understood this correctly - in a problem that contains multiple "states", this algorithm is used to estimate the "value" of being in any of these states. Once we are able to estimate these "values", we should then be able to study how good it is to select different actions for navigating these states, depending on the state we are currently in (i.e. "policies").
For example, consider a problem in which there is a grid made of 16 squares. The goal is to reach the "shaded grey squares" by moving "up, down, left and right" (i.e. "actions"):
If we were to use the "Iterative Policy Evaluation" algorithm in this problem, we would be able to estimate the "value" (in this problem, lower values are considered better than higher values) of being in any "square" (i.e. "state"):
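To make this concrete, here is a minimal sketch of iterative policy evaluation for a 4x4 gridworld of this kind. Everything specific in it is my own assumption rather than something stated above: squares are numbered 0-15 row by row, the shaded terminal squares are 0 and 15, each move costs 1 (so lower values are better, matching the convention here), and the policy being evaluated picks each of the four actions with equal probability.

```python
def policy_evaluation(theta=1e-6):
    """Sweep the Bellman expectation update until values stop changing."""
    values = [0.0] * 16
    terminals = {0, 15}  # assumed shaded squares

    def successors(s):
        # The four squares reached by up/down/left/right; moving off the
        # grid leaves you in the same square.
        row, col = divmod(s, 4)
        succ = []
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = row + dr, col + dc
            succ.append(r * 4 + c if 0 <= r < 4 and 0 <= c < 4 else s)
        return succ

    while True:
        delta = 0.0
        for s in range(16):
            if s in terminals:
                continue
            # Each of the 4 actions is equally likely: cost 1 per move
            # plus the value of wherever that move lands.
            new = sum(1 + values[n] for n in successors(s)) / 4
            delta = max(delta, abs(new - values[s]))
            values[s] = new
        if delta < theta:
            return values
```

For this setup the sweep settles on the familiar symmetric value map (squares next to a terminal end up around 14, the two non-shaded corners around 22).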
This part is understandable so far - the next part is what confuses me. In this particular problem, after we have run the "Iterative Policy Evaluation" algorithm, we are able to "peek" at the "value" of all neighboring states relative to the state we are currently in. This means that in the very immediate sense ("greedy"), depending on where we are currently located, we can infer which square (i.e. "state") to visit next and which "action" to take next: we simply examine which "neighboring square" has the smallest "value":
This means we can use the "Iterative Policy Evaluation" algorithm to get a "value map". For instance:
- If I am in Square 9, the "optimal policy" could be to go "UP, UP, LEFT" or "RIGHT, RIGHT, DOWN" or "UP, LEFT, UP" or "RIGHT, DOWN, RIGHT".
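That greedy read-off can be sketched as a one-step lookahead. The numbers below are assumptions on my part: I've numbered the squares 0-15 row by row with terminals at 0 and 15, and hard-coded the value map that random-policy evaluation converges to for that layout (written as costs, so lower is better).

```python
# Assumed value map: converged random-policy values for a 4x4 grid with
# terminal (shaded) squares at 0 and 15, expressed as costs.
VALUES = [
    0, 14, 20, 22,
    14, 18, 20, 20,
    20, 20, 18, 14,
    22, 20, 14, 0,
]

ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

def greedy_action(state):
    """Return the action leading to the lowest-valued neighboring square."""
    row, col = divmod(state, 4)
    best_action, best_value = None, float("inf")
    for name, (dr, dc) in ACTIONS.items():
        r, c = row + dr, col + dc
        # Moving off the grid leaves you where you are.
        succ = r * 4 + c if 0 <= r < 4 and 0 <= c < 4 else state
        if VALUES[succ] < best_value:
            best_action, best_value = name, VALUES[succ]
    return best_action
```

With this (assumed) numbering, `greedy_action(9)` picks "UP" - the first step of one of the equally short routes listed above.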
But as the diagram indicates, this is a "greedy policy" - and I am not sure if this "greedy policy" is ALWAYS optimal (i.e. leads to the shaded grey squares in the fewest movements). This leads me to my question: how can I be certain that "following the arrows" will ALWAYS lead me to the "shaded squares" in the least number of turns - or should I sometimes ignore the arrows?
In problems where there are far more "states" (e.g. 1000 squares), after we have estimated the values of all these states using the "Iterative Policy Evaluation" algorithm: is it possible that sometimes taking "non-greedy actions" (i.e. moving to a neighboring square having a "less-desirable value" than some other neighboring square) might result in you reaching the "shaded grey squares" in fewer movements (i.e. "more optimal") than a "greedy policy"?
In other words, is the "Iterative Policy Evaluation" algorithm able to estimate the values of states in such a way that, no matter which square you are currently in, the greedy policy will always be the "optimal policy" (i.e. take you to the grey shaded squares in the fewest turns)? Or does the "Iterative Policy Evaluation" algorithm only estimate the values of states, while the "optimal policy" remains to be determined using some other algorithm?
Thank you!
PS: I am aware of a theorem called the "Policy Improvement Theorem" that can update and improve the values of the states estimated by "Iterative Policy Evaluation" - but my question still remains: even when all states have had their optimal values estimated, will selecting the "greedy policy" at each state necessarily be the "overall optimal policy"? Is it possible that, even with optimal values for all states, selecting the policy that leads to the highest reward after "n steps" instead of "1 step" could be better - or is there some math proof that suggests otherwise?
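For reference, the "Policy Improvement Theorem" mentioned here can be stated as follows (in the book's reward-maximising convention, where v_pi is the value function of policy pi and q_pi(s, a) is the value of taking action a in state s and following pi afterwards):

```latex
\text{If } q_\pi\bigl(s, \pi'(s)\bigr) \ge v_\pi(s) \text{ for all states } s,
\quad \text{then} \quad v_{\pi'}(s) \ge v_\pi(s) \text{ for all states } s.
```

Alternating evaluation and greedy improvement in this way is the policy-iteration loop described in the book.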




Such an algorithm does not guarantee that you will always arrive at an optimal policy. It only gets you to a local minimum, which in general may or may not be the global one.
One can modify your example slightly to see this effect. Suppose the values of the 4 corners are not equal but differ slightly from one another, while each is still better than the values of the inner squares. The algorithm you described will still find its way to some corner, but whether this is the corner with the lowest value depends on the starting square.
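The effect is easy to reproduce in a toy setting. The sketch below is my own construction (not your grid): a 1-D row of values with two local minima, where greedy descent - always stepping to the lower-valued neighbour - halts at whichever minimum the starting point funnels into.

```python
# Two local minima: index 0 (value 1) and index 7 (value 0, the global one).
VALUES_1D = [1, 2, 3, 4, 3, 2, 1, 0]

def greedy_descent(i):
    """Step to the lower-valued neighbour until no neighbour improves."""
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(VALUES_1D)]
        best = min(neighbours, key=lambda j: VALUES_1D[j])
        if VALUES_1D[best] >= VALUES_1D[i]:
            return i  # local minimum reached
        i = best

print(greedy_descent(2))  # 0 -> stuck in the worse minimum
print(greedy_descent(5))  # 7 -> reaches the global minimum
```

Which end you reach depends entirely on where you start, exactly as with the unequal corners.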
Edit: The text above describes what happens if you go through the search once. If you run this procedure multiple times with different random starting points, eventually you should find the global minimum as well. There is a big difference between theory and practical applications here. In theory you never know whether you have actually found all possible end states, and it can take arbitrarily many tries to find the global minimum. In practice, most real-world problems only have a single local minimum, or at most very few, so a couple of tries is usually good enough.