Markov Decision Process - Discounted return


I was reading this article about Discounted return (in the context of MDP): http://deeplizard.com/learn/video/a-SnJtmBtyA

I got to this section:

Now, check out this relationship below showing how returns at 
successive time steps are related to each other. 
We’ll make use of this relationship later.

[please use the image url (below) if it doesn't appear here][1]

[1]: https://i.stack.imgur.com/KgOML.png

The extract shows three maths formulas. I have three questions about them:

1- I noticed that $R_{t+3}$ occurs twice in the time steps. Could this be a typo, i.e. shouldn't the next time step be $R_{t+4}$? Or is it correct? If so, it doesn't make sense to me.

2- I didn't understand how, in the third formula at the bottom, the tail of the sum was rewritten as $\gamma G_{t+1}$.

3- Why is the discount $\gamma$ applied exponentially (i.e. its power increases by one with every time step)? Doesn't that seem like a dramatic increase in the discounting at each step (as opposed to, say, multiplying the discount by a coefficient equal to the time step value)?

Many thanks in advance for any help.

On BEST ANSWER
  1. Yes, this is a typo; the subscript should be $t+4$, i.e. $R_{t+4}$.
  2. The definition of $G_t$ is $$ G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$ which is a function of $t$. So if we want $G_{t+a}$ for some number $a$, we have: $$ G_{t+a} = \sum_{k=0}^{\infty} \gamma^k R_{(t+a)+k+1} = R_{t+a+1} + \gamma R_{t+a+2} + \gamma^2 R_{t+a+3} + \dots $$ In particular, $G_{t+1} = R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \dots$, so factoring one $\gamma$ out of every term of $G_t$ after $R_{t+1}$ gives $$ G_t = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \dots \right) = R_{t+1} + \gamma G_{t+1}. $$
  3. The choice of $\gamma$ determines how much you care about future rewards relative to immediate ones. If you take $\gamma = 0.5$, you are saying that you care about a reward coming one step in the future only half as much as a reward right now. If you want to care more about future rewards, you move $\gamma$ closer to $1$. The decay isn't necessarily dramatic; it depends on the choice of $\gamma$, and in most applications you don't want to give too much weight to very distant rewards, which is exactly what exponential discounting gives us. It also has the convenient property that it makes the recursive relationship $G_t = R_{t+1} + \gamma G_{t+1}$ hold, which a per-step coefficient scheme would not.
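As a quick sanity check of points 2 and 3, here is a small sketch (my own, with made-up reward values, using a finite reward list as if the episode terminates) that verifies $G_t = R_{t+1} + \gamma G_{t+1}$ numerically and shows how quickly the weight $\gamma^k$ shrinks for a few choices of $\gamma$:

```python
# Sketch (illustrative, not from the article): discounted return and
# the recursive relationship G_t = R_{t+1} + gamma * G_{t+1}.

def discounted_return(rewards, gamma):
    """Direct sum G_t = sum_{k>=0} gamma^k * R_{t+k+1} over a finite list."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 3.0, 1.0]  # R_{t+1}, R_{t+2}, ... (made-up values)
gamma = 0.9

g_t = discounted_return(rewards, gamma)
g_t1 = discounted_return(rewards[1:], gamma)  # G_{t+1} starts at R_{t+2}

# The relationship from the article: G_t = R_{t+1} + gamma * G_{t+1}
assert abs(g_t - (rewards[0] + gamma * g_t1)) < 1e-12

# How much a reward 10 steps in the future counts relative to an
# immediate one, for a few choices of gamma:
for g in (0.5, 0.9, 0.99):
    print(f"gamma={g}: weight on the reward 10 steps ahead is {g ** 10:.4f}")
```

With $\gamma = 0.99$ a reward ten steps ahead still carries about 90% of its face value, so the discounting is only as aggressive as you choose to make it.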