Dependent data in reinforcement learning


Hello, I have a question about reinforcement learning. I was watching the RL course by David Silver, and in Lecture 6: Value Function Approximation, he says that in reinforcement learning the data you get can be dependent, i.e. non-i.i.d. However, to my knowledge we use a Markov state to represent the state of the agent in reinforcement learning, and a characteristic of a Markov state is that each state is independent: we don't need to know the states before the previous one to make a decision. Why then does he say that the data can be dependent? Can someone clarify this for me? Thanks


There is 1 best solution below


Suppose a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$ and a reward function $\mathcal{R}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$. We assume time to start at $t=0$ and then to progress as $t=0,1,2,\ldots$. Furthermore, the dynamics of the environment are assumed to be governed by a transition function $P:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$, such that $\Pr(s_{t+1}=s''\mid s_t=s',a_t=a')=P(s',a',s'')$. Next, let us introduce a decision mapping (policy) $\pi:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$, where $\pi(s,a)$ is to be understood as the probability of choosing action $a$ when in state $s$. Obviously, we require $\sum\limits_{a\in\mathcal{A}}\pi(s,a)=1\ \forall s\in\mathcal{S}$. Now, let us define $\rho:\mathcal{S}\rightarrow[0,1]$ as the starting state distribution, that is, $\Pr(s_0=s)=\rho(s)$.

Crucially, let us consider policies $\pi$ for which there exists a distribution $\eta_{\pi}:\mathcal{S}\rightarrow[0,1]$ such that $$\lim_{t\rightarrow\infty}\Pr(s_t=s\mid s_0\sim\rho)=\eta_{\pi}(s).$$ A reasonable goal could be to find such a policy $\pi$ which maximizes the quantity $$\mathbb{E}_{s\sim\eta_{\pi},\,a\sim\pi(s,\cdot)}[\mathcal{R}(s,a)].$$

When training a reinforcement learning agent, we usually sample $s_0$ from $\rho$ and then obtain a rollout by applying $\pi$ to sample actions. This is also known as Monte Carlo simulation. Suppose now that we were to sample $n$-step trajectories using Monte Carlo simulation. This can be seen as a distribution on $\underbrace{\mathcal{S}\times\ldots\times\mathcal{S}}_{n\ \text{times}}$. Most likely, however, this distribution would be very different from the one obtained by sampling $n$ times independently from $\eta_{\pi}$. Imagine, for example, a robot driving around a circle at constant velocity. When sampling from $\eta_{\pi}$, the robot could be anywhere on the circle. However, within a single rollout, knowing the current state gives you a pretty good idea of the state at the next timestep.
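To make the circular-track example concrete, here is a minimal Python sketch (the layout and names such as `N`, `n_steps` and `rollout` are purely illustrative assumptions, not taken from the lecture): a robot on a circle of `N` discrete positions advances one position per step, so consecutive states within a rollout are completely dependent, whereas i.i.d. draws from the (here uniform) stationary distribution $\eta_\pi$ tell you nothing about one another.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100        # number of discrete positions on the circular track (assumed)
n_steps = 10   # length of one rollout (assumed)

# Rollout: the robot advances one position per step (constant velocity),
# so consecutive samples are fully dependent -- knowing s_t determines s_{t+1}.
s0 = int(rng.integers(N))                          # s_0 ~ rho (uniform start)
rollout = [(s0 + t) % N for t in range(n_steps)]

# For this symmetric chain the stationary distribution eta_pi is uniform,
# so n i.i.d. samples from it carry no information about each other.
iid_samples = rng.integers(N, size=n_steps)

print("rollout      :", rollout)            # consecutive positions, strongly ordered
print("i.i.d. sample:", list(iid_samples))  # scattered anywhere on the circle
```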

I.i.d. (independent and identically distributed) is itself a bit of a loose term, since it is not always clear which distribution is meant. However, consider for example a 2D game where a humanoid actor runs from left to right and has to overcome obstacles in order not to die. Assume that there is a set of possible types of obstacles (spears coming out of the ground, stones falling from the sky, etc.). If we were to use a policy gradient method to train our agent, it is not hard to see that using single rollouts might pose a problem. While a certain policy (which never dies) has a certain probability of encountering each type of obstacle (this probability can be derived from $\eta_{\pi}$), a single $n$-step rollout may well contain only a single type of obstacle. The data not being sampled i.i.d. from $\eta_{\pi}$ is a problem, since the gradient update will then be biased towards that single obstacle type.
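As a rough illustration of that bias (the block layout of obstacles and every name below are hypothetical assumptions, not from the answer): if obstacle types occur in long contiguous stretches of the level, a short rollout tends to see only one type, while i.i.d. samples from the stationary distribution cover the types roughly equally, so statistics (or gradients) computed from one rollout are skewed towards whichever type it happened to hit.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# Hypothetical level layout: obstacle types appear in long contiguous blocks.
obstacles = ["spear"] * 50 + ["stone"] * 50   # positions 0..99 of the level (assumed)
n = 10                                        # rollout length (assumed)

# n-step rollout: start at a random position and keep running to the right.
start = int(rng.integers(len(obstacles)))
rollout = [obstacles[(start + t) % len(obstacles)] for t in range(n)]

# n i.i.d. draws from the stationary distribution (uniform over positions).
iid = [obstacles[int(i)] for i in rng.integers(len(obstacles), size=n)]

print("rollout counts:", Counter(rollout))  # usually dominated by one type
print("i.i.d. counts :", Counter(iid))      # roughly balanced between types
```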