How to find the optimal linear basis functions of an MDP?


Given a set of basis functions, there are many papers on finding a weight vector to linearly approximate the value function.

Is there any paper on how to find the basis functions? Is it possible to compute the optimal basis functions to represent the value function compactly?


What you are looking for, essentially, is a way to automatically find the correct model for your data, i.e. model selection.

Unless you have information that supports strong priors, it is unlikely that you can automatically find an optimal solution. Typically you will do some feature engineering, or take a Bayesian approach to select between different hypotheses.

This answer might be five years late, but hopefully it does justice to an otherwise great question.

To my knowledge, the way to do this is to solve an inner and an outer problem. Some authors refer to this as a fast and a slow time scale. For clarity, let us define the linear function approximator as $F(x\mid\theta,w) = \sum_{i=1}^N w_i \phi_i(x\mid \theta)$, where $\phi_i(\cdot\mid\theta)$ are the parameterised basis functions, $\theta$ is a parameter vector and $w$ is the vector of linear weights.
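
As a concrete (hypothetical) instance, $\theta$ could hold the centers and width of a set of Gaussian radial basis functions. A minimal sketch, assuming that parameterisation:

```python
import numpy as np

def rbf_basis(x, theta):
    """Evaluate N Gaussian radial basis functions phi_i(x | theta).

    theta packs the basis parameters: centers c_i and a shared width.
    (This parameterisation is illustrative; any differentiable family works.)
    """
    centers, width = theta
    return np.exp(-((x - centers) ** 2) / (2.0 * width ** 2))

def F(x, theta, w):
    """Linear function approximator F(x | theta, w) = sum_i w_i phi_i(x | theta)."""
    return w @ rbf_basis(x, theta)

# Example: 5 basis functions with centers spread over [0, 1]
theta = (np.linspace(0.0, 1.0, 5), 0.25)
w = np.ones(5)
value = F(0.5, theta, w)
```

The inner problem then adjusts $w$ with `theta` frozen, while the outer problem adjusts `theta`.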

The inner problem solves a Markov Decision Process. This amounts to finding $w_n^*$, $\pi_n^*$ (optimal weights and optimal policy) using the parameter vector $\theta_n$. Of course the state-value functions are determined as well, but what is important here is that the weights have been optimised. The inner problem can be solved using model-based methods such as Value Iteration and Policy Iteration, or model-free (online) methods such as SARSA and Q-learning.
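
As a sketch of the inner step, assuming Gaussian RBF bases and a toy 1-D random walk in place of a real MDP (TD(0) under a fixed random policy stands in here for full SARSA/Q-learning with policy improvement):

```python
import numpy as np

def rbf_basis(x, theta):
    """Gaussian RBF features phi_i(x | theta); theta = (centers, width)."""
    centers, width = theta
    return np.exp(-((x - centers) ** 2) / (2.0 * width ** 2))

def inner_td0(theta, episodes=300, alpha=0.1, gamma=0.95, seed=0):
    """Inner problem: with theta held fixed, learn the linear weights w by
    TD(0) on a toy 1-D random walk over [0, 1] (reward 1 on reaching the
    right end, episodes terminate at either end)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(len(theta[0]))
    for _ in range(episodes):
        x = 0.5
        while 0.0 < x < 1.0:
            x_next = x + rng.choice([-0.1, 0.1])
            done = not (0.0 < x_next < 1.0)
            r = 1.0 if x_next >= 1.0 else 0.0
            phi = rbf_basis(x, theta)
            v_next = 0.0 if done else w @ rbf_basis(x_next, theta)
            w += alpha * (r + gamma * v_next - w @ phi) * phi  # TD(0) update
            x = x_next
    return w

theta = (np.linspace(0.0, 1.0, 5), 0.25)   # hypothetical basis parameters
w_star = inner_td0(theta)                  # optimised weights for this theta
```

The learned value estimate should be higher near the rewarding right end than near the left end, which is what the outer problem's score will ultimately reflect.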

The outer problem optimises $\theta$ with $w_n^*$ held fixed, so as to obtain $\theta_{n+1}^*$. To do this, it needs to evaluate the performance/score of the MDP. If you are working with average long-run reward Bellman equations, as opposed to the more popular discounted ones, then solving for the bias would suffice. However, simulating the MDP and computing the trajectory's reward is simpler. The outer problem is typically not convex, so gradient-based optimisation is not of much use (this matters if one cares about near-global optima). Meta-heuristics could be used; however, the Cross Entropy Method is a very elegant and well-researched way of solving the outer problem. Here is a paper that uses it: Basis Function Adaptation in Temporal Difference Reinforcement Learning by Menache, Mannor and Shimkin. I would give a hyperlink, but the link downloads the paper directly; if you Google the title, you should find the freely available version.
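
A minimal sketch of the outer loop with the Cross Entropy Method, where `score` is a placeholder for solving the inner problem with a candidate $\theta$ and returning the simulated reward (here replaced by a known toy objective so the sketch is self-contained):

```python
import numpy as np

def cem_outer(score, dim, iters=30, pop=50, elite_frac=0.2, seed=0):
    """Outer problem via the Cross Entropy Method over basis parameters theta.

    score(theta) should run the inner problem (solve the MDP with theta,
    obtaining w*) and return the trajectory reward; here it is abstract.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))      # sample thetas
        scores = np.array([score(th) for th in samples])      # evaluate each
        elites = samples[np.argsort(scores)[-n_elite:]]       # keep the best
        mu = elites.mean(axis=0)                              # refit sampler
        sigma = elites.std(axis=0) + 1e-6
    return mu

# Toy stand-in score with known optimum at theta = (1, -2)
best = cem_outer(lambda th: -np.sum((th - np.array([1.0, -2.0])) ** 2), dim=2)
```

The appeal of CEM here is that it needs only score evaluations, so the non-convexity and non-differentiability of the outer objective are irrelevant.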

If you are willing to invest, then chapter 4.4 of Reinforcement Learning and Dynamic Programming Using Function Approximators covers the same material. Note that the inner-outer scheme resembles a hill-climbing or expectation-maximisation algorithm that produces a level-optimising sequence from $\theta_0$, i.e. $\{ (\theta_0^*,w_0^*,L_0),(\theta_1^*,w_1^*,L_1),\cdots,(\theta_T^*,w_T^*,L_T) \}$ where $L_T \geq L_{T-1} \geq \cdots \geq L_1 \geq L_0$. Here, $L_i(w_i^*,\theta_i^*)$ is the performance/score function of the MDP.

Lastly, do not be surprised if using a Gaussian Process Surrogate model in conjunction with an acquisition function such as Expected Improvement yields good results for the outer problem.
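
A rough sketch of that idea, with a hand-rolled GP posterior and expected improvement over a 1-D parameter (the quadratic `score` is a stand-in; a real use would score each $\theta$ by solving the inner MDP):

```python
import math
import numpy as np

def gp_posterior(X, y, Xs, length=0.5, noise=1e-6):
    """Posterior mean/std of a zero-mean GP with an RBF kernel (minimal)."""
    def k(A, B):
        d = A[:, None] - B[None, :]
        return np.exp(-0.5 * (d / length) ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sd, best):
    """EI for maximisation: E[max(f - best, 0)] under the GP posterior."""
    z = (mu - best) / sd
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mu - best) * Phi + sd * phi

score = lambda th: -(th - 0.3) ** 2            # stand-in for the MDP score
X = np.array([0.0, 1.0]); y = score(X)         # two initial evaluations
grid = np.linspace(0.0, 1.0, 101)
for _ in range(10):                            # Bayesian optimisation loop
    mu, sd = gp_posterior(X, y, grid)
    th_next = grid[np.argmax(expected_improvement(mu, sd, y.max()))]
    X = np.append(X, th_next); y = np.append(y, score(th_next))
best_theta = X[np.argmax(y)]
```

Because each outer evaluation requires solving a full inner MDP, such sample-efficient surrogate methods can be a good fit when score evaluations are expensive.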