Using Q-Learning to find the best mixture of algorithms


I'm working on an optimization procedure where one out of N strategies (actions) is run at each time step, leading to a decrease (the reward) in a global objective function. Exactly which strategy works best depends on many factors: the problem at hand, the solver parameters, and the current state of the solver as it evolves over time.
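For concreteness, a setup like this can be viewed as a single-state (bandit-style) Q-learning loop: one Q-value per strategy, updated from the observed decrease in the objective. The sketch below is only illustrative; `run_strategy`, the reward model, and all constants are hypothetical placeholders, not the actual solver.

```python
import random

N = 3                 # number of strategies (actions); hypothetical
alpha = 0.1           # learning rate (placeholder value)
gamma = 0.9           # discount factor (placeholder value)
epsilon = 0.1         # exploration rate (placeholder value)
Q = [0.0] * N         # one Q-value per strategy (single-state case)

def run_strategy(a):
    # Placeholder: running strategy `a` returns the decrease in the
    # global objective at this step (the reward). Here it is simulated
    # with noisy per-strategy means.
    return random.gauss([1.0, 0.5, 0.2][a], 0.1)

for t in range(1000):
    # epsilon-greedy selection: mostly exploit, sometimes explore
    if random.random() < epsilon:
        a = random.randrange(N)
    else:
        a = max(range(N), key=lambda i: Q[i])
    reward = run_strategy(a)
    # Single-state Q-learning update: the "next state" is the same
    # state, so the bootstrap term is the max over the same Q-vector.
    Q[a] += alpha * (reward + gamma * max(Q) - Q[a])
```

With a single state this reduces to a multi-armed bandit, which is why the choice of exploration scheme (the subject of the questions below) matters so much.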

I implemented a simple Q-learning selection method for determining which of the strategies should be used at any given time. I have two questions about this:

  1. The actions do not seem to be independent. By this I mean that applying one strategy (action) at time t might influence another strategy's reward at a later time t' > t. Does this break the assumptions behind Q-learning?

  2. To choose an action, I normalize the vector of Q-values to [0,1] and sample an action with probability proportional to this normalized vector. I also tried two other selection methods:

  • epsilon-greedy, where the best action is selected with probability p and a random action is chosen with probability 1-p, and
  • Boltzmann (softmax) selection.

However, they don't seem to work as well (and they require more parameters). Does my method make any sense? What could be its shortcomings? I am concerned that it might be working by luck, and that this might change with other inputs.
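For reference, the three selection rules discussed above can be sketched as follows. This is a minimal sketch; the `p` and `temperature` defaults are placeholders, not tuned values.

```python
import math
import random

def normalized_selection(Q):
    # The scheme from the question: rescale Q-values to [0, 1] and
    # sample an action with probability proportional to the result.
    lo, hi = min(Q), max(Q)
    if hi == lo:
        return random.randrange(len(Q))  # all equal: pick uniformly
    w = [(q - lo) / (hi - lo) for q in Q]
    return random.choices(range(len(Q)), weights=w)[0]

def epsilon_greedy(Q, p=0.9):
    # Best action with probability p, uniform random otherwise.
    if random.random() < p:
        return max(range(len(Q)), key=lambda i: Q[i])
    return random.randrange(len(Q))

def boltzmann(Q, temperature=1.0):
    # Softmax over Q-values; the temperature controls exploration.
    m = max(Q)  # subtract the max for numerical stability
    w = [math.exp((q - m) / temperature) for q in Q]
    return random.choices(range(len(Q)), weights=w)[0]
```

One visible property of the normalized scheme, as written, is that the lowest-valued action receives weight 0 and is never selected, and the selection probabilities depend on the absolute spread of the Q-values rather than only on their ranking.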