Optimizing a function through reinforcement learning.


I want to know whether reinforcement learning can be used for optimization problems such as: \begin{align} & \max_{p_1^t, p_2^t}\quad\log_2(1+ p_1^t h_1^t)+\log_2(1+ p_2^t h_2^t) \\[6pt] \text{s.t. } & p_1^t+p_2^t \leq P_T^t \\[6pt] & p_1^t \geq 0,\; p_2^t \geq 0, \end{align} where $h_1^t$ and $h_2^t$ are positive random values at time $t$. The values of $h_1^t$ and $h_2^t$ are independent of each other and also of their values at previous times ($t-1, t-2, t-3, \dots$), and $p_1^t$ and $p_2^t$ are nonnegative continuous real numbers. If reinforcement learning can be applied, kindly guide me on how the action set can be made finite. Since $p_1^t$ and $p_2^t$ can take any value between $0$ and $P_T^t$, there are infinitely many possible actions: for example, if $P_T^t=5$, one possible action is $p_1^t=2.1$, $p_2^t=2.9$, and another is $p_1^t=2.15$, $p_2^t=2.85$.

Even if you are not sure about the answer, just tell me what you think; your idea could help me find the answer myself.



Best answer:

Normally, RL is used to optimize expectations of stochastic functions that are not differentiable, using stochastic gradient techniques. In this sense it can be applied to essentially any stochastic optimization problem... but usually that is a bad idea.

Anyway, usually when people want to use RL for optimization, they really mean the log-derivative (REINFORCE) trick. Using REINFORCE is usually a last resort, when nothing else works. It is given by: $$ \nabla_\theta U(\theta)= \nabla_\theta\,\mathbb{E}_{x\sim p_\theta}[f(x)] = \mathbb{E}_{x\sim p_\theta}[f(x)\, \nabla_\theta \log p_\theta(x)] $$

This means that we want to optimize some function $f$, which depends on random values from a distribution with parameters $\theta$. We want to find $\theta$ such that the expected value of $f$ is maximal. Notice that we don't really need to know much about $f$.
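To make this concrete, here is a minimal NumPy sketch of the estimator above on a toy problem. The objective $f$ and the choice of a scalar Gaussian sampling distribution are illustrative assumptions, not part of the question: we maximize $\mathbb{E}_{x\sim\mathcal{N}(\mu,\sigma^2)}[f(x)]$ over $\theta=(\mu,\log\sigma)$ without ever differentiating $f$.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Black-box objective; REINFORCE never differentiates through it.
    return -(x - 3.0) ** 2

# Sampling distribution: x ~ N(mu, sigma^2); theta = (mu, log_sigma).
mu, log_sigma = 0.0, 0.0
lr = 0.05

for step in range(2000):
    sigma = np.exp(log_sigma)
    x = rng.normal(mu, sigma, size=64)   # sample a batch of actions
    r = f(x)
    r = r - r.mean()                     # baseline: reduces variance, keeps the mean
    # For a Gaussian, grad log p is (x-mu)/sigma^2 w.r.t. mu
    # and (x-mu)^2/sigma^2 - 1 w.r.t. log_sigma.
    g_mu = np.mean(r * (x - mu) / sigma**2)
    g_ls = np.mean(r * ((x - mu) ** 2 / sigma**2 - 1.0))
    mu += lr * g_mu                      # gradient ascent on E[f]
    log_sigma += lr * g_ls
```

After training, `mu` approaches the maximizer of $f$ (here $x=3$), and `sigma` shrinks as the distribution concentrates on it.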

Anyway, for your case, especially since you tagged neural-networks, we can do something like this: let $g_\theta$ be a neural network (maybe an LSTM) that outputs some values at every time step (e.g. $p_i^t$), where $\theta$ are its weights. At every time step you feed it information about the previous step and the values of e.g. $h_i^t$. Again, for your stated problem this is seemingly not useful, but I don't know what your real problem is. Notice that $g_\theta(x_t)$ outputs a continuous vector. You can easily constrain the output to meet your constraints (e.g. run it through a scaled sigmoid so it lies in $[0, M]$), so that they are satisfied by construction, OR add penalty terms when the network output violates the constraints. Since time is discrete here, your "reward" is simply: $$ R(\theta) = \sum_{t=1}^{T} f(g_\theta(x_t), x_t) $$ where $f$ is basically your log sum in this case.
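The "satisfied by construction" option can be sketched as follows. This is one possible parametrization (the function name `to_feasible` and the total/fraction split are my own choices, not from the question): one sigmoid picks how much of the budget $P_T$ to spend, another picks how to split it, so $p_1, p_2 \geq 0$ and $p_1 + p_2 \leq P_T$ hold for any raw network output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def to_feasible(raw, P_T):
    """Map two unconstrained network outputs to (p1, p2) with
    p1 >= 0, p2 >= 0 and p1 + p2 <= P_T, by construction."""
    total = P_T * sigmoid(raw[0])   # total power actually used, in [0, P_T]
    frac = sigmoid(raw[1])          # fraction of it given to user 1, in [0, 1]
    return total * frac, total * (1.0 - frac)
```

Because the constraints can never be violated, no penalty terms are needed and the reward stays well defined for every sample the policy draws.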

In essence, the network will learn to output good numbers via reinforcement learning training. Our target is to optimize the expectation: $$ U(\theta) = \mathbb{E}[R(\theta)] $$ using stochastic gradient ascent. First, run some trajectories. You will get $N$ values of $R$, written $\{R_k\}$. We then update the parameters of the network as: $$ \theta \leftarrow \theta + \xi\nabla U $$ where $\xi$ is the step size and we estimate $\nabla U$ in a Monte Carlo manner: $$ \nabla U \approx \frac{1}{N} \sum_i R_i \sum_t \nabla_\theta \log p_\theta(x_t) $$ When actions are discrete, we treat $p_\theta(x_t)$ as a categorical distribution based on the network output. When actions are continuous (as in your case), simply have $g_\theta$ output the $\mu,\sigma$ of a Gaussian (which define the density $p_\theta$) and sample from it to get the actions. See, say, this or this for more.
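Putting the pieces together for the stated problem, here is a self-contained sketch. Several things are my own simplifying assumptions: the channel gains are drawn uniformly, the "network" is a tiny linear map instead of an LSTM (the channels are i.i.d., so no recurrence is needed), the exploration $\sigma$ is held fixed rather than learned, and names like `policy_mean` and `to_feasible` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
P_T = 5.0      # total power budget (assumed constant over t)
SIGMA = 0.3    # fixed exploration std of the Gaussian policy
LR = 0.01
BATCH = 32

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reward(p1, p2, h1, h2):
    # The objective from the question: sum of the two log rates.
    return np.log2(1.0 + p1 * h1) + np.log2(1.0 + p2 * h2)

def to_feasible(raw):
    # Map two unconstrained outputs to p1, p2 >= 0 with p1 + p2 <= P_T.
    total = P_T * sigmoid(raw[0])
    frac = sigmoid(raw[1])
    return total * frac, total * (1.0 - frac)

def policy_mean(theta, h):
    # Tiny linear "network": mean raw action from the channel state.
    return theta @ np.array([h[0], h[1], 1.0])   # theta has shape (2, 3)

theta = np.zeros((2, 3))
for step in range(5000):
    grad = np.zeros_like(theta)
    feats, noises, rewards = [], [], []
    for _ in range(BATCH):
        h = rng.uniform(0.1, 2.0, size=2)        # i.i.d. channel gains
        mu = policy_mean(theta, h)
        eps = SIGMA * rng.standard_normal(2)     # exploration noise
        p1, p2 = to_feasible(mu + eps)
        feats.append(np.array([h[0], h[1], 1.0]))
        noises.append(eps)
        rewards.append(reward(p1, p2, h[0], h[1]))
    adv = np.array(rewards) - np.mean(rewards)   # baseline lowers variance
    for a, x, eps in zip(adv, feats, noises):
        # grad_theta log N(raw; mu, SIGMA^2) = (raw - mu)/SIGMA^2 * x^T
        grad += a * np.outer(eps / SIGMA**2, x)
    theta += LR * grad / BATCH                   # gradient ascent step
```

Since each time step is independent, a "trajectory" here degenerates to a single step, and the trained policy should beat the untrained one (equal split at half power) on fresh channel draws.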

Of course, since you didn't give much detail on how other papers use RL, I can't really say much else. It does seem like unnecessary, and possibly inefficient, overkill for the problem you have described here, though.