I have the following experiment and I'm struggling to find the optimal solution under some assumptions:
An agent has to complete a number of trials, and her objective is to maximize the average reward rate. In each trial, a reward $R=1$ (its exact value doesn't matter) becomes available after some time $T$, measured from the beginning of the trial, that follows an exponential distribution with rate $\lambda$. The agent can wait to collect the reward, or forfeit the trial and move on to the next. Between trials there is always a fixed inter-trial interval $I$, regardless of what the agent did.
What I want to obtain is the optimal decision rule as a function of $I$, assuming the agent knows $\lambda$. If I'm not mistaken, for a single trial the expected reward rate as a function of the waiting time $w$ is:
$$ \frac{P(T \le w)}{w+I} = \frac{1-\exp(-\lambda w)}{w+I} $$
which has a maximum at
$$ w^{*} = \frac{-W_{-1}\!\left(-e^{-(\lambda I+1)}\right)-(\lambda I+1)}{\lambda}, $$
where $W_{-1}$ is the lower real branch of the Lambert W function (the product logarithm). This follows from setting the derivative to zero, which gives the condition $e^{-\lambda w}(\lambda w+\lambda I+1)=1$. So one could say that the optimal strategy for a single trial is to wait until $w^{*}$.
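As a quick numerical sanity check (my own sketch; the parameter values are arbitrary), the closed-form $w^{*}$ can be compared against a brute-force maximization of the single-trial rate. The key detail is selecting the lower branch of the Lambert W function (`k=-1` in SciPy):

```python
import numpy as np
from scipy.special import lambertw
from scipy.optimize import minimize_scalar

lam, I = 1.5, 2.0  # example values for the reward rate and the inter-trial interval

# Closed-form maximizer of (1 - exp(-lam*w)) / (w + I),
# using the lower real branch W_{-1} of the Lambert W function.
w_star = (-lambertw(-np.exp(-(lam * I + 1)), k=-1).real - (lam * I + 1)) / lam

# Brute-force check: maximize the rate over w > 0.
rate = lambda w: (1 - np.exp(-lam * w)) / (w + I)
res = minimize_scalar(lambda w: -rate(w), bounds=(1e-9, 50), method="bounded")

print(w_star, res.x)  # the two maximizers should agree to high precision
```

For these parameters both approaches give $w^{*} \approx 1.17$.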
However, I'm unsure how to extend this to an arbitrarily large number of trials $N$. Would the same strategy apply for every trial?
Since $T$ is exponential (memoryless), the distribution of the remaining waiting time, given the time already spent in the current trial, is the same as in a fresh trial: the remaining wait is always distributed as $T$. But if you forfeit and restart, your wait until the next possible reward is distributed as $T+I$. So never forfeiting is an optimal policy for any $I>0$.
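This can be checked by simulation (my own sketch; the policy with a finite cutoff collects the reward whenever $T$ arrives before the cutoff, otherwise forfeits, and either way pays the inter-trial interval $I$). With $\lambda=1$ and $I=2$, never forfeiting should achieve the rate $\lambda/(1+\lambda I) = 1/3$, beating any finite cutoff, including the single-trial optimum $w^{*}\approx 1.5$ for these parameters:

```python
import random

def avg_rate(cutoff, lam=1.0, I=2.0, n_trials=200_000, seed=0):
    """Long-run reward rate of the policy: wait at most `cutoff` in each trial
    (None = wait forever), then move on; every trial costs a further I."""
    rng = random.Random(seed)
    reward, total_time = 0, 0.0
    for _ in range(n_trials):
        t = rng.expovariate(lam)
        if cutoff is None or t <= cutoff:
            reward += 1
            total_time += t + I
        else:
            total_time += cutoff + I
    return reward / total_time

rate_wait = avg_rate(None)  # never forfeit
rate_quit = avg_rate(1.5)   # forfeit at the single-trial optimum w*
print(rate_wait, rate_quit)
```

The never-forfeit rate comes out close to $1/3$, while the finite-cutoff policy does strictly worse, consistent with the memorylessness argument.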