I am working with a multi-armed bandit problem with finitely many arms, say $K\geq 2$ of them. (For the specific problem I am looking at, each arm is a homogeneous, ergodic Markov chain on a finite state space $S$, common to all the arms.)
At each time instant $t=0,1,2,\ldots$, only one of the arms may be selected, and a state transition is then observed on the selected arm. The unobserved arms also undergo state evolution, so there is a certain 'delay' associated with observing each arm. For any given arm, its delay is a positive integer denoting how long it has been since that arm was last observed. For example, if the delay of arm $1$ is $5$, arm $1$ was last selected $5$ time instants ago (with reference to the current time) and has not been selected since. The smallest possible delay for any arm is $1$, signifying that this particular arm was sampled in the immediately preceding time instant.
Suppose that $$\lambda(a,d),\quad a=1,2,\ldots,K,\quad d\geq 1$$ denotes a joint probability distribution on arms and delays. At any given time $t\in\{0,1,2,\ldots\}$, only the arm selected at $t-1$ has delay $1$, while the delays of the remaining arms each increase by $1$ from their previous values.
I am interested in simulating the above setup in MATLAB. To begin with, I select each of the $K$ arms once. After that, I need to carefully select the next arm based on the information about the arm delays. I would like the empirical distribution of arms and delays in my simulation to converge to the above $\lambda(a,d)$ as the number of iterations grows.
I am stuck on what probability distribution to use for choosing an arm at any given time, based on the currently available arm-selection and delay information.
It would be helpful if anyone can give me some ideas.
PS: I will share here some of the approaches I have already tried.
Approach 1: Suppose $d_1(t),d_2(t),\ldots,d_K(t)$ denote the arm delays at time $t$ (among these, exactly one is $1$). Then, I tried selecting the arm at time $t+1$ according to the distribution $$\frac{\lambda(a,d_a(t))}{\sum\limits_{b=1}^{K}\lambda(b,d_b(t))}.$$ However, this does not make the empirical distribution of arms and delays match $\lambda(a,d)$ in the long run.
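For concreteness, here is how I implemented Approach 1 (shown as a Python sketch rather than MATLAB; the function names and the example $\lambda$ are my own, purely for illustration):

```python
import random

def approach1_probs(lam, delays):
    """Approach 1: weight each arm by lambda(a, d_a(t)), then normalize.

    lam(a, d) is the target joint distribution; delays[a] is arm a's delay.
    """
    w = [lam(a, delays[a]) for a in range(len(delays))]
    total = sum(w)
    return [x / total for x in w]

def sample_arm(probs, rng=random):
    """Draw an arm index from the probability vector `probs`."""
    u, acc = rng.random(), 0.0
    for a, pa in enumerate(probs):
        acc += pa
        if u < acc:
            return a
    return len(probs) - 1

# Hypothetical example lambda: lam(a, d) = p_a^2 (1 - p_a)^(d - 1)
p = [0.3, 0.7]
lam = lambda a, d: p[a] ** 2 * (1 - p[a]) ** (d - 1)
probs = approach1_probs(lam, delays=[1, 2])
```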
Approach 2: I also tried the following: select the arm at time $t+1$ according to $$\text{arm at time }(t+1)=\arg\min\limits_{a\in\{1,2,\ldots,K\}}||\lambda_t(a,\cdot)-\lambda(a,\cdot)||_{\infty},$$ where $\lambda_t(a,d)$ denotes the empirical probability (at time $t$) of arm $a$ being selected when its delay is $d$. This too does not make the empirical distribution converge to $\lambda(a,d)$ in the long run.
(Skip ahead to "Solution starts here" if you want to skip the buildup.)
One thing to keep in mind is that the joint probability distribution $\lambda$ needs to satisfy some constraints apart from the usual conditions on probability distributions.
The distribution of delays for a given arm, $P(d|a)$, directly affects $P(a)$. To see this intuitively, note that for a given arm $a_i$, if $\lambda(a_i, 1)$ is non-zero and $\lambda(a_i, j) = 0$ for all $j>1$, then $P(a_i)$ can only be either $0$ or $1$: either $a_i$ is never picked, or it must be picked at every step after the first time it is picked.
The actual condition on $\lambda$... (I'll assume that every arm has a non-zero probability of being picked, i.e., that each arm $a$ has at least one delay $d$ such that $\lambda(a, d) > 0$)
The expected delay between two consecutive picks of arm $a_i$ is given by: $$\langle d\rangle_{a_i} = \frac{\sum\limits_{d}~d~\lambda(a_i, d)}{\sum\limits_{d}~\lambda(a_i, d)}$$ The expected number of times arm $a_i$ appears in the long run is (number of steps)$/\langle d\rangle_{a_i}$, i.e. $P(a_i) = 1/\langle d\rangle_{a_i}$. This can be written as $$\sum\limits_{d}~\lambda(a_i, d) = \frac{\sum\limits_{d}~\lambda(a_i, d)}{\sum\limits_{d}~d~\lambda(a_i, d)} \Rightarrow \sum\limits_{d}~d~\lambda(a_i, d) = 1$$
So, $\lambda(a, d)$ needs to satisfy the following two conditions: $$\sum\limits_{a, d} \lambda(a, d) = 1$$ $$\sum\limits_{d}~d~\lambda(a, d) = 1~~~~ \forall a$$
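Both conditions are easy to check numerically. Here is a Python sketch using an illustrative choice (my own, not from the question) $\lambda(a,d) = p_a^2(1-p_a)^{d-1}$, which is the joint arm/delay distribution you get from picking arm $a$ i.i.d. with probability $p_a$ at each step; the infinite delay support is truncated at `DMAX`:

```python
# Numerically verify the two conditions on lambda for an example.
# Hypothetical choice: lambda(a, d) = p_a^2 (1 - p_a)^(d - 1), i.e. the
# joint arm/delay distribution from i.i.d. arm picks with probabilities p_a.
DMAX = 400          # truncation of the infinite delay support
p = [0.3, 0.7]      # arm-selection probabilities, must sum to 1

def lam(a, d):
    return p[a] ** 2 * (1 - p[a]) ** (d - 1)

K = len(p)
total_mass = sum(lam(a, d) for a in range(K) for d in range(1, DMAX + 1))
weighted = [sum(d * lam(a, d) for d in range(1, DMAX + 1)) for a in range(K)]

print(total_mass)   # condition 1: ~ 1
print(weighted)     # condition 2: each entry ~ 1
```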
I can't see just yet whether or not either of your approaches will work if these constraints are satisfied by $\lambda$. I'll get back to you soon...
EDIT: The following should work to simulate arm selection:
The intuition for what follows is to view the current state as $(D_{a_1}, \dots, D_{a_K})$, the times since $(a_1, \dots,a_K)$ were last picked. At each step, the system can be thought of as moving from one such state to another.
The probability distribution of delay $d$ after which a given arm $a$ will be chosen is given by $$p(d|a) = \frac{\lambda(a, d)}{\sum\limits_{d'} \lambda (a, d')}$$ Let $\eta(a, D)$ be the probability that asymptotically, at a randomly chosen time (before making the selection for that step), it has been $D$ steps since $a$ was last chosen. It is given by (using the form of $p(d|a)$) $$\eta(a, D) = \left[\sum\limits_{d=1}^{\infty}\lambda(a, d)\right]\times \left[1 - \frac{\sum\limits_{d=1}^{D-1}\lambda(a, d)}{\sum\limits_{d=1}^{\infty}\lambda(a, d)}\right] = \sum\limits_{d=1}^{\infty}\lambda(a, d) - \sum\limits_{d=1}^{D-1}\lambda(a, d)$$ You can verify that the condition on $\lambda$ mentioned above ensures that $\sum_D \eta(a, D) = 1$ for all $a$. To clarify how $D$ is counted...
$\eta(a, 1)$ is the probability that $a$ was picked in the previous step
$\eta(a, 2)$ is the probability that $a$ was picked two steps ago, and wasn't picked in the last step, and so on.
If the selection process reproduces $\eta(a, D)$ for each $a$, then it reproduces $\lambda(a, d)$ as well.
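In code, $\eta$ is just the tail-sum table of $\lambda$ over delays, and the claim $\sum_D \eta(a,D) = 1$ can be checked directly. A Python sketch, with a hypothetical truncated $\lambda$ of my own choosing:

```python
DMAX = 400
p = [0.3, 0.7]      # hypothetical lambda(a, d) = p_a^2 (1 - p_a)^(d - 1)

def lam(a, d):
    return p[a] ** 2 * (1 - p[a]) ** (d - 1)

def eta_table(a):
    """eta(a, D) for D = 1..DMAX, where eta(a, D) = sum_{d >= D} lambda(a, d)."""
    etas, tail = [0.0] * (DMAX + 1), 0.0
    for d in range(DMAX, 0, -1):   # accumulate tail sums from the right
        tail += lam(a, d)
        etas[d] = tail
    return etas[1:]                # index 0 unused; return eta(a, 1..DMAX)

# The condition sum_d d*lam(a, d) = 1 is exactly sum_D eta(a, D) = 1:
sums = [sum(eta_table(a)) for a in range(len(p))]
print(sums)  # each ~ 1
```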
Solution starts here
Let $\mathbb{P}(D_{a_1},\dots, D_{a_K})$ be the probability of a given state of the system (asymptotically at a randomly chosen time) represented by the number of steps $D_{a_i}$ it has been since each arm $a_i$ was chosen. There are many distributions, $\mathbb{P}$, that'll work to reproduce $\eta$, but we can pick
$$\mathbb{P}(D_1, \dots, D_K) = \prod\limits_{a_i}\eta(a_i, D_{a_i}) = \prod\limits_{a_i}\left[\sum\limits_{d=1}^{\infty}\lambda(a_i, d) - \sum\limits_{d=1}^{D_{a_i}-1}\lambda(a_i, d)\right]$$
To simulate the next arm selection, see what state each of the $K$ possible selections would lead to ($D$ will become $1$ for the chosen arm, and will increment by $1$ for all other arms). At each step, choose arm $a$ with probability $$\mathrm{Prob}(a) = \frac{\mathbb{P}(\text{state resulting from choosing arm } a)}{\sum\limits_b{\mathbb{P}(\text{state resulting from choosing arm }b)}}$$ with $\mathbb{P}$ given above.
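A direct (unoptimized) Python sketch of this selection step, with a hypothetical truncated $\lambda$: evaluate $\mathbb{P}$ for each of the $K$ candidate next states and normalize.

```python
DMAX = 400
p = [0.3, 0.7]      # hypothetical lambda(a, d) = p_a^2 (1 - p_a)^(d - 1)

def lam(a, d):
    return p[a] ** 2 * (1 - p[a]) ** (d - 1) if d <= DMAX else 0.0

def eta(a, D):
    """eta(a, D) = sum_{d >= D} lambda(a, d)."""
    return sum(lam(a, d) for d in range(D, DMAX + 1))

def state_prob(delays):
    """P(D_1, ..., D_K) = prod_a eta(a, D_a)."""
    out = 1.0
    for a, D in enumerate(delays):
        out *= eta(a, D)
    return out

def selection_probs(delays):
    """Prob(a): normalized P of the state reached by choosing arm a."""
    K = len(delays)
    w = []
    for a in range(K):
        nxt = [1 if b == a else delays[b] + 1 for b in range(K)]
        w.append(state_prob(nxt))
    total = sum(w)
    return [x / total for x in w]

probs = selection_probs([1, 2])   # state where arm 0 was picked last step
```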
How to make the computation of $\mathrm{Prob}(a)$ computationally tractable:
If a closed form doesn't exist for the infinite sum, for each $a_i$ we can pre-compute $$\Lambda^*(a_i) = \sum\limits_{d=1}^{N}\lambda(a_i, d)$$ for some large $N$. During each generative step, update this by adding some more terms to the sum ($O(K)$ operations in each step, since there are $K$ such sums to be updated). In the asymptotic limit, $\Lambda^*(a_i)$ will become a good approximation to the infinite sum.
Similarly, at each step keep a table of the sum up to $D_{a_i}-1$ for each $a_i$. The possible changes to this sum caused by choosing any arm in the next step can then be computed in $O(K)$ operations (either add an extra term or reset to $0$; again, there are $K$ sums to be updated).
Having computed the possible changes to the summation terms in $O(K)$ time, it is possible to calculate $\mathbb{P}$ for the possible states resulting from choosing any of the $K$ arms in $O(K)$ steps. The key here is to note that most of the terms in the product will be shared by all possible new states. In fact, it is possible to simplify $\mathrm{Prob}(a)$ by pulling these common terms out in the numerator and denominator and cancelling them, thereby eliminating the product computation entirely. Doing this will lead to the following:
Better solution
If the current state is $(D_{a_1}, \dots, D_{a_K})$, choose $a$ with probability proportional to $$\mathrm{Prob}(a) \propto \frac{\sum\limits_{d=1}^{\infty} \lambda(a, d)}{\sum\limits_{d=1}^{\infty} \lambda(a, d) - \sum\limits_{d=1}^{D_a} \lambda(a, d)}$$ normalized so that the sum of $\mathrm{Prob}$ over all arms is $1$. The sums in this can be kept track of, as mentioned above, in $O(K)$ operations per step.
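A runnable end-to-end sketch of this rule (Python rather than MATLAB, but it translates line by line; the $\lambda$ here is a hypothetical example with support truncated at `DMAX`, and a small floor guards against the tail sum underflowing to zero):

```python
import random

DMAX = 400
p = [0.3, 0.7]   # hypothetical lambda(a, d) = p_a^2 (1 - p_a)^(d - 1)
K = len(p)

def lam(a, d):
    return p[a] ** 2 * (1 - p[a]) ** (d - 1) if d <= DMAX else 0.0

# Lambda_star[a] ~ sum_{d >= 1} lambda(a, d), precomputed once
Lambda_star = [sum(lam(a, d) for d in range(1, DMAX + 1)) for a in range(K)]

def simulate(T, seed=0):
    rng = random.Random(seed)
    # warm-up: arms 0..K-1 picked once in order, so arm K-1 has delay 1
    delays = [K - a for a in range(K)]
    # partial[a] = sum_{d=1}^{D_a} lambda(a, d), maintained incrementally
    partial = [sum(lam(a, d) for d in range(1, delays[a] + 1))
               for a in range(K)]
    picks = []
    for _ in range(T):
        # Prob(a) proportional to Lambda_star[a] / (Lambda_star[a] - partial[a])
        w = [Lambda_star[a] / max(Lambda_star[a] - partial[a], 1e-300)
             for a in range(K)]
        u, acc, choice = rng.random() * sum(w), 0.0, K - 1
        for a in range(K):
            acc += w[a]
            if u < acc:
                choice = a
                break
        picks.append(choice)
        for a in range(K):          # O(K) bookkeeping per step
            if a == choice:
                delays[a], partial[a] = 1, lam(a, 1)
            else:
                delays[a] += 1
                partial[a] += lam(a, delays[a])
    return picks

picks = simulate(5000)
```

From `picks` one can tabulate the empirical joint distribution of (arm, delay) and compare it against $\lambda(a,d)$ as the number of iterations grows.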
PS: It's been nearly 7 years since I was on this site, and this question made my morning. Thanks! And hopefully this solution works (my first attempt (which I didn't post) was wrong, so there's that)