Wald meets Weitzman/Gittins


Suppose a searcher is faced with $n$ objects. Each object's quality is an i.i.d. Bernoulli random variable $\Theta_{i}$, $i = 1, \dots, n$, where $\mathbb{P}\left(\Theta_{i} = 1\right) = \mu_{0}$.

Time is continuous, and at each instant the searcher may sample one object: if she samples object $i$, she observes a process $(Z^{i}_{t})_{t\geq 0}$. While object $i$ is being sampled, the increment of $Z^{i}_{t}$ is the sum of a drift equal to the state and a noise term, the increment of a Brownian motion $(W^{i}_{t})_{t \geq 0}$: $$dZ^{i}_{t} = \theta_{i}\, dt + \sigma\, dW^{i}_t$$

For simplicity, I've assumed the volatility of each object $i$ is identical, $\sigma$ (more generally, each could have its own volatility $\sigma^{i}$). Let $\mu^{i}_{t}$ denote the agent's posterior belief that object $i$ has quality $1$: $$\mu^{i}_{t} := \mathbb{P}\left(\Theta_{i} = 1 \mid \left(Z^{i}_{s}\right)_{s \leq t}\right)$$ If object $i$ is not sampled on the interval $[s,s']$, then $\mu^{i}_{s} = \mu^{i}_{s'}$, i.e. she does not learn about its quality. Thus, she can learn about only one object per instant.
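To make the belief dynamics concrete: by a standard likelihood-ratio argument, the posterior depends on the sampled path only through the current value $Z^{i}_{t}$ and the elapsed sampling time $t$, via $\mu^{i}_{t} = \frac{\mu_{0}L}{\mu_{0}L + 1 - \mu_{0}}$ with $L = \exp\!\left(\left(Z^{i}_{t} - t/2\right)/\sigma^{2}\right)$. Here is a small illustrative sketch (my own, not from any reference) that simulates one object with an Euler–Maruyama discretization and cross-checks this closed form against step-by-step Bayes updates:

```python
import math
import random

def simulate_posterior(theta, mu0=0.5, sigma=1.0, T=2.0, dt=0.001, seed=0):
    """Simulate dZ = theta dt + sigma dW and track P(Theta = 1 | path).

    The posterior depends on the path only through (Z_t, t):
        mu_t = mu0 * L / (mu0 * L + 1 - mu0),  L = exp((Z_t - t/2) / sigma**2).
    We also run a step-by-step Bayes update as a cross-check.
    """
    rng = random.Random(seed)
    z, t = 0.0, 0.0
    mu = mu0  # recursive Bayes update
    for _ in range(int(T / dt)):
        dz = theta * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        z += dz
        t += dt
        # One-step Bayes: the increment is N(theta*dt, sigma^2 dt) under each hypothesis
        l1 = math.exp(-(dz - dt) ** 2 / (2 * sigma**2 * dt))  # theta = 1
        l0 = math.exp(-(dz) ** 2 / (2 * sigma**2 * dt))       # theta = 0
        mu = mu * l1 / (mu * l1 + (1 - mu) * l0)
    L = math.exp((z - t / 2) / sigma**2)
    mu_closed = mu0 * L / (mu0 * L + 1 - mu0)
    return mu, mu_closed
```

The two numbers agree (up to floating-point error) because the product of per-increment Gaussian likelihood ratios telescopes to $\exp\!\left(\left(Z_{t} - t/2\right)/\sigma^{2}\right)$.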

At each instant the agent must decide whether to sample some object $i$ or to stop and select one of the objects. Selecting an object ends the scenario. When sampling object $i$, the agent incurs a bounded, positive flow cost $c\left(\cdot\right)$ that depends on her posterior belief, i.e. if she samples from object $i$ between times $s$ and $s'$, she pays cost $$\int_{s}^{s'}c\left(\mu^{i}_{t}\right)dt$$ More generally, each object could have a different flow cost, $c^{i}\left(\cdot\right)$, but for now I suppose they're all the same.

The agent's payoff from stopping and selecting object $i$ at time $\tau$ is $$\mu^{i}_{\tau} - \sum_{j=1}^{n}\int_{A_{j}}c\left(\mu_{t}^{j}\right)dt$$ where $A_{j} \subseteq [0,\tau]$ is the set of times at which object $j$ was sampled (the $A_{j}$ partition $[0,\tau]$, since cost accrues only while sampling).
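As a sanity check on this payoff expression, here is a Monte-Carlo sketch of one (deliberately naive, certainly non-optimal) policy: sample each object for a fixed duration, accrue flow cost only while sampling, then select the object with the highest posterior. The constant cost $c(\mu) \equiv c$, the fixed-time rule, and all parameter values are my own illustrative assumptions:

```python
import math
import random

def run_fixed_time_policy(n=3, mu0=0.5, sigma=1.0, c=0.05, t_each=1.0, dt=0.01, seed=1):
    """Realized payoff mu_tau - total sampling cost under a naive policy:
    sample each of n objects for t_each, then select the highest posterior.
    """
    rng = random.Random(seed)
    thetas = [1 if rng.random() < mu0 else 0 for _ in range(n)]  # i.i.d. Bernoulli(mu0)
    posteriors = []
    total_cost = 0.0
    for theta in thetas:
        z, t = 0.0, 0.0
        for _ in range(int(t_each / dt)):
            z += theta * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            t += dt
            total_cost += c * dt  # constant flow cost, accrued only while sampling
        # Posterior via the likelihood-ratio closed form (sufficient statistic (Z_t, t))
        L = math.exp((z - t / 2) / sigma**2)
        posteriors.append(mu0 * L / (mu0 * L + 1 - mu0))
    best = max(range(n), key=lambda i: posteriors[i])
    return posteriors[best] - total_cost
```

Averaging this return over many seeds would estimate the policy's expected payoff; comparing such estimates across candidate policies is one crude way to probe whether an index rule could do better.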

The classic Wald (1945) problem has just one object and a constant flow cost. This set-up is also clearly related to the multi-armed bandit problem, in which at each instant/period an arm is selected and a flow payoff is received. Here, at each instant an arm is selected and "flow learning" occurs.

Also closely related is the sequential search problem of Weitzman (1979).

Here are my questions:

  1. Has this problem been studied? This seems likely, since it is a natural question and closely related to two rich literatures: the multi-armed bandit literature and the literature on Wald problems.
  2. Is an index policy optimal?

I'd appreciate any references or suggestions.