Optimal control of posterior belief over a finite horizon


$\large \textbf{Preface:}\ $ Below I describe a dynamic programming problem I am not sure how to formalize. In short: a (Bayesian-updating) agent sequentially runs costly experiments over a finite number of periods before predicting an unknown. The agent essentially wants to "control" the posterior belief they will have at the time they have to make the prediction. (It is sort of like a finite-horizon bandit problem but with $\textit{only}$ a terminal payoff).

The problem is described in detail below; this is followed by my questions and my progress thus far. I employ footnotes (denoted $^{[n]}$, $n\in\mathbb{N}$) to "declutter'' while preserving details you may find useful.

Thank you very much to anyone who takes the time to consider my question. Please let me know if any clarification/further detail is needed.

$\large \textbf{Problem Description} $ This problem essentially considers an $\textit{agent}$ who learns about a random variable ("state") $Y$ over $T\in\mathbb{N}$ time periods. First, $Y\sim\text{Bernoulli}(\mu)$ is drawn; the agent does not observe the realization $y\in\{0,1\}$ of $Y$ but has a "correct" prior belief.$^{[1]}$ Each $\textit{period}$ $t\in\{1,2,...,T\}$ has the following timing:

  1. First, the agent chooses $\alpha_t\in[0,1]$ at cost $c(\alpha_t)$.$^{[2]}$
  2. Agent observes realization $z_{t}\in\{0,1\}$ of random variable ("signal") $Z_t$ with conditional pmf $\psi(\cdot|y,\alpha_t)$. Assume that \begin{equation} \psi(1|y,\alpha_t) = \text{Prob}\{Z_t=1|Y=y,\alpha_t\} = \begin{cases} \frac{1}{2}-\left(\beta-\frac{1}{2}\right)\alpha_t , & \text{if } y = 0\\ \frac{1}{2}+\left(\beta-\frac{1}{2}\right)\alpha_t , & \text{if } y = 1 \end{cases}, \qquad (1) \end{equation} where $\beta\in (.5,1)$; of course, $\psi(0|y,\alpha_t)=1-\psi(1|y,\alpha_t)$.$^{[3]}$
  3. The agent forms a posterior "belief" (distribution) about $Y$ using Bayes' rule. Denote this by $b_t(\cdot|\boldsymbol{h}_t)$, where $\boldsymbol{h}_t \equiv (\alpha_1,z_1,...,\alpha_t,z_t)$ is the "history" of past signal realizations and $\alpha_{\bullet}$ choices. (In a slight abuse of notation, let $\boldsymbol{h}_0\equiv\emptyset$.)
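For concreteness, the per-period timing above can be sketched in code. This is only an illustration: the value $\beta = 0.75$ and all function names are my own choices, and the cost $c(\cdot)$ is omitted since it does not affect the updating itself.

```python
import random

BETA = 0.75  # illustrative value of beta in (1/2, 1)

def psi1(y, alpha, beta=BETA):
    """psi(1 | y, alpha) from equation (1)."""
    shift = (beta - 0.5) * alpha
    return 0.5 - shift if y == 0 else 0.5 + shift

def draw_signal(y, alpha, beta=BETA):
    """Step 2: draw the realization z_t of Z_t given state y and choice alpha_t."""
    return 1 if random.random() < psi1(y, alpha, beta) else 0

def bayes_update(mu, z, alpha, beta=BETA):
    """Step 3: posterior Prob{Y=1} after observing z_t, starting from belief mu."""
    like1 = psi1(1, alpha, beta) if z == 1 else 1 - psi1(1, alpha, beta)
    like0 = psi1(0, alpha, beta) if z == 1 else 1 - psi1(0, alpha, beta)
    return mu * like1 / (mu * like1 + (1 - mu) * like0)
```

Note that $\alpha_t=0$ makes the experiment uninformative (the belief is unchanged by any $z_t$), while $\alpha_t=1$ yields the most informative signal.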

Finally, in period $T+1$, the agent chooses an $\textit{action}$ $a\in\{0,1\}$ and earns payoff $u_{a y}$.$^{[4]}$ Assume that $\min\{u_{11},u_{00}\}>\max\{u_{01},u_{10}\}$ (i.e. wants to "match the state"). Given belief $b_T(\cdot|\boldsymbol{h}_T)$, their expected payoff is \begin{equation} W(a;b_T(\cdot|\boldsymbol{h}_T)) := \mathbb{E}_{_{Y\sim b_T(\cdot|\boldsymbol{h}_T)}} u_{aY} = b_T(0|\boldsymbol{h}_T)u_{a0}+b_T(1|\boldsymbol{h}_T) u_{a1}. \qquad (2) \end{equation}

The agent does not discount time. Their objective is to maximize their expected terminal payoff, net of experimentation costs. What this means for their choice of $a$ at time $T+1$ is easy (choose $a$ to maximize (2) given the belief; the solution is given in (3)). What is giving me trouble is formulating the experimentation phase, i.e. the choice of $\alpha_1,...,\alpha_T$.

$\large \textbf{My Questions:} $ how does one formulate the above problem as a dynamic optimization problem? Is a (closed-form) solution known?$^{[5] \ \leftarrow\ \text{(possible “quality of life" improvements)} }$

My attempt at answering the first question is in equation (5), below.

An extra big thank you to anyone who made it this far!

$\ $

$\large \textbf{My Attempt} $ First off, the optimal choice of $a$ at time $T+1$ is very easy to characterize given belief $b_T\equiv b_T(\cdot|\boldsymbol{h}_T)$. Let $\mu_T \equiv \mu_T(\boldsymbol{h}_T)$ denote the probability that belief $b_T$ assigns to the event "$Y=1$''. Then the optimal choice $a^*(\mu_T)$ is given by \begin{equation} a^*(\mu_T) = \begin{cases} 0, & \text{ if } \mu_T < \frac{u_{00}-u_{10}}{(u_{11}-u_{01})+(u_{00}-u_{10})}\\ 1, & \text{otherwise}^{[6]} \end{cases} \qquad (3) \end{equation} which yields expected payoff \begin{equation} W^*(\mu_T) = \begin{cases} \left[1-\mu_T\right]u_{00}+\mu_T u_{01}, & \text{ if } \mu_T < \frac{u_{00}-u_{10}}{(u_{11}-u_{01})+(u_{00}-u_{10})}\\ \left[1-\mu_T\right]u_{10}+\mu_T u_{11}, & \text{otherwise} \end{cases} \qquad (4) \end{equation}
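Equations (3) and (4) translate directly to code (a small sketch with my own function names; the payoffs are passed in explicitly):

```python
def a_star(mu, u00, u01, u10, u11):
    """Optimal terminal action from (3); ties broken in favor of a = 1."""
    threshold = (u00 - u10) / ((u11 - u01) + (u00 - u10))
    return 0 if mu < threshold else 1

def w_star(mu, u00, u01, u10, u11):
    """Expected payoff (4) of the optimal terminal action given belief mu."""
    if a_star(mu, u00, u01, u10, u11) == 0:
        return (1 - mu) * u00 + mu * u01
    return (1 - mu) * u10 + mu * u11
```

Note that $W^*$ is the upper envelope of two linear functions of $\mu_T$, hence piecewise linear and convex; this convexity is what makes (costly) information valuable to the agent.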

$\boldsymbol{Now}\ \boldsymbol{for}\ \boldsymbol{the}\ \boldsymbol{hard}\ \boldsymbol{part}.$ Let $\boldsymbol{\alpha}_{r:s} := (\alpha_r,...,\alpha_s)\in[0,1]^{s-r+1}$ for $1\leq r < s \leq T$. Similarly define $\boldsymbol{Z}_{r:s}$ and $\boldsymbol{z}_{r:s}$, the vectors of signals and realizations, respectively, from time $r$ to $s$. Given choices $\boldsymbol{\alpha}_{1:t}$ and observations $\boldsymbol{Z}_{1:t}=\boldsymbol{z}_{1:t}$, the agent believes "$Y=1$'' with probability $$ \mu_t(\alpha_1,Z_1,...,\alpha_t,Z_t) = \frac{\mu \prod_{\tau=1}^{t} \psi(Z_{\tau}|1,\alpha_\tau)}{\mu \prod_{\tau=1}^{t} \psi(Z_{\tau}|1,\alpha_\tau)+(1-\mu) \prod\limits_{\tau=1}^{t} \psi(Z_{\tau}|0,\alpha_\tau)} =\frac{1}{1+\frac{1-\mu}{\mu} \prod_{\tau=1}^{t} \frac{\psi(Z_{\tau}|0,\alpha_\tau)}{\psi(Z_{\tau}|1,\alpha_\tau)} }$$ I $\textit{think}$ the above equation is correct, and that it can be rewritten recursively as $$ \mu_t(\boldsymbol{h}_{t}) = \frac{\mu_{t-1}(\boldsymbol{h}_{t-1}) \cdot \psi(z_{t}|1,\alpha_t) }{\mu_{t-1}(\boldsymbol{h}_{t-1})\cdot\psi(z_{t}|1,\alpha_t) + [1-\mu_{t-1}(\boldsymbol{h}_{t-1})]\, \psi(z_{t}|0,\alpha_t) } = \frac{1}{1+\frac{1-\mu_{t-1}(\boldsymbol{h}_{t-1})}{\mu_{t-1}(\boldsymbol{h}_{t-1})} \frac{\psi(z_{t}|0,\alpha_t)}{\psi(z_{t}|1,\alpha_t)} } .$$ In addition, the following also seems to be true: $$ \mu_t(\boldsymbol{h}_{t}) = \frac{\mu_s(\boldsymbol{h}_s) \prod_{\tau = s+1}^{t} \psi(Z_\tau|1,\alpha_\tau)}{\mu_s(\boldsymbol{h}_s) \prod_{\tau = s+1}^{t} \psi(Z_\tau|1,\alpha_\tau)+ [1-\mu_s(\boldsymbol{h}_s)] \prod_{\tau = s+1}^{t} \psi(Z_\tau|0,\alpha_\tau)} = \frac{1}{1+\frac{1-\mu_s(\boldsymbol{h}_s)}{\mu_s(\boldsymbol{h}_s)}\prod_{\tau=s+1}^t \frac{\psi(Z_\tau|0,\alpha_\tau)}{\psi(Z_\tau|1,\alpha_\tau)} } $$ where $1\leq s<t\leq T$.
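The batch and recursive formulas can be sanity-checked against each other numerically (a quick script of mine, with $\beta=0.75$ as an illustrative value):

```python
import random

def psi(z, y, alpha, beta=0.75):
    """psi(z | y, alpha) implied by equation (1)."""
    p1 = 0.5 + (beta - 0.5) * alpha if y == 1 else 0.5 - (beta - 0.5) * alpha
    return p1 if z == 1 else 1 - p1

def mu_batch(mu0, history, beta=0.75):
    """mu_t from the full history [(alpha_1, z_1), ..., (alpha_t, z_t)] in one pass."""
    like1 = like0 = 1.0
    for alpha, z in history:
        like1 *= psi(z, 1, alpha, beta)
        like0 *= psi(z, 0, alpha, beta)
    return mu0 * like1 / (mu0 * like1 + (1 - mu0) * like0)

def mu_recursive(mu0, history, beta=0.75):
    """mu_t updated one signal at a time via the recursive formula."""
    mu = mu0
    for alpha, z in history:
        num = mu * psi(z, 1, alpha, beta)
        mu = num / (num + (1 - mu) * psi(z, 0, alpha, beta))
    return mu

random.seed(1)
history = [(random.random(), random.randint(0, 1)) for _ in range(8)]
assert abs(mu_batch(0.4, history) - mu_recursive(0.4, history)) < 1e-12
```

The two agree up to floating-point error, which supports the recursive form (and the idea that $\mu_t$ summarizes the history $\boldsymbol{h}_t$).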

This is the point at which I have difficulty making progress. I am not sure how to write the belief over $\mu_T$ from the agent's perspective at the beginning$^{[7]}$ of time $t$; denote this by $\eta_t(\cdot|{\color{red} {\textbf{unsure}} })$. Moreover, I am simply unsure how to write the whole dynamic optimization problem, or make it recursive. My best attempt so far is as follows:

For each $t\in\{1,...,T\}$, let \begin{equation} \begin{array}{rcl} V_t({\color{red} {\textbf{unsure}} }) & = & \max\limits_{\alpha_t,...,\alpha_T} \mathbb{E}_{\mu_T \sim \eta_{t}(\cdot|\mu_{t-1}, \alpha_t,...,\alpha_T)} \left[ W^*(\mu_T) \right] - \sum_{\tau = t}^{T} c(\alpha_\tau) \\ & s.t. & \ \alpha_\tau \in [0,1]\ \ \forall \tau\in\{t,...,T\} \\ & & \\ \end{array} \qquad (5) \end{equation} $\ $
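One way I have tried to sanity-check a formulation like (5) is numerically: if $\mu_t$ is indeed a sufficient statistic for $\boldsymbol{h}_t$ (as the recursion above suggests), then backward induction over $(t,\mu)$ should work. Below is a rough sketch on a discretized belief grid; the matching payoffs $u_{00}=u_{11}=1$, $u_{01}=u_{10}=0$, the cost $c(\alpha)=\alpha^2$, the value $\beta=0.75$, and the grid sizes are all illustrative choices of mine, not part of the problem.

```python
def solve_dp(T, beta=0.75, cost=lambda a: a * a, n_mu=201, n_alpha=51):
    """Backward induction for V_t(mu), t = T, ..., 1, on a belief grid,
    with terminal value V_{T+1}(mu) = W*(mu) = max(mu, 1 - mu)
    (the matching payoffs u00 = u11 = 1, u01 = u10 = 0)."""
    grid = [i / (n_mu - 1) for i in range(n_mu)]
    alphas = [i / (n_alpha - 1) for i in range(n_alpha)]

    def interp(V, mu):
        # piecewise-linear interpolation of the value table V on the belief grid
        x = mu * (n_mu - 1)
        i = min(int(x), n_mu - 2)
        w = x - i
        return (1 - w) * V[i] + w * V[i + 1]

    V = [max(mu, 1 - mu) for mu in grid]  # V_{T+1} = W*
    for _ in range(T):  # steps t = T, T-1, ..., 1
        Vnew = []
        for mu in grid:
            best = -float("inf")
            for a in alphas:
                p1 = 0.5 + (beta - 0.5) * a          # psi(1 | 1, a)
                p0 = 0.5 - (beta - 0.5) * a          # psi(1 | 0, a)
                pz1 = mu * p1 + (1 - mu) * p0        # marginal Prob{Z_t = 1}
                mu_up = mu * p1 / pz1                # posterior after z_t = 1
                mu_dn = mu * (1 - p1) / (1 - pz1)    # posterior after z_t = 0
                val = pz1 * interp(V, mu_up) + (1 - pz1) * interp(V, mu_dn) - cost(a)
                best = max(best, val)
            Vnew.append(best)
        V = Vnew
    return grid, V  # V[i] approximates V_1(grid[i])
```

Each pass computes $V_t(\mu)=\max_{\alpha\in[0,1]}\ \mathbb{E}_{Z_t}\!\left[V_{t+1}(\mu')\right]-c(\alpha)$, i.e. the sequential (closed-loop) reading of the problem, in which each $\alpha_\tau$ may depend on the signal realizations observed so far rather than being fixed up front as in my open-loop attempt (5).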


$\large \textbf{Footnotes} $
$\quad$ [1] I.e. the agent's prior (regarding $Y$) is $\text{Prob}\{Y=y\}=\begin{cases} 1-\mu, & \text{if } y=0\\ \hfill \mu, & \text{if } y=1 \end{cases}$.
$\quad$ [2] Assume $c:[0,1]\to[0,\infty)$ is $\mathscr{C}^2$, strictly monotonically increasing and strictly convex
$\quad$ $\quad $ $\big(c'(\alpha_{\cdot}),c^{\prime \prime}(\alpha_{\cdot})>0\ \forall \alpha_{\cdot}\in [0,1]\big)$, and $c(0)=0$.
$\quad$ [3] $Z_\tau$ is assumed to be conditionally independent of $Z_{\tau'}$ ($\tau\neq \tau'$) given $\alpha_\tau$ and $Y$.
$\quad \quad \ \ $ Assume $\psi(\cdot|\cdot,\cdot)$ is known by the agent.
$\quad$ [4] $\textbf{Alternative Payoff Function:}$ If the "binary choice" setup makes things difficult,
$\quad \quad \ \ $ assume instead that the agent chooses $a\in[0,1]$ and earns payoff $u(a,y):=-(a-y)^2$;
$\quad \quad \ \ $ expected payoff has the analogous form to (2).
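$\quad \quad \ \ $ (In that case the analogue of (3)–(4) is immediate: $\mathbb{E}_{Y\sim b_T}\left[-(a-Y)^2\right] = -(a-\mu_T)^2-\mu_T(1-\mu_T)$,
$\quad \quad \ \ $ so $a^*(\mu_T)=\mu_T$ and $W^*(\mu_T)=-\mu_T(1-\mu_T)$.)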
$\quad$ [5] If the answer to the second question is "no", would life be a lot easier if
$\qquad \quad$ $\text{i}.\ $ terminal time $T$ was $\textit{random}$? (E.g. at the end of each period, the agent is forced to stop
$\qquad \quad \quad\ $ learning and choose $a$ with probability $\delta\in(0,1)$.)
$\qquad \quad$ $\text{ii}.\ $ Signals $Z_\bullet$ were $\textit{continuous}$ r.v.'s?
$\qquad \quad$ $\text{iii}.\ $ time was $\textit{continuous}$?
$\qquad \quad$ $\text{iv}.\ $ $T$ is a $\textit{stopping time}$ optimally chosen by the agent?
$\quad$ [6] Technically indifferent if equality holds; assume for simplicity that $a^*(\cdot)=1$ in this case.
$\quad$ [7] I.e. before $\alpha_t$ is chosen.