While trying to understand the fundamental concepts of control theory, I was reading the article "Dual Control for Approximate Bayesian Reinforcement Learning" (chapter 3.1, "A toy problem") and came across the following solutions to a very simple problem:
Consider the linear, scalar system:
$x_{k+1} = ax_k + bu_k + \xi_k$
where $x_k$ denotes the state at timestep $k$, $u_k$ the control action at timestep $k$, and $\xi_k$ is normally distributed.
Consider the following cost function: $L(x, u) = [\sum_{k=0}^T (x_k - r_k)^T W_k (x_k - r_k) + \sum_{k=0}^{T-1}u_k^TU_ku_k]$, where $r = [r_0, \dots, r_T]$ is a target trajectory and $W_k$ and $U_k$ define the state and control cost, respectively.
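To make the cost concrete, here is how I would evaluate it on a simulated trajectory in the scalar case. This is only a minimal Python sketch, assuming constant scalar weights `W` and `U`; the function name `rollout_cost` and its arguments are my own and do not come from the article.

```python
import random


def rollout_cost(a, b, x0, controls, targets, W, U, noise_std=0.0, seed=0):
    """Simulate x_{k+1} = a*x_k + b*u_k + xi_k and accumulate the quadratic cost.

    Scalar case with constant weights W and U (an assumption of this sketch).
    `targets` must have len(controls) + 1 entries: r_0, ..., r_T.
    """
    rng = random.Random(seed)
    x = x0
    cost = W * (x - targets[0]) ** 2            # state cost at k = 0
    for k, u in enumerate(controls):
        cost += U * u ** 2                      # control cost at step k
        x = a * x + b * u + rng.gauss(0.0, noise_std)
        cost += W * (x - targets[k + 1]) ** 2   # state cost at step k + 1
    return cost


# Noise-free example with horizon T = 2 and target trajectory r = 0
print(rollout_cost(0.9, 0.5, 2.0, [-1.0, 0.0], [0.0, 0.0, 0.0], W=1.0, U=0.1))
```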
If $a$ and $b$ are known, the optimal $u_k$ to drive the current state $x_k$ to zero in one step can be trivially verified to be $u_{k,\text{oracle}}^* = -\frac{abx_k}{U + b^2}$.
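I can at least confirm the oracle formula numerically. The sketch below assumes a one-step horizon with $W = 1$, $r = 0$ and zero-mean noise that does not depend on $u$, so the expected cost is $(ax + bu)^2 + Uu^2$ plus a constant. These assumptions, the parameter values, and the function names are mine, not the article's.

```python
def one_step_cost(u, x, a, b, U):
    """Expected one-step cost, dropping the constant noise-variance term."""
    return (a * x + b * u) ** 2 + U * u ** 2


def oracle_control(x, a, b, U):
    """Closed-form control quoted in the question."""
    return -a * b * x / (U + b ** 2)


a, b, U, x = 0.9, 0.5, 0.1, 2.0   # arbitrary illustrative values
u_star = oracle_control(x, a, b, U)

# Brute-force search on a grid of step 1e-3 centred on u_star
grid = [u_star + (i - 1000) * 1e-3 for i in range(2001)]
u_grid = min(grid, key=lambda u: one_step_cost(u, x, a, b, U))

print(u_star, u_grid)
```

On this grid the brute-force minimiser coincides with the closed-form value, which is consistent with the formula being the minimiser of the assumed one-step cost.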
Let the parameter $b$ now be uncertain, with current belief $p(b) = \mathcal{N}(b; \mu_k, \alpha_k^2)$ at time $k$. The naive option of simply replacing the parameter with the current mean estimate $\mu_k$ is known as certainty equivalence (CE) control in the dual control literature. The resulting control law is $u_{k,\text{ce}}^* = -\frac{a\mu_kx_k}{U + \mu_k^2}$.
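As I read it, the CE law is just the oracle law with $b$ replaced by the belief mean. A minimal Python sketch of that reading, under the same one-step assumptions ($W = 1$, $r = 0$); the value of the belief mean `mu` and all function names are hypothetical and chosen only for illustration.

```python
def oracle_control(x, a, b, U):
    """Oracle law: uses the true parameter b."""
    return -a * b * x / (U + b ** 2)


def ce_control(x, a, mu, U):
    """Certainty-equivalent law: oracle law with b replaced by the belief mean mu."""
    return oracle_control(x, a, mu, U)


def true_one_step_cost(u, x, a, b, U):
    """One-step cost evaluated with the true b (noise variance dropped)."""
    return (a * x + b * u) ** 2 + U * u ** 2


a, b_true, U, x = 0.9, 0.5, 0.1, 2.0   # arbitrary illustrative values
mu = 0.8                                # hypothetical belief mean, deliberately off

u_or = oracle_control(x, a, b_true, U)
u_ce = ce_control(x, a, mu, U)

cost_or = true_one_step_cost(u_or, x, a, b_true, U)
cost_ce = true_one_step_cost(u_ce, x, a, b_true, U)
print(u_or, u_ce, cost_or, cost_ce)
```

When `mu` equals the true `b` the two laws agree; otherwise the CE control cannot do better than the oracle on the true one-step cost, since the oracle minimises it.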
How can I derive these solutions?