Taylor Series Approximation of the Variance Function


The problem setting is offline learning from logged bandit feedback. Given a context vector $x$, a policy chooses an action $y$ according to $h_w(y \vert x)$, where $w$ is the learnable parameter of the policy. The offline data is collected using a logging policy $h_0(y \vert x)$. The objective is to optimize the mean of the importance-weighted terms $u_{w}^i = \delta_i \frac{h_w(y_i \vert x_i)}{h_0(y_i \vert x_i)}$, where $\delta_i$ is the observed feedback. I am following the paper "Counterfactual Risk Minimization: Learning from Logged Bandit Feedback", which adds a sample-variance term, $\sqrt{\text{Var}(u)}$, to the objective defined above.
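For concreteness, here is a small sketch of how I compute these quantities on logged data (the numbers and variable names are made up, not from the paper):

```python
import numpy as np

# Hypothetical logged data: per sample, we need the observed feedback delta_i
# and the two propensities h_w(y_i|x_i) and h_0(y_i|x_i); contexts are omitted.
delta = np.array([0.5, -1.0, 0.2, 0.8])     # observed feedback under the logging policy
h_w = np.array([0.30, 0.10, 0.60, 0.25])    # new policy's propensities (made up)
h_0 = np.array([0.25, 0.20, 0.50, 0.25])    # logging policy's propensities (made up)

# Per-sample importance-weighted terms u_w^i = delta_i * h_w(y_i|x_i) / h_0(y_i|x_i)
u = delta * h_w / h_0

mean_u = u.mean()                 # the importance-sampling estimate being optimized
std_u = np.sqrt(u.var(ddof=1))    # sqrt of the sample-variance term added to the objective
```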

Now the authors approximate the $\sqrt{\text{Var}(u)}$ term using a first-order Taylor expansion around a fixed parameter $w_0$ (Section 5.1, Proposition 1), which they write as:

$\sqrt{\text{Var}(u)} \leq A_{w_0} \sum_{i=1}^{n} u_{w}^i + B_{w_0} \sum_{i=1}^{n} (u_{w}^i)^2 + A_{w_0}$.
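For reference, my own starting point (this is my restatement, not the paper's notation) is the generic first-order expansion of the square root around a point $v_0$:

$$
\sqrt{v} \;\approx\; \sqrt{v_0} + \frac{v - v_0}{2\sqrt{v_0}},
$$

applied with $v = \widehat{\text{Var}}_w(u)$ and $v_0 = \widehat{\text{Var}}_{w_0}(u)$, where the sample variance can be written as

$$
\widehat{\text{Var}}_w(u) \;=\; \frac{1}{n}\sum_{i=1}^{n} (u_{w}^i)^2 \;-\; \Big(\frac{1}{n}\sum_{i=1}^{n} u_{w}^i\Big)^2 .
$$

Substituting the second display into the first gives an expression that is affine in $\widehat{\text{Var}}_w(u)$, i.e. built from $\sum_i (u_{w}^i)^2$ and $\sum_i u_{w}^i$ terms plus a constant depending only on $w_0$, but I cannot make this match the paper's bound.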

I cannot derive a Taylor series approximation that matches the expression in the paper. In my understanding, since the Taylor expansion is with respect to the parameter $w$, the expansion should contain $w$ and $w_0$ terms; I do not see how the $u_{w}^i$ terms appear in the derivation.