Black-box optimization of an expectation value from a probability-distribution-valued function


So the setup is the following:

I have a probability-distribution-valued function that maps each point in $\mathbb{R}^n$ to a probability density function $f_{\vec{x}}$ on $\mathbb{R}$. (In practice each $f_{\vec{x}}$ is supported on only finitely many points, but since such a distribution can be written as a sum of $\delta$'s, it is easier to just treat $f_{\vec{x}}$ as a density on all of $\mathbb{R}$.)

$$ F : \mathbb{R}^n \ni \vec{x} \mapsto (f_{\vec{x}}:\mathbb{R} \to \mathbb{R}_+) $$

I denote the expectation value of $f_{\vec{x}}$ by $E(\vec{x})$:

$$ E(\vec{x}) := \mathbb{E}(f_{\vec{x}}) = \int_\mathbb{R} y\, f_{\vec{x}} (y) \, \mathrm{d}y $$

For each $\vec{x}$ the probability density $f_{\vec{x}}$ is a black box to me, and I can only learn about it by sampling, which is expensive. $E(\vec{x})$ is a smooth function.
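Since $E(\vec{x})$ is only accessible through samples, the basic primitive any optimizer will need is a Monte Carlo estimate of $E(\vec{x})$ together with its standard error. A minimal sketch (the `sample_f(x, n, rng)` interface is hypothetical and stands in for the expensive black box):

```python
import numpy as np

def estimate_E(sample_f, x, n_samples, seed=None):
    """Monte Carlo estimate of E(x) from the black-box sampler.

    `sample_f(x, n, rng)` is a placeholder for the black box: it draws
    n samples from f_x.  Returns the sample mean and its standard
    error; the latter quantifies how trustworthy the estimate is and
    can drive the choice of n_samples later on.
    """
    rng = np.random.default_rng(seed)
    ys = np.asarray(sample_f(x, n_samples, rng))
    mean = ys.mean()
    stderr = ys.std(ddof=1) / np.sqrt(n_samples)
    return mean, stderr
```

The standard error shrinks like $1/\sqrt{n}$, which is exactly why one wants to avoid large sample counts far from the minimum.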

I wish to minimize $E(\vec{x})$ as a function of $\vec{x}$ while taking as few samples of $F(\vec{x})$ as possible. That means both: I want to try as few different $\vec{x}$ as possible, and for each fixed $\vec{x}$ I want to take as few samples as possible to estimate the expectation value.

It seems clear to me that when I am far from the minimum, in a region with large gradients, I do not need to know $E(\vec{x})$ as accurately as when I am close to it; in such regions I can take fewer samples and still decide in which direction to take the next optimization step. But what is the best optimizer for this kind of problem, and what is a good strategy for deciding how many samples to take from $f_{\vec{x}}$ (probably depending on an estimate of the variance and of the step size in the previous step)?
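One plausible way to formalize this intuition: demand that the standard error of the mean be at most some fraction of the change in $E$ expected along the next step, and solve for the sample count. The sketch below is one heuristic rule, not an established algorithm; all names and the constant `c` are illustrative:

```python
import numpy as np

def choose_n_samples(var_estimate, step_size, grad_norm,
                     n_min=10, n_max=10_000, c=0.5):
    """Heuristic sample count for the next evaluation of E(x).

    Requires the standard error sqrt(var/n) to be at most a fraction
    c of the expected change in E along the next step, approximated
    as step_size * grad_norm.  Large gradients or large steps thus
    permit few samples; near the minimum the count grows until it
    hits the budget cap n_max.
    """
    target_err = max(c * step_size * grad_norm, 1e-12)
    n = int(np.ceil(var_estimate / target_err**2))
    return int(np.clip(n, n_min, n_max))
```

The variance estimate would come from the samples taken at the previous iterate, matching the idea of reusing information from the previous step.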

Does anyone know what I need to search for to find papers on this topic, or know of other resources to look at?

Some background

This kind of problem frequently pops up in variational quantum algorithms, where the objective function is e.g. the energy expectation value of a given Hamiltonian, $\vec{x}$ is a set of parameters used to prepare a state, and $f_{\vec{x}}$ represents the probabilities of finding certain energy eigenstates of the Hamiltonian in the state prepared with the parameters $\vec{x}$.

edit

A similar problem also occurs in machine learning, where computing the cost function on all of the training data is computationally expensive, so the cost function is only evaluated on a subset of the training data. This subset is made bigger as we approach the minimum and the gradients get smaller. An example of this can be found in

http://www.optimization-online.org/DB_FILE/2011/11/3226.pdf
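To make the growing-subset idea concrete, here is an illustrative schedule in that spirit: grow the mini-batch whenever recent gradient norms stop decreasing, taken as a proxy for noise dominating the signal. This is a simplified stand-in; all names and the stall criterion are my own, and a proper variance-based test (as in the linked paper) is the more principled choice:

```python
def batch_size_schedule(grad_norm_history, n0=32, n_max=4096, growth=2):
    """Illustrative growing mini-batch rule.

    Walks through consecutive gradient norms and doubles the batch
    size whenever progress stalls (the norm fails to decrease),
    capped at n_max.  Returns the batch size to use next.
    """
    n = n0
    for prev, curr in zip(grad_norm_history, grad_norm_history[1:]):
        if curr >= prev:          # progress stalled -> average over more data
            n = min(n * growth, n_max)
    return n
```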

Still, my question stands: have similar ideas been proposed for other (gradient-free) optimization algorithms?