Algorithms for Stochastic Continuous Optimization


Question

I have a continuous optimization problem of the form $$ \max_{x \in \mathbb R^n} f(x), $$ where $f:\mathbb R^n \rightarrow \mathbb R$ is mostly smooth and bounded above. The standard approximation algorithms for these kinds of problems, such as basin-hopping or gradient ascent, rely on numerical estimates of the derivatives of $f$. But I can't compute $f$ directly; I can only observe it through noise. Random noise makes the numerical derivatives vary wildly, so I can't use any ordinary derivative-based optimization algorithm.

What are the best optimization algorithms that avoid this problem? I'm considering simulated annealing, which uses no calculus whatsoever. But I have to imagine that there are differential techniques which are robust to noise in the objective function. Does it help if I know/can control the approximate noise scale?

EDIT: Context

I didn't initially say what I'm working on, since it isn't essential to my question, but I'll now add a precise problem statement for the curious. I'm trying to capture as much of a Gaussian-distributed population as I can with a fixed number of unit balls. The objective function is the probability mass of the union of the balls. Formally, the problem is $$ \max_{p_1,\ldots,p_N \in \mathbb{R}^n} P\left\{ \min_{i \in [N]} \|x - p_i\|_2 \leq 1 \right\}, $$ where $x \sim N(\mu, \Sigma)$ is a Gaussian-distributed vector.

So in my case, $f$ is an integral of a Gaussian PDF over a rather ugly domain of integration (a union of balls). My intuition tells me $f$ is non-concave, but continuous everywhere and smooth almost everywhere. I don't think it's possible to compute (or smoothly approximate) $f$ with the efficiency required to support thousands of queries from an optimizer. However, a reasonably good noisy estimate of $f$ can be obtained quickly by sampling a large number of points from $N(\mu, \Sigma)$ and taking the proportion contained in the balls.
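The sampling estimate described above can be sketched as follows; this is a minimal illustration, and the function and parameter names (`coverage_estimate`, `n_samples`) are my own rather than from any library:

```python
import numpy as np

def coverage_estimate(centers, mu, sigma, n_samples=10_000, rng=None):
    """Noisy Monte Carlo estimate of P{ min_i ||x - p_i||_2 <= 1 }
    for x ~ N(mu, sigma). `centers` has shape (N, n)."""
    rng = np.random.default_rng() if rng is None else rng
    # Draw samples from the Gaussian population: shape (n_samples, n).
    x = rng.multivariate_normal(mu, sigma, size=n_samples)
    # Distance from every sample to every ball center: shape (n_samples, N).
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1)
    # A sample is captured if its nearest center is within distance 1.
    return np.mean(dists.min(axis=1) <= 1.0)
```

Each call returns a slightly different value, which is exactly the kind of noisy oracle the question is about; the noise scale shrinks like $1/\sqrt{\text{n\_samples}}$, so it can be controlled by the sample budget.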


Answer

I recommend looking into Simultaneous Perturbation Stochastic Approximation (SPSA).

As it happens, I recently wrote a Python implementation, which you can view on GitHub. The documentation for the main optimize function goes in-depth on the features it provides; I'll highlight the main ones here.

SPSA doubles as both an optimization and a gradient-approximation algorithm. As the name implies, it is designed to handle stochastic problems (see the examples.py on my GitHub). It is similar to gradient descent, except that it estimates the gradient from function evaluations alone: each step perturbs all coordinates simultaneously and uses just two (noisy) evaluations, regardless of dimension.
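To make the two-evaluation gradient estimate concrete, here is a minimal, self-contained SPSA sketch for maximization. This is not the GitHub implementation mentioned above, just an illustration of the core update; the constants `a` and `c` are placeholder tuning parameters:

```python
import numpy as np

def spsa_maximize(f, x0, n_iter=1000, a=0.1, c=0.1, seed=0):
    """Minimal SPSA for maximizing a noisy objective f.
    Each iteration estimates the gradient from only two evaluations."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(1, n_iter + 1):
        ak = a / k ** 0.602          # step-size decay (standard exponent)
        ck = c / k ** 0.101          # perturbation decay (standard exponent)
        delta = rng.choice([-1.0, 1.0], size=x.shape)  # random direction
        # Simultaneous perturbation: two evaluations, any dimension.
        ghat = (f(x + ck * delta) - f(x - ck * delta)) / (2 * ck) * (1.0 / delta)
        x = x + ak * ghat            # ascent step (maximization)
    return x
```

Because each `f` call may be noisy, individual gradient estimates are rough, but the decaying step sizes average the noise out over iterations.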

Regarding your concern about the influence of noise on the gradient approximation: you can use large perturbations to reduce its impact. Typically, perturbations start large and gradually diminish, so that later gradient estimates become more accurate on average. For more information, see the documentation of the optimize function under:

  • Parameters.HyperParameters.Perturbation
  • OtherFeatures.Momentum
  • OtherFeatures.GradientNormalization
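The shrinking-perturbation idea above follows the standard SPSA gain sequences; a small illustration (the constants `a`, `c`, `A` are problem-dependent and purely illustrative, not values from my library):

```python
# Standard SPSA gain-sequence exponents; a, c, A must be tuned per problem.
# A common guideline is to set c roughly to the noise standard deviation.
a, c, A = 0.5, 1.0, 100
alpha, gamma = 0.602, 0.101

def gains(k):
    """Step size and perturbation size at iteration k (1-indexed)."""
    return a / (A + k) ** alpha, c / k ** gamma

# Early iterations use large perturbations, which average out noise over a
# wide region; they shrink slowly so late estimates become more local.
print(gains(1))
print(gains(1000))
```

The slow decay of the perturbation (exponent ~0.1) relative to the step size (exponent ~0.6) is what keeps the gradient estimates noise-robust for a long time.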

I also suggest keeping the cost of each of your function evaluations to a minimum. The example.py shows polynomial regression, where every function evaluation uses a completely random point from the domain, with additional noise on top of that. The gradient is estimated from two perturbed coefficient vectors evaluated at two random points. As noisy as that sounds, the plots show it converges quite well.