quasi-Newton methods with Monte Carlo sampling


I have an optimisation problem of the form $$\text{argmax}_{\theta} \ L(\theta),$$ where $$L(\theta) = \ \mathbb{E}_{X\sim p(\cdot)}\left[f(X,\theta)\right],$$ and where $p(\cdot)$ is a distribution from which it is easy to draw samples (say, a standard Gaussian), $f$ is a deterministic function, and $\theta$ is a vector of $3$-$5$ parameters.

I am using the L-BFGS algorithm (as implemented in PyTorch), with automatic differentiation on top of Monte Carlo sampling to estimate the objective, its gradient (and, implicitly, the Hessian approximation that L-BFGS builds from gradient differences):

$$L(\theta) \approx \frac{1}{n} \sum_{i=1}^n f(x_i,\theta) $$ $$\nabla_\theta L(\theta) \approx \frac{1}{n} \sum_{i=1}^n \nabla_\theta f(x_i,\theta)$$
where $x_1,\ldots,x_n $ are iid samples from $p(\cdot).$

Every 100 iterations (say), I can draw more samples, estimate the variance of the gradient estimator, and recalibrate the number of samples used in the Monte Carlo step. An example rule of thumb: use enough samples that the standard deviation of the gradient estimator is at most 10% of the estimated gradient's magnitude.
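That recalibration can be sketched as follows. The per-sample gradient function `grad_f` here is hypothetical (in practice each $\nabla_\theta f(x_i,\theta)$ comes from autograd); the key fact is that the standard error of the mean shrinks like $1/\sqrt{n}$, so a pilot estimate of the per-sample standard deviation gives the required $n$ in closed form.

```python
import torch

torch.manual_seed(0)

def grad_f(x, theta):
    # Hypothetical per-sample gradient of f w.r.t. theta, shape (n, d).
    return torch.stack([x * torch.cos(x * theta[0]), torch.cos(x)], dim=1)

theta = torch.tensor([0.5, -0.3])
n_pilot = 2000
x = torch.randn(n_pilot)

g = grad_f(x, theta)                    # (n_pilot, d) per-sample gradients
g_mean = g.mean(dim=0)                  # MC estimate of the gradient
se = g.std(dim=0) / n_pilot**0.5        # std error of the mean, ~ sigma/sqrt(n)

# Heuristic: choose n so the standard error is <= 10% of the gradient norm.
# From ||sigma|| / sqrt(n) <= 0.1 * ||g_mean||, solve for n.
target = 0.10 * g_mean.norm()
n_needed = int((g.std(dim=0).norm() / target) ** 2) + 1
```

Because the standard error falls only as $1/\sqrt{n}$, halving the target noise level quadruples the required sample count, which is worth keeping in mind when picking the 10% threshold.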

Are there any such common heuristics? Similarly, are there stopping heuristics for the case where the gradient, the Hessian (and the objective itself) are estimated stochastically?

EDIT: I use 500-2500 samples and don't batch them. The gradient estimates are already quite noisy, so I suspect mini-batching would make this worse.