quasi-Newton methods with Monte Carlo sampling


I have an optimisation problem of the form $$\text{argmax}_{\theta} \ L(\theta),$$ where $$L(\theta) = \ \mathbb{E}_{X\sim p(\cdot)}\left[f(X,\theta)\right],$$ and where $p(\cdot)$ is a distribution from which it is easy to draw samples (say, a standard Gaussian), $f$ is a deterministic function, and $\theta$ is a vector of $3$-$5$ parameters.

I am using the L-BFGS algorithm (as implemented in PyTorch), with automatic differentiation on top of Monte Carlo sampling to estimate the objective, its gradient (and, implicitly, the Hessian approximation that L-BFGS builds from gradient differences):

$$L(\theta) \approx \frac{1}{n} \sum_{i=1}^n f(x_i,\theta) $$ $$\nabla_\theta L(\theta) \approx \frac{1}{n} \sum_{i=1}^n \nabla_\theta f(x_i,\theta)$$
where $x_1,\ldots,x_n $ are iid samples from $p(\cdot).$

Every 100 iterations (say), I can draw more samples, estimate the variance of the gradient estimator, and recalibrate the number of samples used in the Monte Carlo step. An example rule of thumb: use enough samples that the standard deviation of the gradient estimator is at most 10% of the estimated gradient's magnitude.
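That recalibration can be sketched as follows. The per-sample gradient function `grad_f` here is hypothetical (in practice each $\nabla_\theta f(x_i,\theta)$ comes from autograd); the key fact is that the standard error of the mean shrinks like $1/\sqrt{n}$, so a pilot estimate of the per-sample standard deviation gives the required $n$ in closed form.

```python
import torch

torch.manual_seed(0)

def grad_f(x, theta):
    # Hypothetical per-sample gradient of f w.r.t. theta, shape (n, d).
    return torch.stack([x * torch.cos(x * theta[0]), torch.cos(x)], dim=1)

theta = torch.tensor([0.5, -0.3])
n_pilot = 2000
x = torch.randn(n_pilot)

g = grad_f(x, theta)                    # (n_pilot, d) per-sample gradients
g_mean = g.mean(dim=0)                  # MC estimate of the gradient
se = g.std(dim=0) / n_pilot**0.5        # std error of the mean, ~ sigma/sqrt(n)

# Heuristic: choose n so the standard error is <= 10% of the gradient norm.
# From ||sigma|| / sqrt(n) <= 0.1 * ||g_mean||, solve for n.
target = 0.10 * g_mean.norm()
n_needed = int((g.std(dim=0).norm() / target) ** 2) + 1
```

Because the standard error falls only as $1/\sqrt{n}$, halving the target noise level quadruples the required sample count, which is worth keeping in mind when picking the 10% threshold.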

Are there any such common heuristics? Similarly, are there stopping heuristics for the case where the gradient, the Hessian (and the objective itself) are estimated stochastically?

EDIT: I use 500-2500 samples and don't batch them. The gradient estimates are already quite noisy, so I suspect mini-batching would make this worse.