I'm looking for some alternative approaches to online/stochastic gradient descent for online optimization such that
1) there exists a proof that the parameters converge to some compact set;
2) asymptotically the parameters achieve good performance with respect to a given optimality criterion, which may also be a local minimum.
edit: (more details)
The problem in question is a multi-input, single-output identification problem: I have a continuous function $y:H\subset\mathbb{R}^d\to\mathbb R$ of a variable $x\in H$. The goal is to build a good model of $y$ given some samples of the variable $x$ and the corresponding values $y(x)$.
In other words, let $\mathcal M_\theta$ be a class of possible models, in which each member $m_\theta : H \to \mathbb{R}$ (a candidate model for $y$) is indexed by a set of parameters $\theta\in\mathbb{R}^m$. Let $Z_N =\{(y(x_i),x_i)\mid \,x_i\in\mathbb{R}^d\,,i=1,\dots,N\}$ be a set of $N$ measurements, and let $V_{Z_N}:\mathbb{R}^m\to\mathbb R_{\ge 0}$ be a nonnegative function of $\theta$ measuring the fit of the model indexed by $\theta$ to the training set $Z_N$. The goal is then to find the "best" (in the sense of $V_{Z_N}$) model in $\mathcal{M}_\theta$. This model selection problem can be cast as the following optimization problem
$$
\hat{\theta} = \arg\min_{\theta\in\mathbb{R}^m} V_{Z_N}(\theta)
$$
where $V_{Z_N}(\theta)$ can for instance be taken as the mean squared prediction error over the dataset $Z_N$:
$$
V_{Z_N}(\theta) = \dfrac{1}{N} \sum_{i=1}^N \left( y(x_i) - m_\theta(x_i)\right)^2
$$
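To make the batch cost concrete, here is a minimal sketch of $V_{Z_N}$ for a linear-in-parameters model class $m_\theta(x) = \theta^\top \phi(x)$; the feature map `phi` and the toy data are illustrative assumptions, not part of the question:

```python
import numpy as np

def phi(x):
    # hypothetical feature map: affine features [1, x_1, ..., x_d]
    return np.concatenate(([1.0], x))

def V(theta, X, y):
    """Mean squared prediction error of m_theta over the dataset Z_N."""
    preds = np.array([theta @ phi(x) for x in X])
    return np.mean((y - preds) ** 2)

# toy data generated by y(x) = 1 + 2*x_1 - x_2 (noiseless)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1 + 2 * X[:, 0] - X[:, 1]

theta_star = np.array([1.0, 2.0, -1.0])  # parameters matching the toy data
```

For this model class, minimizing $V_{Z_N}$ over $\theta$ is an ordinary least-squares problem when all the data are available at once.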
When the data set $Z_N$ is known in advance, one can use any optimization method; in my case, however, new elements of $Z_N$ arrive "online": my predictor $m_\theta$ lives in a hybrid time domain in which, at given time instants, it receives measurements of an underlying continuous process, and it must learn a good model of that process sample by sample.
At each sampling time a new sample $(y(x_i),x_i)$ arrives, so over an infinite time horizon $Z_N$ can grow indefinitely. For this reason, even if the whole $Z_N$ is known at each time instant, it can be too big for any optimization method that uses all the samples at every step. I'm therefore interested in estimation laws that use only a few of the most recent samples at each step, possibly relying on the previous estimate.
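One classical estimator of exactly this kind (for linear-in-parameters models) is recursive least squares: each update touches only the newest sample and the previous estimate, with all past information compressed into a matrix $P$. A minimal sketch, where the model $m_\theta(x)=\theta^\top\phi(x)$ and the toy regressor stream are my own illustrative assumptions:

```python
import numpy as np

class RLS:
    """Recursive least squares: theta(i) depends only on theta(i-1),
    the matrix P(i-1), and the newest sample."""

    def __init__(self, dim, p0=1e6):
        self.theta = np.zeros(dim)   # current parameter estimate
        self.P = p0 * np.eye(dim)    # "covariance" of the estimate

    def update(self, phi, y):
        """Fold one sample (y, phi) into the estimate."""
        Pphi = self.P @ phi
        k = Pphi / (1.0 + phi @ Pphi)                 # gain vector
        self.theta = self.theta + k * (y - phi @ self.theta)
        self.P = self.P - np.outer(k, Pphi)           # rank-one downdate
        return self.theta

# toy noiseless stream: y_i = theta_true @ phi_i
rng = np.random.default_rng(1)
theta_true = np.array([2.0, -1.0, 0.5])
est = RLS(3)
for _ in range(200):
    phi_i = rng.normal(size=3)
    est.update(phi_i, float(theta_true @ phi_i))
```

For the linear least-squares cost, RLS reproduces the batch minimizer exactly (up to the regularization induced by the initial $P$), which is one way to get the convergence guarantees asked for, at least in the linear case.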
In particular I'm interested in the following problem: given an initial guess $\hat{\theta}(0)$ for the optimal parameters, find a recursion law of the form $\hat{\theta}(i) = f(\hat{\theta}(i-1),y(i))$ which guarantees that the trajectory of $\hat{\theta}(i)$ converges to some compact set $\Theta$, where $\Theta$ is a neighborhood of some local minimum of $V_{Z_N}$, possibly as small as one wants by properly choosing $f(\cdot)$.
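One simple candidate for such an $f$ is a projected online gradient step: projecting every iterate onto a ball of radius $R$ keeps the whole trajectory inside a compact set by construction. A minimal sketch, in which the linear model, the radius `R`, and the step size `LR` are illustrative assumptions:

```python
import numpy as np

R = 10.0    # radius of the compact set Theta (assumed)
LR = 0.05   # constant step size (assumed)

def project_ball(theta):
    """Euclidean projection onto the ball {theta : ||theta|| <= R}."""
    n = np.linalg.norm(theta)
    return theta if n <= R else theta * (R / n)

def f(theta_prev, x, y_x):
    """One update theta(i) = f(theta(i-1), y(i)) for the squared error
    of the linear model m_theta(x) = theta @ x."""
    grad = -2.0 * (y_x - theta_prev @ x) * x   # gradient of the per-sample loss
    return project_ball(theta_prev - LR * grad)

# toy noiseless stream generated by a linear "true" process
rng = np.random.default_rng(2)
theta_true = np.array([2.0, -1.0, 0.5])
theta = np.zeros(3)
for _ in range(500):
    x = rng.normal(size=3)
    theta = f(theta, x, theta_true @ x)
```

This only guarantees boundedness by construction; for convergence of the iterates to a neighborhood of a local minimum one typically needs extra conditions (e.g. diminishing step sizes and persistently exciting data), which is precisely the kind of guarantee I'm asking about.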