I am reading a machine learning paper titled "Protection Against Reconstruction and Its Applications in Private Federated Learning" (https://arxiv.org/pdf/1812.00984.pdf).
Their formal setting is as follows. For a parameter space $\Theta\subseteq \mathbb{R}^d$ and a loss $l: \Theta\times\mathcal{X}\rightarrow \mathbb{R}_{+}$, they wish to solve the risk minimization problem $$\text{minimize}_{\theta\in \Theta}\ L(\theta):=\mathbb{E}[l(\theta, X)].$$ The standard approach to such a problem is to compute the empirical risk minimizer $\hat{\theta}_n=\text{argmin}_{\theta}\frac{1}{n}\sum_{i=1}^n l(\theta, X_i)$.
The most popular update rule is the gradient step: starting from a model $\theta_0$ and with stepsize $\eta$, it applies $$\theta_0-\eta\frac{1}{m}\sum_{j=1}^m\nabla l(\theta_0, x_{i, j})=\text{argmin}_{\theta}\Big\{\frac{1}{m}\sum_{j=1}^m\langle\nabla l(\theta_0, x_{i, j}), \theta-\theta_0\rangle+\frac{1}{2\eta}\|\theta-\theta_0\|_2^2\Big\}$$ My main question is how to derive this identity. Besides, the authors also claim that an alternative is to use stochastic proximal-point-type updates, which replace the linearized loss in the formula above with the loss itself: $$\text{argmin}_{\theta}\Big\{\frac{1}{m}\sum_{j=1}^m l(\theta, x_{i, j})+\frac{1}{2\eta_i}\|\theta-\theta_0\|^2_2\Big\}$$ My other question is how to derive this one.
Any help is appreciated!
I have found lecture notes that answer the first question: (https://www.eecis.udel.edu/~xwu/class/ELEG867/Lecture10.pdf)
Specifically, consider gradient descent $$w^{t+1}=w^{t}-\eta \nabla f(w^{t}).$$ One can interpret this update by looking at the Taylor approximation of $f$. At iteration $t$, the first-order expansion of $f$ around $w^{t}$ is $$f(w)\approx f(w^{t})+\nabla f(w^{t})\cdot (w-w^{t}),$$ and how tight this approximation is depends on how close $w$ is to $w^{t}$. Hence, to minimize $f(w)$, we can jointly minimize the distance between $w$ and $w^t$ and the first-order approximation of $f$ around $w^t$: $$w^{t+1}=\text{argmin}_{w}\underbrace{\frac{1}{2}\|w-w^{t}\|^2+\eta[f(w^t)+\nabla f(w^t)\cdot (w-w^t)]}_{(a)}$$
Setting the derivative of (a) with respect to $w$, namely $w-w^t+\eta\nabla f(w^t)$, equal to zero gives $w=w^t-\eta \nabla f(w^t)$; since (a) is strictly convex in $w$, this stationary point is its unique minimizer. Therefore, from $w^{t+1}=\text{argmin}_w(a)$ we get $w^{t+1}=w^t-\eta\nabla f(w^t)$, which is exactly the gradient descent update.
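To convince myself, here is a small numerical sanity check (my own example, not from the paper or the notes) using a hypothetical quadratic loss: the claimed minimizer $w^t-\eta\nabla f(w^t)$ of objective (a) should beat any randomly perturbed point, since (a) is strictly convex.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((3, 3))
A = G.T @ G + np.eye(3)          # symmetric positive definite Hessian
b = rng.standard_normal(3)

f = lambda w: 0.5 * w @ A @ w - b @ w    # hypothetical smooth loss
grad_f = lambda w: A @ w - b

w_t, eta = rng.standard_normal(3), 0.1

# Objective (a): (1/2)||w - w^t||^2 + eta * [f(w^t) + <grad f(w^t), w - w^t>]
obj = lambda w: 0.5 * (w - w_t) @ (w - w_t) + eta * (f(w_t) + grad_f(w_t) @ (w - w_t))

w_gd = w_t - eta * grad_f(w_t)   # claimed argmin of (a), i.e. the gradient step

# (a) is strictly convex, so w_gd minimizes it iff no other point does better
checks = all(obj(w_gd) <= obj(w_gd + rng.standard_normal(3)) for _ in range(1000))
print(checks)  # True
```

Note that only the $\frac{1}{2}\|w-w^t\|^2$ term is quadratic in $w$; the linearized part of (a) is affine, which is why the minimizer has the simple closed form.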
For the second question: the stochastic proximal-point update keeps the loss itself rather than its linearization, so the two updates are related by the same first-order approximation. Indeed, replacing $f$ in the proximal-point update with its first-order approximation gives $$x_{t+1}=\text{argmin}_{x}\Big\{f(x_t)+\langle\nabla f(x_t), x-x_t\rangle+\frac{1}{2\eta_{t+1}}\|x-x_t\|^2\Big\}$$ Writing the optimality condition, that is, setting the derivative to zero,
one quickly notices that this formula leads back to gradient descent: $$x_{t+1}=x_t-\eta_{t+1}\nabla f(x_t)$$
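To see the relationship concretely, here is a sketch (again my own example, assuming a hypothetical quadratic loss) contrasting the two updates: the proximal-point step minimizes $f(x)+\frac{1}{2\eta}\|x-x_t\|^2$ exactly, which for a quadratic $f$ amounts to solving a small linear system, while the gradient step minimizes the linearized model; for small $\eta$ they agree to first order.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 4))
A = G.T @ G + np.eye(4)              # symmetric positive definite Hessian
b = rng.standard_normal(4)

f = lambda x: 0.5 * x @ A @ x - b @ x    # hypothetical smooth loss
grad_f = lambda x: A @ x - b

x_t, eta = rng.standard_normal(4), 1e-4

# Proximal-point step: argmin_x { f(x) + (1/(2*eta)) ||x - x_t||^2 }.
# Stationarity: grad f(x) + (x - x_t)/eta = 0, which for this quadratic f
# means (eta*A + I) x = eta*b + x_t.
x_prox = np.linalg.solve(eta * A + np.eye(4), eta * b + x_t)

# Gradient-descent step: minimize the linearized model instead.
x_gd = x_t - eta * grad_f(x_t)

gap = np.linalg.norm(x_prox - x_gd)
print(gap)  # small: the two updates differ only at order eta^2
```

The design difference is that the proximal-point step requires solving an inner minimization over the actual loss (cheap here only because $f$ is quadratic), whereas the gradient step needs just one gradient evaluation.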
From: https://epubs.siam.org/doi/pdf/10.1137/1.9781611977066.9