Does it make sense to use Newton-Raphson learning rate in Stochastic Gradient Descent?


Stochastic gradient descent (SGD) updates with learning rate $\eta$ as follows: $$w:=w-\eta \nabla f_i(w),$$ where $f_i(w)$ is the objective function evaluated at a single data point. SGD then iterates over randomly selected data points until convergence, keeping the learning rate constant.
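To make the setup concrete, here is a minimal sketch of that update, assuming a hypothetical one-dimensional least-squares objective $f_i(w) = \tfrac12 (w - x_i)^2$ (so $\nabla f_i(w) = w - x_i$); the function name and data are illustrative, not from the question:

```python
import random

def sgd(data, eta=0.1, steps=1000, seed=0):
    """SGD on f_i(w) = 0.5*(w - x_i)^2 with a constant learning rate eta."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x_i = rng.choice(data)   # one randomly selected data point
        grad = w - x_i           # gradient of f_i at the current w
        w = w - eta * grad       # constant-step SGD update
    return w

data = [1.0, 2.0, 3.0]
w_hat = sgd(data)
# w_hat fluctuates in a neighborhood of the minimizer (the mean of the data)
```

With a constant $\eta$, the iterates do not settle exactly on the minimizer but hover around it, which is part of what motivates the question about choosing the step size more cleverly.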

An advantage of Newton-Raphson is that the step size is determined by the inverse Hessian $H(w)^{-1}$, leading to the update rule $$w:=w-H(w)^{-1} \nabla f(w),$$ where $f(w)$ is usually evaluated across all data points. This typically makes NR more efficient per iteration than gradient descent.
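As a sketch of why the Hessian scaling helps, consider a hypothetical scalar least-squares objective $f(w) = \tfrac12 \sum_i (a_i w - b_i)^2$, for which the Hessian is the constant $\sum_i a_i^2$ and a single Newton-Raphson step lands exactly on the minimizer:

```python
def newton_step(w, data):
    """One Newton-Raphson step for f(w) = 0.5 * sum_i (a_i*w - b_i)^2."""
    grad = sum(a * (a * w - b) for a, b in data)  # full-batch gradient
    hess = sum(a * a for a, _ in data)            # scalar Hessian of f
    return w - grad / hess                        # w := w - H^{-1} grad f(w)

data = [(1.0, 2.0), (2.0, 2.0), (3.0, 6.0)]
w_star = newton_step(0.0, data)
# For a quadratic f, one step reaches the minimizer sum(a*b)/sum(a^2)
```

A second step leaves `w_star` unchanged, since the minimizer is a fixed point of the update; for non-quadratic objectives the step is only locally this accurate.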

It seems it would make most sense to combine both methods into $$w:=w-H(w)^{-1} \nabla f_i(w).$$ Are there strong arguments against doing so, or why do I never see this done (my context is statistics / machine learning)?
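For reference, the proposed combination can be sketched in the same hypothetical least-squares setup as above, scaling the single-point gradient by the inverse of the Hessian of the average objective (all names and data here are illustrative):

```python
import random

def stochastic_newton(data, steps=500, seed=0):
    """Per-point gradients of f_i(w) = 0.5*(a_i*w - b_i)^2, scaled by the
    inverse Hessian of the average objective (constant for a quadratic)."""
    rng = random.Random(seed)
    hess = sum(a * a for a, _ in data) / len(data)  # Hessian of the mean of f_i
    w = 0.0
    for _ in range(steps):
        a, b = rng.choice(data)    # one randomly selected data point
        grad_i = a * (a * w - b)   # gradient of f_i at w
        w = w - grad_i / hess      # proposed update: w := w - H^{-1} grad f_i(w)
    return w

data = [(1.0, 2.0), (2.0, 2.0), (3.0, 6.0)]
w_hat = stochastic_newton(data)
# w_hat stays in a noisy neighborhood of the full minimizer sum(a*b)/sum(a^2)
```

Note that the single-point gradient is still noisy, so the iterates hover around the minimizer rather than converging exactly, much as in constant-step SGD.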