How to implement Newton's method for optimization in a reliable way.


I'm trying to implement Newton's method for optimization. I want to find the local minima of a non-convex function (specifically a negative log likelihood) of $m$ variables where $m\sim 10$.

Here is my formulation:

$\mathbf{x}_t \in \mathbb{R}^m$

$H(\mathbf{x}_t) = \nabla^2 f(\mathbf{x}_t)$ (the Hessian matrix of $f$ evaluated at the point $\mathbf{x}_t$)

$\mathbf{g}(\mathbf{x}_t) = \nabla f(\mathbf{x}_t)$ (the gradient vector of $f$ evaluated at the point $\mathbf{x}_t$)

Update rule:

$$ \mathbf{x}_{t + 1} = \mathbf{x}_t - H^{-1}(\mathbf{x}_t)\,\mathbf{g}(\mathbf{x}_t) $$

Since my function is non-convex, at each step I have to check whether $H(\mathbf{x}_t)$ is positive definite (i.e. whether $f$ is locally convex there) and, only in that case, apply the Newton update.
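For reference, here is a minimal NumPy sketch of the two pieces just described (the function names are mine, chosen for illustration). An attempted Cholesky factorization is a standard positive-definiteness test, and the Newton step is computed by solving the linear system $H\,\mathbf{p} = \mathbf{g}$ rather than forming $H^{-1}$ explicitly, which is numerically preferable:

```python
import numpy as np

def is_positive_definite(H):
    # Cholesky succeeds iff the symmetric matrix H is positive definite;
    # this is cheaper and more robust than computing all eigenvalues.
    try:
        np.linalg.cholesky(H)
        return True
    except np.linalg.LinAlgError:
        return False

def newton_step(H, g):
    # Solve H p = g instead of computing H^{-1} g; for m ~ 10 this is
    # a tiny dense solve either way, but it avoids explicit inversion.
    return np.linalg.solve(H, g)
```

The update is then `x_new = x_old - newton_step(H, g)`.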

My current implementation follows this pseudo-code:

x_new = SOME_INITIAL_POINT
ftol = 1e-5
fdiff = +INF

while (fdiff > ftol)
    x_old = x_new
    g = gradient(x_old)
    H = hessian(x_old)

    if isPositiveDefinite(H)
        x_new = computeNewtonRaphsonUpdate(x_old, H, g)
    else
        x_new = computeGradientDescentUpdate(x_old, g)

    fdiff = abs(f(x_new) - f(x_old))

This implementation kind of works, but with some datasets it is extremely slow: most of the time we end up in the gradient-descent branch, and the learning rates I use are not well suited to every dataset I test the code with. Moreover, there is a huge scale difference between the first parameter (which can take values between $\sim 200$ and $\sim 18000$) and the others (AR coefficients between e.g. $-2.5$ and $2.5$). For instance, a "good" set of learning rates is $50$ for the first parameter and $0.0000005$ for the others. Of course this is not the way to go, since I can't just hardcode these learning rates.

So, given that the computation of both the gradient vector and the Hessian matrix is correct (I verified both against finite-difference approximations), I have two questions:

  • In my current implementation, sometimes even when the eigenvalues of the Hessian are all positive, the update rule gives me a new point $\mathbf{x}_{t+1}$ where $f(\mathbf{x}_{t+1}) > f(\mathbf{x}_{t})$. Can this happen, or is it likely a numerical problem in my implementation (e.g. some value overflowed or underflowed)?

  • Here instead is my main question: I saw in the "Compute Newton's direction" section on Wikipedia that there are ways of adjusting the Hessian so that it is always positive definite. What would be the best way to modify my Hessian, and to handle my scale-related problems, in this specific case? I need reliable results in as few iterations as possible.
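One of the adjustments that Wikipedia section describes is adding a multiple of the identity, $H + \tau I$, with $\tau$ increased until the matrix is positive definite. A sketch of that variant (my own illustrative code, not taken from the article; the parameters `beta` and `max_tries` are arbitrary choices):

```python
import numpy as np

def modified_newton_direction(H, g, beta=1e-3, max_tries=60):
    # Try to factor H + tau * I, doubling tau on failure, then solve
    # (H + tau I) p = -g reusing the Cholesky factor. For large tau the
    # direction approaches a scaled steepest-descent direction.
    m = H.shape[0]
    min_diag = np.min(np.diag(H))
    tau = 0.0 if min_diag > 0 else -min_diag + beta
    for _ in range(max_tries):
        try:
            L = np.linalg.cholesky(H + tau * np.eye(m))
            y = np.linalg.solve(L, -g)       # forward solve L y = -g
            return np.linalg.solve(L.T, y)   # back solve L^T p = y
        except np.linalg.LinAlgError:
            tau = max(2.0 * tau, beta)
    raise RuntimeError("could not make the Hessian positive definite")
```

Because $H + \tau I$ is positive definite, the returned $\mathbf{p}$ always satisfies $\mathbf{g}^\top\mathbf{p} < 0$ (a descent direction) whenever $\mathbf{g}\neq\mathbf{0}$, so the two branches of the loop collapse into one.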

Edit: Apparently, after adding the line-search approach proposed by @LinAlg, most of the time we do end up in the Newton branch; however, since the line search always chooses a step size of e.g. $0.002$, it takes $\sim 800$ iterations to find the local minimum.
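For context, a standard backtracking (Armijo) line search looks roughly like the sketch below (my own illustration, not @LinAlg's exact proposal). A detail that matters for the iteration count: starting from `alpha0=1.0` lets the full Newton step be accepted whenever it already gives sufficient decrease, which is what restores fast local convergence near the minimum.

```python
import numpy as np

def backtracking_line_search(f, x, p, g, alpha0=1.0, rho=0.5, c=1e-4):
    # Armijo condition: accept alpha once
    #   f(x + alpha p) <= f(x) + c * alpha * g^T p,
    # shrinking alpha by the factor rho otherwise.
    # p must be a descent direction, i.e. g^T p < 0.
    alpha = alpha0
    fx = f(x)
    slope = c * (g @ p)
    while f(x + alpha * p) > fx + alpha * slope:
        alpha *= rho
        if alpha < 1e-12:  # give up rather than loop forever
            break
    return alpha
```

If the accepted step size stays pinned at a tiny value like $0.002$ even far into the run, that usually points at a badly scaled direction rather than at the line search itself.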