I'm trying to use finite differences (numerical differentiation) to approximate the gradient and Hessian and compare them with the analytic ones. However, when I set the step size very small, the Hessian approximation blows up and the gradient approximation becomes less precise. Why is a smaller step size not always better? And if it isn't, what would be a reasonable step size for such approximations?
# log-likelihood of a Cauchy sample
mu = 54.2
x = c(37.7, 52.5, 49, 51, 49.9, 49.8, 50.4, 49.8, 49.1, 49.2)
sample_size = length(x)
log_likelihood = function(mu) {
  log_likelihood_value = 0
  for (i in 1:sample_size) {
    log_likelihood_value = log_likelihood_value + log(pi * (1 + (x[i] - mu)^2))
  }
  return(-log_likelihood_value)
}
cat('log_likelihood=', log_likelihood(mu), '\n')
# Newton-Raphson: analytic gradient and Hessian of the log-likelihood
gradient = function(mu) {
  gradient_value = 0
  for (k in 1:sample_size) {
    gradient_value = gradient_value + 2*(x[k] - mu)/(1 + (x[k] - mu)^2)
  }
  return(gradient_value)
}
hessian = function(mu) {
  hessian_value = 0
  for (k in 1:sample_size) {
    hessian_value = hessian_value + (2*(1 + (x[k] - mu)^2) - 4*(x[k] - mu)^2)/(1 + (x[k] - mu)^2)^2
  }
  return(-hessian_value)
}
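As a sanity check on the analytic derivatives above, one could also compare them against a finite-difference routine that chooses its step sizes automatically. A minimal sketch, assuming the numDeriv package is installed and the definitions of x, log_likelihood and gradient from above are in scope:

```r
# Cross-check with numDeriv, which uses Richardson extrapolation
# and selects its step sizes internally.
library(numDeriv)
mu0 = median(x)
cat('grad (numDeriv)    =', numDeriv::grad(log_likelihood, mu0), '\n')
cat('grad (analytic)    =', gradient(mu0), '\n')
# numDeriv::hessian is fully qualified to avoid masking the
# hessian() function defined above.
cat('hessian (numDeriv) =', numDeriv::hessian(log_likelihood, mu0), '\n')
```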
mu = median(x)
N = 5
eps = 1e-13
for (j in 1:N) {
  cat('mu=', mu, '\n')
  cat('grad_approx=', (log_likelihood(mu + eps) - log_likelihood(mu - eps))/(2*eps), '\n')
  cat('grad=', gradient(mu), '\n')
  cat('hessian_approx=', (log_likelihood(mu + eps) - 2*log_likelihood(mu) + log_likelihood(mu - eps))/eps^2, '\n')
  cat('hessian=', hessian(mu), '\n')
  mu = mu - gradient(mu)/hessian(mu)
}
and I obtain the following output:
mu= 49.8
grad_approx= -0.2309264
grad= -0.2463615
hessian_approx= 1.065814e+12
hessian= -7.707962
mu= 49.76804
grad_approx= -0.01776357
grad= 0.001049172
hessian_approx= 355271367880
hessian= -7.770681
mu= 49.76817
grad_approx= -0.01776357
grad= 1.540493e-08
hessian_approx= 355271367880
hessian= -7.770453
mu= 49.76817
grad_approx= 0.03552714
grad= 1.24345e-14
hessian_approx= 710542735760
hessian= -7.770453
mu= 49.76817
grad_approx= 0.03552714
grad= 1.24345e-14
hessian_approx= 710542735760
hessian= -7.770453
Evaluating a function in floating point usually incurs a relative error that is a small random multiple of the machine epsilon (usually denoted $\mu$, but let's use $\delta$ here to avoid a clash with the mean). The error of the central divided difference for the second derivative with step size $\varepsilon$ is then bounded by an expression of the form $$ \frac{4|f(x)|\delta}{\varepsilon^2}+\frac{|f^{(4)}(x)|\varepsilon^2}{12}. $$ As you can see, this is large not only for large $\varepsilon$ but also for very small $\varepsilon$: the first term is the amplified rounding error, the second the truncation error of the formula. The bound is minimal around the point where both terms are equal; assuming that the function value and the fourth derivative are of comparable magnitude, this minimum is found around $\varepsilon=\sqrt[4]{\delta}$, which for 64-bit double floats ($\delta\approx 2\cdot 10^{-16}$) gives $\varepsilon\approx 10^{-4}$.
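You can see this U-shaped error directly with a small sweep over step sizes. A sketch (using $f(x)=e^x$ at $x=1$ as a stand-in test function, where the exact second derivative is $e$, rather than your log-likelihood):

```r
# Error of the central second difference as the step size shrinks:
# it first decreases (truncation error) and then blows up (rounding error).
f = function(x) exp(x)
exact = exp(1)  # exact second derivative of exp at x = 1
for (eps in 10^-(1:8)) {
  approx = (f(1 + eps) - 2 * f(1) + f(1 - eps)) / eps^2
  cat(sprintf('eps = %.0e   relative error = %.2e\n',
              eps, abs(approx - exact) / exact))
}
# the predicted optimum, delta^(1/4):
cat('optimal eps ~', .Machine$double.eps^(1/4), '\n')
```

The relative error should bottom out near eps = 1e-4 and grow rapidly for smaller steps, which is exactly the blow-up you observed with eps = 1e-13.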
It might seem counter-intuitive, but once floating-point errors are taken into account, one finds that the higher the order of accuracy of the formula, the larger the step size $\varepsilon$ at which the error bound is minimal.
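For example, the fourth-order five-point stencil for the second derivative has truncation error $O(\varepsilon^4)$, so balancing it against the $\delta/\varepsilon^2$ rounding term gives an optimal step around $\varepsilon\approx\sqrt[6]{\delta}\approx 2\cdot 10^{-3}$, noticeably larger than for the three-point formula. A sketch comparing the two on the same test function as above:

```r
# Compare the 3-point (second-order) and 5-point (fourth-order)
# central differences for the second derivative of exp at x = 1.
f = function(x) exp(x)
exact = exp(1)
d2_3pt = function(h) (f(1 + h) - 2 * f(1) + f(1 - h)) / h^2
d2_5pt = function(h) (-f(1 + 2*h) + 16*f(1 + h) - 30*f(1)
                      + 16*f(1 - h) - f(1 - 2*h)) / (12 * h^2)
for (h in c(1e-2, 1e-3, 1e-4, 1e-5)) {
  cat(sprintf('h = %.0e   3-point error = %.1e   5-point error = %.1e\n',
              h, abs(d2_3pt(h) - exact), abs(d2_5pt(h) - exact)))
}
```

The five-point formula reaches a smaller minimal error, and it does so at a larger step size.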