I was testing my analytical solution to the following term w.r.t the parameter $k$ and $\mu$.
$$ \mathcal{L} = \color{green}{- \eta_n\log\left(1-P(t_n)\right)} $$
$P(t_n)$ is the cdf of the Inverse Gaussian distribution
$$ P(t) =Pr(X< t)= \Phi\left(\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}-1\right)\right)+e^{\frac{2k}{\mu}}\Phi\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right) $$ Where $\Phi$ is the standard normal distribution cdf.
However, by comparing my solution with an approximated gradient given by scipy.optimize.approx_fprime I realised that for the same point $[k_0,\mu_0]$ the derivatives are wrong, specifically for $\mu$ it seems that I forgot a $\times 2$ factor, while for $k$ both the sign and the value are different. (I give more details at the end of the question)
Here are my notes:
(Note that I computed everything in C with mpfr, so there shouldn't be any numerical error.
$$ \frac{\partial}{\partial P(t_n)} \left(-\eta_n\log\left(1-P(t_n)\right)\right) = \\ \frac{\partial \left(-\eta_n\log\left(1-P(t_n)\right)\right)}{\partial \left(1-P(t_n)\right)} \cdot \frac{\partial \left(1-P(t_n)\right)}{\partial P(t_n)} =\\ -\frac{\eta_n}{1-P(t_n)}\cdot(-1)=\frac{\eta_n}{1-P(t_n)} $$
$k$
I rewrite here the definition of $P(t)$ to make the notes more readable:
$$ P(t) =Pr(X< t)= \Phi\left(\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}-1\right)\right)+e^{\frac{2k}{\mu}}\Phi\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right) $$
$$ \frac{\partial}{\partial k}P(t)=\\ \Phi^{'}\left(\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}-1\right)\right) \color{blue}{\frac{\partial}{\partial k}\left(\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}-1\right)\right)} + \\ \color{green}{\frac{\partial}{\partial k}\left(\frac{2k}{\mu}\right)}e^{\frac{2k}{\mu}}\Phi\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right) + \\ \Phi^{'}\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right) \color{purple}{\frac{\partial}{\partial k}\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right)} e^{\frac{2k}{\mu}} \\= \\ \Phi^{'}\left(\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}-1\right)\right) \color{blue}{\left(\frac{1}{2t\sqrt{\frac{k}{t}}}\left(\frac{t}{\mu}-1\right)\right)} +\\ \color{green}{\frac{2}{\mu}}e^{\frac{2k}{\mu}}\Phi\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right) + \\ \Phi^{'}\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right) \color{purple}{ \left(-\frac{1}{2t \sqrt{\frac{k}{t}}}\left(\frac{t}{\mu}+1\right)\right)} e^{\frac{2k}{\mu}} $$
Which can be rewritten as $$ \frac{\partial}{\partial k}P(t)= \\ \Phi^{'}\left(\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}-1\right)\right) \color{blue}{\left(\frac{1}{2t\sqrt{\frac{k}{t}}}\left(\frac{t}{\mu}-1\right)\right)} +\\ \color{green}{\frac{2}{\mu}}e^{\frac{2k}{\mu}+\log\left(\Phi\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right)\right)} + \\ \color{purple}{ \left(-\frac{1}{2t \sqrt{\frac{k}{t}}}\left(\frac{t}{\mu}+1\right)\right)} e^{\frac{2k}{\mu} + \log\left( \Phi^{'}\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right)\right)} $$
We prefer this formulation since computing exp(2k/mu) alone would be impracticable since the values for $k$ may be greater than $10^3$, moreover there exists valid approximations or exact formulations for the logcdf and logpdf of the normal distribution.
$$ \frac{\partial}{\partial \mu}P(t)=\\ \Phi^{'}\left(\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}-1\right)\right) \color{blue}{\frac{\partial}{\partial \mu}\left(\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}-1\right)\right)} + \\ \color{green}{\frac{\partial}{\partial \mu}\left(\frac{2k}{\mu}\right)}e^{\frac{2k}{\mu}}\Phi\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right) + \\ \Phi^{'}\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right) \color{purple}{\frac{\partial}{\partial \mu}\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right)} e^{\frac{2k}{\mu}} \\ = \\ \Phi^{'}\left(\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}-1\right)\right) \color{blue}{\left(-\sqrt{\frac{k}{t}}\cdot \frac{t}{\mu^2}\right)} +\\ \color{green}{\left(-\frac{2k}{\mu^2}\right)}e^{\frac{2k}{\mu}}\Phi\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right) +\\ \Phi^{'}\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right) \color{purple}{\left(\sqrt{\frac{k}{t}}\cdot\frac{t}{\mu^2}\right)} e^{\frac{2k}{\mu}} $$
Which can be rewritten as
$$ \frac{\partial}{\partial \mu}P(t)= \\ \Phi^{'}\left(\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}-1\right)\right) \color{blue}{\left(-\sqrt{\frac{k}{t}}\cdot \frac{t}{\mu^2}\right)} +\\ \color{green}{\left(-\frac{2k}{\mu^2}\right)}e^{\frac{2k}{\mu}+\log\left(\Phi\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right)\right)} +\\ \color{purple}{\left(\sqrt{\frac{k}{t}}\frac{t}{\mu^2}\right)} e^{\frac{2k}{\mu}+\log\left(\Phi^{'}\left(-\sqrt{\frac{k}{t}}\left(\frac{t}{\mu}+1\right)\right)\right)} $$
Then, we can add to the gradient the following terms
$$ \frac{\partial}{\partial k} \left(-\eta_n\log\left(1-P(t_n)\right)\right) = \frac{\eta_n}{1-P(t_n)}\cdot\frac{\partial P(t_n)}{\partial k}\\ \frac{\partial}{\partial \mu} \left(-\eta_n\log\left(1-P(t_n)\right)\right) = \frac{\eta_n}{1-P(t_n)}\cdot\frac{\partial P(t_n)}{\partial \mu} $$
Actually in order to estimate $\mu$ I use an AR model of order $p$ (the AR coefficients are represented by the $\theta$ vector), so I have
$$ \mu=\theta_0\cdot1+\sum_{j=1}^{p}\theta_j\cdot x_j $$
And, by defining, $x_t = [1,x_1,...,x_p]$, the derivatives w.r.t. $theta$ are:
$$ \frac{\partial}{\partial \theta_j} \left(-\eta_n\log\left(1-P(t_n)\right)\right) = \frac{\eta_n}{1-P(t_n)}\cdot\frac{\partial P(t_n)}{\partial \mu}\cdot\frac{\partial \mu}{\partial \theta_j}\\ \frac{\eta_n}{1-P(t_n)}\cdot\frac{\partial P(t_n)}{\partial \mu}\cdot x_j $$
with $x_0 = 1$.
For the starting point
k = 1500.0
theta = [0.85, 0.4, -0.45, 0.1]
Given xt = [1., 0.93, 0.953, 0.953] we have:
(Pdb) mpfr_gradient
array([ 2.20329987e-03, -7.46972244e+00, -6.94684187e+00, -7.11864549e+00,
-7.11864549e+00])
(Pdb) approximated_gradient
array([-6.56127930e-05, -3.66268166e+00, -3.40628900e+00, -3.49053237e+00,
-3.49052827e+00])
The approximated gradient was computed with scipy.optimize.approx_fprime with a step size of 1e-10
Ok I solved it, the derivatives are right. My approximation for the
logcdfof the normal distribution wasn't just accurate enough.