I was reading the following Wikipedia page https://en.wikipedia.org/wiki/Scoring_algorithm under the "Sketch of Derivation" section:
Let $Y_1,\ldots,Y_n$ be random variables, independent and identically distributed with twice differentiable p.d.f. $f(y; \theta)$, and we wish to calculate the maximum likelihood estimator (M.L.E.) $\theta^*$ of $\theta$. First, suppose we have a starting point for our algorithm $\theta_0$, and consider a Taylor expansion of the score function, $V(\theta)$, about $\theta_0$:
$$V(\theta) \approx V(\theta_0) - \mathcal{J}(\theta_0)(\theta - \theta_0)$$
where
$$\mathcal{J}(\theta_0) = -\sum_{i=1}^n \left. \nabla \nabla^\top \right|_{\theta=\theta_0} \log f(Y_i; \theta)$$
is the observed information matrix at $\theta_0$.
Now, setting $\theta = \theta^*$, using that $V(\theta^*) = 0$ and rearranging gives us:
$$\theta^* \approx \theta_{0} + \mathcal{J}^{-1}(\theta_{0})V(\theta_{0}).$$
We therefore use the algorithm
$$\theta_{m+1} = \theta_{m} + \mathcal{J}^{-1}(\theta_{m})V(\theta_{m})$$
and under certain regularity conditions, it can be shown that $\theta_m \rightarrow \theta^*$.
My Question: I am trying to learn about the "regularity conditions" as well as the general proof required to prove the following statement:
And under certain regularity conditions, it can be shown that $\theta_m \rightarrow \theta^*$.
I have consulted several sources, but I could not find a clear statement of these regularity conditions, nor a mathematical proof of this convergence property.
Can someone please help me understand this?
Thanks!
- Note: I understand that $\theta^* \rightarrow \theta$, i.e. the classical consistency property of the MLE: the estimator converges to the true value as the sample size approaches infinity. What I am trying to understand is why $\theta_m$ (a numerical approximation of $\theta^*$) approaches $\theta^*$ as the number of iterations $m$ becomes large.
The pattern behind this is Newton's method (for optimization). You start with $$ \phi(\theta)=\sum_{i=1}^n\ln f(Y_i;\theta)\approx \phi(\theta_0)+\phi'(\theta_0)(\theta-\theta_0)+\frac12\phi''(\theta_0)[\theta-\theta_0,\theta-\theta_0]+O(\|\theta-\theta_0\|^3) $$ Being able to write a big-$O$ remainder there at all is already one regularity condition (it requires sufficient smoothness of $\theta\mapsto\log f(y;\theta)$).
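As a concrete sketch of the resulting iteration $\theta_{m+1} = \theta_m + \mathcal{J}^{-1}(\theta_m)V(\theta_m)$ (my own illustration, not part of the original derivation), consider an exponential model $f(y;\theta) = \theta e^{-\theta y}$, where the score and observed information have closed forms and the MLE is known to be $1/\bar{y}$:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=1000)  # simulated data, true rate theta = 1
n, s = len(y), y.sum()

# For f(y; theta) = theta * exp(-theta * y):
#   score:                V(theta) = n/theta - sum(y_i)
#   observed information: J(theta) = n / theta**2
V = lambda theta: n / theta - s
J = lambda theta: n / theta**2

theta = 0.5  # starting point theta_0, assumed close enough to the MLE
for m in range(30):
    step = V(theta) / J(theta)   # the scoring step J^{-1}(theta_m) V(theta_m)
    theta += step
    if abs(step) < 1e-12:        # stop once the step is negligible
        break

print(theta, n / s)  # the iterates converge to the closed-form MLE 1/ybar
```

Here the iterates reach the closed-form MLE in a handful of steps; with a starting point too far from $\theta^*$ (for this model, $\theta_0 > 2/\bar{y}$) the same iteration diverges, which already hints at the "sufficiently good starting point" condition below.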
At an extremum of $\phi$ its derivative or gradient is zero: $$ 0=V(\theta^*)=\nabla\phi(\theta^*)=\phi'(\theta^*)^\top,\\ V(\theta)=V(\theta_0)+V'(\theta_0)(\theta-\theta_0)+O(\|\theta-\theta_0\|^2). $$ Extracting the derivative of $V$, i.e. the Hessian matrix $J=-V'$ of $\phi$, from the bilinear form of the second derivative of $\phi$ requires a little linear-algebra bricolage: a transposition in one argument. Among other things, one has to recall what $\nabla$ stands for. Usually it is the gradient operator, $\nabla \phi(\theta)=\phi'(\theta)^\top$. That seems to be the case here too, so $\nabla\nabla^\top$ stands for the matrix of all (mixed) second-derivative operators.
Thus solving $$ 0=V(\theta_0)-J(\theta_0)(\theta-\theta_0) $$ is the same as solving $$ 0=V(\theta_0)+V'(\theta_0)(\theta-\theta_0), $$ which defines the Newton step. One level back, it is the same as finding the extremal point of the quadratic model $$ \phi(\theta_0)+\phi'(\theta_0)(\theta-\theta_0)+\frac12\phi''(\theta_0)[\theta-\theta_0,\theta-\theta_0]. $$ Thus you also inherit the convergence conditions of Newton's method as part of the regularity conditions: typically that $\phi$ is twice continuously differentiable near $\theta^*$ with a Lipschitz Hessian, that $J(\theta^*)$ is nonsingular (positive definite at a maximum), and that the starting point $\theta_0$ is sufficiently close to $\theta^*$. These may be expressed in terms of $\phi$ or of $f$, and may simplify somewhat in that form.
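To see one of these conditions at work, here is a small sketch (my own illustration, with hypothetical test functions): the Newton update converges superlinearly when the Hessian at the minimizer is nonsingular, but only linearly when it degenerates there.

```python
import numpy as np

def newton_path(grad, hess, theta0, iters):
    """Record the Newton iterates theta <- theta - hess(theta)^{-1} grad(theta)."""
    path = [theta0]
    for _ in range(iters):
        path.append(path[-1] - grad(path[-1]) / hess(path[-1]))
    return path

# Regular case: phi(t) = t**2 + cos(t), minimized at t = 0 with phi''(0) = 1 > 0.
good = newton_path(lambda t: 2*t - np.sin(t), lambda t: 2 - np.cos(t), 1.0, 6)

# Degenerate case: phi(t) = t**4, minimized at t = 0 but phi''(0) = 0,
# so the Newton step shrinks the error only by the constant factor 2/3.
bad = newton_path(lambda t: 4*t**3, lambda t: 12*t**2, 1.0, 6)

print(good[-1])  # essentially 0 after 6 steps: superlinear convergence
print(bad[-1])   # about (2/3)**6 ~ 0.088: only linear convergence
```

The second case violates the nonsingular-Hessian condition above, and the iteration still converges, but it loses the fast local convergence that makes Newton-type methods (and the scoring algorithm) attractive.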