I am reading Numerical Optimization by Nocedal & Wright, and I am having trouble understanding some aspects of the proof of Theorem $3.7$. I have been stuck on this theorem for many hours, so any help is greatly appreciated.
There are two things I don't understand:
The theorem is an iff statement. The author proves one direction, but I don't see how to prove the reverse.
The author seems to use the assumption that the Hessian is Lipschitz, but this is not an explicit assumption of the theorem. Is this a mistake by the author? (I checked the errata, and it wasn't listed there.)
The following are several lines the author references in the proof. The theorem and the proof follow.
$$\|x_k + p_k^N - x^*\| \le L\|x_k - x^*\|^2 \tag{3.33}$$ (The above is where my point #2 comes from. This inequality was derived in the proof of an earlier theorem about quadratic convergence of Newton's Method, and in that theorem we had a hypothesis that the Hessian is Lipschitz, which was used to prove the above inequality.) $$p_k = -B_k^{-1} \nabla f_k \tag{3.34}$$ where $B_k$ is symmetric and positive definite, $$\lim_{k \to \infty} \frac{\|(B_k - H_f(x^*))p_k\|}{\|p_k\|} = 0 \tag{3.36}$$
Theorem $\textbf{3.7}$: Suppose that $f:\mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable. Consider the iteration $x_{k+1} = x_k + p_k$ (that is, the step length $\alpha_k$ is uniformly $1$), where $p_k$ is given by $(3.34)$. Let us assume also that $(x_k)$ converges to a point $x^*$ such that $\nabla f(x^*) = 0$ and $H_f(x^*)$ is positive definite. Then $(x_k)$ converges superlinearly if and only if $(3.36)$ holds.
Proof: We first show that $(3.36)$ is equivalent to $$p_k - p_k^N = o(\|p_k\|) \tag{3.37}$$ where $p_k^N = - H_f(x_k)^{-1} \nabla f_k$ is the Newton step. Assuming $(3.36)$ holds, we have that \begin{align*} p_k - p_k^N & = H_{f}(x_k)^{-1}(H_f(x_k)p_k + \nabla f_k)\\ &= H_{f}(x_k)^{-1}(H_{f}(x_k) - B_k)p_k\\ &= O(\|(H_f(x_k) - B_k)p_k\|)\\ &= o(\|p_k\|) \end{align*} where we have used the fact that $\|H_f(x_k)^{-1}\|$ is bounded above for $x_k$ sufficiently close to $x^*$, since the limiting Hessian $H_f(x^*)$ is positive definite. The converse follows readily if we multiply both sides of $(3.37)$ by $H_f(x_k)$ and recall $(3.34)$.
By combining $(3.33)$ and $(3.37)$, we obtain that $$\|x_k+p_k-x^*\| \le\|x_k+p_k^N-x^*\|+\|p_k-p_k^N\|=O(\|x_{k}-x^*\|^2)+o(\|p_k\|).$$ A simple manipulation of this inequality reveals that $\|p_k\| = O(\|x_k - x^*\|),$ so we obtain $$\|x_k+p_k-x^*\| \le o(\|x_k-x^*\|),$$ giving the superlinear convergence result.
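To make the dichotomy in the theorem concrete for myself, I tried a small numerical sketch (a toy example of my own, not from the book): on $f(x) = x^2/2 + x^4/4$ with minimizer $x^* = 0$ and $f''(0) = 1$, I compare the constant model $B_k \equiv 1 = f''(x^*)$, which satisfies $(3.36)$ trivially, with $B_k \equiv 2$, which violates it.

```python
# Toy illustration of Theorem 3.7 (my own example, not from the book):
# f(x) = x^2/2 + x^4/4, minimizer x* = 0, f''(0) = 1.
# Iteration: x_{k+1} = x_k - grad(x_k)/B with a fixed scalar "Hessian model" B.
def grad(x):
    return x + x**3

def run(B, x0=0.5, iters=8):
    errs = [abs(x0)]
    x = x0
    for _ in range(iters):
        x = x - grad(x) / B
        errs.append(abs(x))
    # ratios |x_{k+1} - x*| / |x_k - x*|; they tend to 0 iff convergence
    # is superlinear
    return [e1 / e0 for e0, e1 in zip(errs, errs[1:]) if e0 > 0]

ratios_good = run(1.0)  # B = f''(x*): satisfies (3.36), ratios -> 0
ratios_bad = run(2.0)   # B = 2 f''(x*): violates (3.36), ratios -> 1/2
print(ratios_good)
print(ratios_bad)
```

With $B = 1$ the ratios collapse rapidly toward $0$ (superlinear), while with $B = 2$ they settle near $1/2$ (merely linear), exactly the behavior the iff statement predicts.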
The proof of this theorem does have a few places that need to be patched up. For Question 2, the proof uses (3.33), but it only needs the weaker form \begin{gather*} x_k + p^N_k - x^* = o(\|x_k - x^*\|) \quad (k \to \infty). \tag{3.33$'$} \end{gather*} Indeed, by the Taylor expansion, $$ \nabla f(x) = \nabla f(x^*) + H_f(x^*) (x - x^*) + o(\|x - x^*\|) \quad (x \to x^*). $$ Note that $\nabla f(x^*) = \boldsymbol{0}$, $p^N_k = -(H_f(x_k))^{-1} \nabla f(x_k)$, and $\|(H_f(x))^{-1}\|$ is bounded in a neighborhood of $x^*$. Since $(H_f(x_k))^{-1} H_f(x^*) \to I$ as $x_k \to x^*$,\begin{align*} x_k + p^N_k - x^* &= (x_k - x^*) - (H_f(x_k))^{-1} \nabla f(x_k)\\ &= (x_k - x^*) - (H_f(x_k))^{-1} H_f(x^*) (x_k - x^*) + o(\|x_k - x^*\|)\\ &= \bigl(I - (H_f(x_k))^{-1} H_f(x^*)\bigr) (x_k - x^*) + o(\|x_k - x^*\|)\\ &= o(\|x_k - x^*\|) \quad (k \to \infty). \end{align*} This argument uses only continuity of the Hessian at $x^*$, not Lipschitz continuity, so the extra hypothesis is not needed for superlinear (as opposed to quadratic) convergence.
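As a numerical sanity check of (3.33$'$) (my own toy example, not from the book), take $f(x) = x^2/2 + \tfrac{4}{15}|x|^{5/2}$, so $f''(x) = 1 + |x|^{1/2}$ is continuous but not Lipschitz at $x^* = 0$ (only Hölder-$\tfrac12$ there). Pure Newton still converges superlinearly, just not quadratically:

```python
import math

# f(x) = x^2/2 + (4/15)|x|^(5/2): twice continuously differentiable, but its
# Hessian f''(x) = 1 + sqrt(|x|) is NOT Lipschitz at the minimizer x* = 0.
def grad(x):
    return x + (2.0 / 3.0) * math.copysign(abs(x) ** 1.5, x)

def hess(x):
    return 1.0 + math.sqrt(abs(x))

x = 1.0
ratios = []
for _ in range(6):
    x_new = x - grad(x) / hess(x)           # pure Newton step, alpha_k = 1
    if x != 0:
        ratios.append(abs(x_new) / abs(x))  # -> 0: superlinear convergence
    x = x_new
print(ratios)
```

The ratios $|x_{k+1} - x^*| / |x_k - x^*|$ decrease monotonically to $0$, consistent with (3.33$'$): superlinear convergence survives even though the constant $L$ in (3.33) does not exist for this $f$.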
Now for Question 1: since (3.36) and (3.37) are equivalent, it suffices to prove that$$ x_k + p_k - x^* = o(\|x_k - x^*\|) \ (k \to \infty) \Longleftrightarrow p_k - p^N_k = o(\|p_k\|) \ (k \to \infty). $$ First note that (3.33$'$) can be rewritten as $p^N_k = -(x_k - x^*) + o(\|x_k - x^*\|)$, which gives $\|p^N_k\| \sim \|x_k - x^*\|$ ($k \to \infty$); this is used in both directions below. On the one hand, if $p_k - p^N_k = o(\|p_k\|)$ ($k \to \infty$), then$$ \|p_k\| - \|p_k - p^N_k\| \leqslant \|p^N_k\| \leqslant \|p_k\| + \|p_k - p^N_k\| $$ implies that $\|p^N_k\| \sim \|p_k\|$ ($k \to \infty$), so\begin{gather*} p_k - p^N_k = o(\|p_k\|) = o(\|p^N_k\|) = o(\|x_k - x^*\|) \quad (k \to \infty). \end{gather*} Combining this with $\|x_k + p_k - x^*\| \leqslant \| x_k + p^N_k - x^*\| + \|p_k - p^N_k\|$ and (3.33$'$) yields$$ x_k + p_k - x^* = o(\|x_k - x^*\|) \quad (k \to \infty). $$ On the other hand, if $x_k + p_k - x^* = o(\|x_k - x^*\|)$ ($k \to \infty$), then$$ \|p_k - p^N_k\| \leqslant \|x_k + p_k - x^*\| + \| x_k + p^N_k - x^*\| $$ and (3.33$'$) imply that$$ \|p_k - p^N_k\| = o(\|x_k - x^*\|) = o(\|p^N_k\|) \quad (k \to \infty). $$ Combining this with$$ \|p^N_k\| - \|p_k - p^N_k\| \leqslant \|p_k\| \leqslant \|p^N_k\| + \|p_k - p^N_k\| $$ yields $\|p_k\| \sim \|p^N_k\|$ ($k \to \infty$), and hence$$ p_k - p^N_k = o(\|p^N_k\|) = o(\|p_k\|) \quad (k \to \infty). $$
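To connect both questions numerically (again my own illustration, not from the book): for the 1-D secant method on $f(x) = x^2/2 + x^4/4$, where $B_k$ is the secant slope $(\nabla f_k - \nabla f_{k-1})/(x_k - x_{k-1})$, one can watch the Dennis–Moré quotient in (3.36) and the convergence ratio go to zero together, as the theorem says they must:

```python
def grad(x):
    return x + x**3  # f(x) = x^2/2 + x^4/4, minimizer x* = 0, f''(0) = 1

x_prev, x = 1.0, 0.9
for _ in range(7):
    B = (grad(x) - grad(x_prev)) / (x - x_prev)  # 1-D secant model "B_k"
    p = -grad(x) / B                             # quasi-Newton step (3.34)
    dm = abs((B - 1.0) * p) / abs(p)             # Dennis-More quotient (3.36)
    conv = abs(x + p) / abs(x)                   # |x_{k+1} - x*| / |x_k - x*|
    print(f"Dennis-More: {dm:.2e}   convergence ratio: {conv:.2e}")
    x_prev, x = x, x + p
```

Both printed columns tend to $0$: the secant model drifts toward $f''(x^*)$ along the step directions, and the iterates converge superlinearly, just as the equivalence proved above predicts.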