I am reading this seminal paper on mixture models by Jordan and Xu. Even though the paper is about mixture models, the results derived in Section 5 apply to the convergence analysis of general gradient-ascent-based maximisation.
I am unable to figure out some seemingly straightforward matrix calculations in equation 13 and the analysis that follows.
Let $\Theta^*$ be the optimal parameters corresponding to the maximum likelihood, let $H(\Theta^*)$ be the Hessian of the log-likelihood $L$ at $\Theta^*$, and let $\eta$ be the learning rate. Expanding the gradient around $\Theta^*$ (where it vanishes), the gradient-ascent update becomes $\Theta^{(k+1)} = \Theta^{(k)} + \eta \left.\frac{\partial L(\Theta)}{\partial \Theta}\right|_{\Theta^{(k)}} \approx \Theta^{(k)} + \eta H(\Theta^*)\left(\Theta^{(k)}-\Theta^*\right)$, so that $\Theta^{(k+1)}-\Theta^* \approx \left(I+\eta H(\Theta^*)\right)\left(\Theta^{(k)}-\Theta^*\right)$; applying the Cauchy-Schwarz inequality then gives $$ \left\|\Theta^{(k+1)}-\Theta^{*}\right\| \leq \left\|I+\eta H\left(\Theta^{*}\right)\right\| \left\|\Theta^{(k)}-\Theta^{*}\right\| \quad \quad (1) $$
The paper further says that $$ \left\|I+\eta H\left(\Theta^{*}\right)\right\| \leq \lambda_{M}\left[I+\eta H\left(\Theta^{*}\right)\right]=r, $$ where $\lambda_M[\cdot]$ and $\lambda_m[\cdot]$ denote the largest and smallest eigenvalues of a matrix, respectively.
Next, the paper claims (and this is the part I have trouble understanding) that $$ r=\max \left\{\left|1-\eta \lambda_{M}\left[-H\left(\Theta^{*}\right)\right]\right|, \quad\left|1-\eta \lambda_{m}\left[-H\left(\Theta^{*}\right)\right]\right|\right\} \quad \quad (2) $$ Further, to achieve convergence we need $r<1$, i.e. $$ 0<\eta<2 / \lambda_{M}\left[-H\left(\Theta^{*}\right)\right] \quad \quad (3) $$
Can you please help me derive equations (2) and (3)?
Further, it is mentioned that the minimum possible value of $r$ is obtained when $\eta = \frac{1}{\lambda_M[-H(\Theta^*)]}$, with
$$ \begin{aligned} r_{\min } &=1-\lambda_{m}\left[-H\left(\Theta^{*}\right)\right] / \lambda_{M}\left[-H\left(\Theta^{*}\right)\right] \\ & \equiv 1-\kappa^{-1}\left[-H\left(\Theta^{*}\right)\right] \end{aligned} $$
where $\kappa[A]=\lambda_{M}[A] / \lambda_{m}[A]$ is the condition number. (I write $-H(\Theta^*)$ throughout, matching eq. (2): since $\Theta^*$ is a maximum, $-H(\Theta^*)$ is positive definite and its eigenvalues are positive.)
Assuming equations (2) and (3), I tried to derive the $\eta$ corresponding to $r_{min}$. I got $\eta = \frac{2}{\lambda_{m}[-H] + \lambda_{M}[-H]}$, which gives $r_{min} = \frac{\kappa - 1}{\kappa + 1}$; this is clearly lower than $1-\kappa^{-1} = \frac{\kappa - 1}{\kappa}$. So, is the $r_{min}$ derived in the paper wrong?
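To sanity-check my algebra, here is a quick pure-Python evaluation of $r(\eta)=\max\{|1-\eta\lambda_M|,\,|1-\eta\lambda_m|\}$ from equation (2), using made-up eigenvalues $\lambda_m[-H]=1$, $\lambda_M[-H]=4$ (so $\kappa = 4$):

```python
# r(eta) from equation (2), using hypothetical eigenvalues of -H(Theta*)
lam_m, lam_M = 1.0, 4.0   # condition number kappa = 4

def r(eta):
    return max(abs(1 - eta * lam_M), abs(1 - eta * lam_m))

print(r(1 / lam_M))            # paper's eta = 1/lam_M: r = 1 - 1/kappa
print(r(2 / (lam_m + lam_M)))  # r = (kappa - 1)/(kappa + 1), smaller

# brute-force the minimiser of r over a fine grid of stepsizes
best = min((i / 10000 for i in range(1, 10000)), key=r)
print(best)                    # minimiser sits at 2/(lam_m + lam_M)
```

With these numbers the paper's choice gives $r = 0.75 = 1-\kappa^{-1}$, while $\eta = 2/(\lambda_m+\lambda_M) = 0.4$ gives $r = 0.6 = (\kappa-1)/(\kappa+1)$, and the grid search finds no smaller value.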
Regarding equation (2) - the 2-norm of a matrix is its largest singular value. For a symmetric matrix, the largest singular value equals the largest absolute eigenvalue, and $I+\eta H(\Theta^*)$ is symmetric because Hessians are. Its eigenvalues are $1+\eta\lambda_i[H(\Theta^*)] = 1-\eta\lambda_i[-H(\Theta^*)]$, so $$ r = \left\|I+\eta H(\Theta^*)\right\| = \max_i\left|1-\eta\lambda_i[-H(\Theta^*)]\right|. $$ Since $|1-\eta\lambda|$ is a convex function of $\lambda$, its maximum over the spectrum of $-H(\Theta^*)$ is attained at one of the extreme eigenvalues $\lambda_m$ or $\lambda_M$, which is exactly eq. (2).
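As a numerical sanity check of that linear-algebra fact, here is a pure-Python sketch with made-up values for $H$ and $\eta$: it compares the largest absolute eigenvalue of $I+\eta H$ (closed-form $2\times 2$ formula) against the 2-norm approximated as $\sup_{\|u\|=1}\|(I+\eta H)u\|$ on a grid of directions, and against the expression in eq. (2).

```python
import math

# Hypothetical symmetric negative-definite Hessian and stepsize
eta = 0.2
H = [[-3.0, 1.0], [1.0, -2.0]]
A = [[1 + eta * H[0][0], eta * H[0][1]],
     [eta * H[1][0], 1 + eta * H[1][1]]]    # A = I + eta*H, also symmetric

# Largest absolute eigenvalue of A via the closed-form 2x2 formula
a, b, c = A[0][0], A[0][1], A[1][1]
disc = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
eig_max_abs = max(abs((a + c) / 2 + disc), abs((a + c) / 2 - disc))

# 2-norm of A: sup of |A u| over unit vectors u, approximated on a grid
norm2 = 0.0
for k in range(200000):
    t = 2 * math.pi * k / 200000
    u = (math.cos(t), math.sin(t))
    v = (A[0][0] * u[0] + A[0][1] * u[1],
         A[1][0] * u[0] + A[1][1] * u[1])
    norm2 = max(norm2, math.hypot(v[0], v[1]))

# Equation (2): the eigenvalues of -H here are 2.5 +/- sqrt(1.25)
lam_M = 2.5 + math.sqrt(1.25)
lam_m = 2.5 - math.sqrt(1.25)
r_eq2 = max(abs(1 - eta * lam_M), abs(1 - eta * lam_m))

print(eig_max_abs, norm2, r_eq2)  # all three values agree
```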
Regarding equation (3) - going back to eq. (1): if $r<1$, the error $\|\Theta^{(k)}-\Theta^*\|$ shrinks geometrically, so the sequence converges; if $r>1$, the bound no longer controls the error and the iteration can diverge (and at $r=1$ the bound only says the error does not grow). The value of $r$ is controlled by the stepsize $\eta$ through eq. (2): $r<1$ requires both $|1-\eta\lambda_M[-H(\Theta^*)]|<1$ and $|1-\eta\lambda_m[-H(\Theta^*)]|<1$. For an eigenvalue $\lambda>0$ (recall $-H(\Theta^*)$ is positive definite at a strict maximum) and $\eta>0$, $|1-\eta\lambda|<1$ is equivalent to $0<\eta\lambda<2$, i.e. $\eta<2/\lambda$. The binding constraint comes from the largest eigenvalue, which gives eq. (3): $0<\eta<2/\lambda_M[-H(\Theta^*)]$.
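To see condition (3) in action, here is a toy gradient-ascent sketch in pure Python, with made-up eigenvalues $1$ and $4$ for $-H$ (so the threshold is $2/\lambda_M = 0.5$): the quadratic $L(\Theta) = -\tfrac12(\theta_1^2 + 4\theta_2^2)$ is maximised at $\Theta^* = 0$, and each coordinate is scaled by $1-\eta\lambda_i$ per step.

```python
# Gradient ascent on L(theta) = -(theta1^2 + 4*theta2^2)/2, maximum at 0.
# Hessian H = -diag(1, 4), so lam_M[-H] = 4 and eq. (3) demands eta < 0.5.
def ascend(eta, steps=100):
    t1, t2 = 1.0, 1.0
    for _ in range(steps):
        # update theta <- theta + eta * grad L(theta)
        t1 += eta * (-1.0 * t1)
        t2 += eta * (-4.0 * t2)
    return max(abs(t1), abs(t2))

print(ascend(0.25))  # eta inside (0, 0.5): error shrinks toward 0
print(ascend(0.60))  # eta above 0.5: the stiff coordinate blows up
```

Each coordinate evolves as $(1-\eta\lambda_i)^k$, so the iteration converges exactly when both factors have modulus below one, matching eq. (3).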