I'm having some trouble completing exercise 2.37 in Bishop's Pattern Recognition and Machine Learning text. I'm not reading this text as part of a course, so this is not a homework question. Here's a paraphrased version of the exercise:
Verify that substituting the expression for a Gaussian distribution into the Robbins-Monro sequential estimation formula gives a result of the same form [as the MLE], and hence obtain an expression for the corresponding coefficients $a_N$.
Using the notation of the text, the Robbins-Monro update specialized for maximum likelihood estimation takes the following form: $$ \theta^{(N)} = \theta^{(N - 1)} - a_{N - 1} \frac{\partial}{\partial \theta^{(N - 1)}} \left[ -\ln p(x_N | \theta^{(N - 1)}) \right] \tag{1}, $$ where $x_N$ is the $N$th observation and $\theta^{(N)}$ are the values of the parameters of $p$ at iteration $N$.
What I've done so far
The MLE for $\Sigma$ (treating $\mu$ as known) can be written as follows: \begin{align*} \hat{\Sigma}_N &= \frac{1}{N} \sum_{n = 1}^N (x_n - \mu)(x_n - \mu)^t \\ &= \frac{1}{N} (x_N - \mu)(x_N - \mu)^t + \frac{N - 1}{N} \hat{\Sigma}_{N -1 } \\ &= \hat{\Sigma}_{N - 1} - \frac{1}{N} \left( \hat{\Sigma}_{N - 1} - (x_N - \mu)(x_N - \mu)^t \right) \tag{2}. \end{align*}
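As a quick numerical check (not part of the exercise), the recursive form (2) can be verified against the batch MLE in NumPy; the dimensions and sample size below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 50
mu = np.zeros(D)
X = rng.normal(size=(N, D))

# Batch MLE with known mean: (1/N) * sum_n (x_n - mu)(x_n - mu)^t
batch = sum(np.outer(x - mu, x - mu) for x in X) / N

# Recursive form (2): Sigma_N = Sigma_{N-1} - (1/N)(Sigma_{N-1} - (x_N - mu)(x_N - mu)^t)
Sigma = np.zeros((D, D))
for n, x in enumerate(X, start=1):
    Sigma = Sigma - (1.0 / n) * (Sigma - np.outer(x - mu, x - mu))

assert np.allclose(Sigma, batch)
```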
The goal of the exercise is to arrive at the expression above using the Robbins-Monro procedure. To this end, we consider the NLL of the multivariate Gaussian, which is given by $$ -\ln p(x | \theta) = \frac{D}{2} \ln(2 \pi) + \frac{1}{2} \ln |\Sigma| + \frac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu). $$ Differentiating with respect to $\Sigma$ causes the first term to vanish. For the second term, we have $$ \frac{\partial}{\partial \Sigma} \left( \frac{1}{2} \ln|\Sigma| \right) = \frac{1}{2} \Sigma^{-1}. $$ For the third term, one can show that $$ \frac{\partial}{\partial \Sigma} \left( \frac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu) \right) = -\frac{1}{2} \Sigma^{-1} (x - \mu) (x - \mu)^t \Sigma^{-1}. $$ Substituting these results into (1), with $x = x_N$, gives $$ \hat{\Sigma}_N = \hat{\Sigma}_{N - 1} - a_{N - 1} \left( \frac{1}{2} \hat{\Sigma}_{N - 1}^{-1} - \frac{1}{2} \hat{\Sigma}_{N - 1}^{-1} (x_N - \mu) (x_N - \mu)^t \hat{\Sigma}_{N - 1}^{-1} \right). $$ I'm not sure how to proceed from here. If we choose the matrix-valued coefficient $a_{N - 1} = \frac{2}{N} \hat{\Sigma}_{N - 1}^2$, then we get \begin{align*} \hat{\Sigma}_N &= \hat{\Sigma}_{N - 1} - \frac{2}{N} \hat{\Sigma}_{N - 1}^2 \left( \frac{1}{2} \hat{\Sigma}_{N - 1}^{-1} - \frac{1}{2} \hat{\Sigma}_{N - 1}^{-1} (x_N - \mu) (x_N - \mu)^t \hat{\Sigma}_{N - 1}^{-1} \right) \\ &= \hat{\Sigma}_{N - 1} - \frac{1}{N} \left( \hat{\Sigma}_{N - 1} - \hat{\Sigma}_{N - 1} (x_N - \mu) (x_N - \mu)^t \hat{\Sigma}_{N - 1}^{-1} \right). \end{align*} But this isn't of the same form as (2). Do you have any suggestions on how to proceed?
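As a sanity check on the two matrix derivatives above (again, not required by the exercise), they can be verified with entry-wise finite differences, treating all entries of $\Sigma$ as unconstrained variables; the helper `num_grad` below is my own:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 3
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)   # symmetric positive definite
v = rng.normal(size=D)            # plays the role of (x - mu)

def num_grad(f, S, eps=1e-6):
    """Entry-wise central differences of a scalar function of a matrix."""
    G = np.zeros_like(S)
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            E = np.zeros_like(S)
            E[i, j] = eps
            G[i, j] = (f(S + E) - f(S - E)) / (2 * eps)
    return G

Si = np.linalg.inv(Sigma)

# d/dSigma [ (1/2) ln|Sigma| ] = (1/2) Sigma^{-1}
g1 = num_grad(lambda S: 0.5 * np.log(np.linalg.det(S)), Sigma)
assert np.allclose(g1, 0.5 * Si, atol=1e-5)

# d/dSigma [ (1/2) v^t Sigma^{-1} v ] = -(1/2) Sigma^{-1} v v^t Sigma^{-1}
g2 = num_grad(lambda S: 0.5 * v @ np.linalg.inv(S) @ v, Sigma)
assert np.allclose(g2, -0.5 * Si @ np.outer(v, v) @ Si, atol=1e-5)
```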
$\hat{\Sigma}_{N-1}^{-1}(x_N - \mu)(x_N - \mu)^\top\hat{\Sigma}_{N-1}^{-1}$ is an outer product of two vectors: the column vector $\hat{\Sigma}_{N-1}^{-1}(x_N - \mu)$ and the row vector $(x_N - \mu)^\top\hat{\Sigma}_{N-1}^{-1}$. Since $\hat{\Sigma}_{N-1}^{-1}$ is a symmetric matrix, the latter is simply the transpose of the former:

$(x_N - \mu)^\top\hat{\Sigma}_{N-1}^{-1} = \left(\hat{\Sigma}_{N-1}^{-1}(x_N - \mu)\right)^\top.$

Now, if we informally allow the matrix and vector factors to commute (as they do exactly in the one-dimensional case), we can pull both copies of $\hat{\Sigma}_{N-1}^{-1}$ to the front:

$\hat{\Sigma}_{N-1}^{-1}(x_N - \mu)(x_N - \mu)^\top\hat{\Sigma}_{N-1}^{-1} \;\longrightarrow\; \hat{\Sigma}_{N-1}^{-1}\hat{\Sigma}_{N-1}^{-1}(x_N - \mu)(x_N - \mu)^\top.$

Applying the same informal commutation to the last line of your derivation, the leading $\hat{\Sigma}_{N-1}$ cancels against the trailing $\hat{\Sigma}_{N-1}^{-1}$, leaving just $(x_N - \mu)(x_N - \mu)^\top$, which gives (2).

This explanation is intuitive rather than formal (matrices do not commute in general, so the step above is a heuristic rather than an identity), but I hope it helps.
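To see that the heuristic is exact when everything does commute, here is a small NumPy check (variable names are mine) that in one dimension the Robbins-Monro update with $a_{N-1} = \frac{2}{N}\hat{\sigma}^4_{N-1}$ coincides with the recursive MLE (2):

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 0.0
xs = rng.normal(size=200)

s_rm = 1.0  # Robbins-Monro iterate; arbitrary positive start, overwritten by step 1
s_ml = 0.0  # recursive MLE (2), started from zero
for n, x in enumerate(xs, start=1):
    # a_{n-1} = (2/n) * sigma^4, the 1-D analogue of (2/N) * Sigma^2
    a = 2.0 * s_rm**2 / n
    # RM step: s - a * (1/(2s) - (x - mu)^2 / (2 s^2))
    s_rm = s_rm - a * (0.5 / s_rm - 0.5 * (x - mu) ** 2 / s_rm**2)
    # Recursive MLE (2): s - (1/n) * (s - (x - mu)^2)
    s_ml = s_ml - (1.0 / n) * (s_ml - (x - mu) ** 2)

assert np.isclose(s_rm, s_ml)
```

Algebraically the two updates are identical in 1-D: $\hat{\sigma}^2 - \frac{2\hat{\sigma}^4}{n}\left(\frac{1}{2\hat{\sigma}^2} - \frac{(x - \mu)^2}{2\hat{\sigma}^4}\right) = \hat{\sigma}^2 - \frac{1}{n}\left(\hat{\sigma}^2 - (x - \mu)^2\right)$, so the assertion holds up to floating-point rounding.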