Deriving MLE for covariance matrix using Robbins-Monro


I'm having some trouble completing exercise 2.37 in Bishop's Pattern Recognition and Machine Learning text. I'm not reading this text as part of a course, so this is not a homework question. Here's a paraphrased version of the exercise:

Verify that substituting the expression for a Gaussian distribution into the Robbins-Monro sequential estimation formula gives a result of the same form [as the MLE], and hence obtain an expression for the corresponding coefficients $a_N$.

Using the notation of the text, the Robbins-Monro update specialized for maximum likelihood estimation takes the following form: $$ \theta^{(N)} = \theta^{(N - 1)} - a_{N - 1} \frac{\partial}{\partial \theta^{(N - 1)}} \left[ -\ln p(x_N | \theta^{(N - 1)}) \right] \tag{1}, $$ where $x_N$ is the $N$th observation and $\theta^{(N)}$ are the values of the parameters of $p$ at iteration $N$.

What I did so far

The MLE for $\Sigma$ can be written as follows: \begin{align*} \hat{\Sigma}_N &= \frac{1}{N} \sum_{n = 1}^N (x_n - \mu)(x_n - \mu)^t \\ &= \frac{1}{N} (x_N - \mu)(x_N - \mu)^t + \frac{N - 1}{N} \hat{\Sigma}_{N -1 } \\ &= \hat{\Sigma}_{N - 1} - \frac{1}{N} \left( \hat{\Sigma}_{N - 1} - (x_N - \mu)(x_N - \mu)^t \right) \tag{2}. \end{align*}
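As a quick sanity check on the recursion (2) (not part of the exercise itself), here is a short NumPy sketch comparing the batch MLE with a known mean $\mu$ against the sequential form; the sample size and dimension are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 500
mu = np.zeros(D)
X = rng.normal(size=(N, D))  # samples x_1, ..., x_N

# Batch MLE for Sigma with known mean mu
batch = sum(np.outer(x - mu, x - mu) for x in X) / N

# Sequential form (2): Sigma_N = Sigma_{N-1} - (1/N)(Sigma_{N-1} - (x_N - mu)(x_N - mu)^t)
Sigma = np.zeros((D, D))
for n, x in enumerate(X, start=1):
    Sigma = Sigma - (Sigma - np.outer(x - mu, x - mu)) / n

assert np.allclose(Sigma, batch)
```

Starting from $\hat{\Sigma}_0 = 0$, the recursion reproduces the batch estimate exactly (up to floating-point error) at every $N$.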

The goal of the exercise is to arrive at the expression above using the Robbins-Monro procedure. To this end, we consider the negative log-likelihood of the multivariate Gaussian, which is given by $$ -\ln p(x | \theta) = \frac{D}{2} \ln(2 \pi) + \frac{1}{2} \ln |\Sigma| + \frac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu). $$

Differentiating with respect to $\Sigma$ causes the first term to vanish. For the second term, we have $$ \frac{\partial}{\partial \Sigma} \left( \frac{1}{2} \ln|\Sigma| \right) = \frac{1}{2} \Sigma^{-1}. $$ For the third term, one can show that $$ \frac{\partial}{\partial \Sigma} \left( \frac{1}{2} (x - \mu)^t \Sigma^{-1} (x - \mu) \right) = -\frac{1}{2} \Sigma^{-1} (x - \mu) (x - \mu)^t \Sigma^{-1}. $$

Substituting these results into (1) gives $$ \hat{\Sigma}_N = \hat{\Sigma}_{N - 1} - a_{N - 1} \left( \frac{1}{2} \hat{\Sigma}_{N - 1}^{-1} - \frac{1}{2} \hat{\Sigma}_{N - 1}^{-1} (x_N - \mu) (x_N - \mu)^t \hat{\Sigma}_{N - 1}^{-1} \right). $$

I'm not sure how to proceed from here. If we choose $a_{N - 1} = \frac{2}{N} \hat{\Sigma}_{N - 1}^2$, then we get \begin{align*} \hat{\Sigma}_N &= \hat{\Sigma}_{N - 1} - \frac{2}{N} \hat{\Sigma}_{N - 1}^2 \left( \frac{1}{2} \hat{\Sigma}_{N - 1}^{-1} - \frac{1}{2} \hat{\Sigma}_{N - 1}^{-1} (x_N - \mu) (x_N - \mu)^t \hat{\Sigma}_{N - 1}^{-1} \right) \\ &= \hat{\Sigma}_{N - 1} - \frac{1}{N} \left( \hat{\Sigma}_{N - 1} - \hat{\Sigma}_{N - 1} (x_N - \mu) (x_N - \mu)^t \hat{\Sigma}_{N - 1}^{-1} \right). \end{align*} But this isn't of the same form as (2). Do you have any suggestions on how to proceed?
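The two matrix derivatives quoted above can be checked numerically with entry-wise finite differences (a sketch, not part of the exercise; the matrix entries are perturbed independently, and the dimension and step size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
D, eps = 3, 1e-6
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)   # a random symmetric positive-definite matrix
v = rng.normal(size=D)            # stands in for (x - mu)
Sinv = np.linalg.inv(Sigma)

def f(S):
    # The two Sigma-dependent terms of the negative log-likelihood
    return 0.5 * np.log(np.linalg.det(S)) + 0.5 * v @ np.linalg.inv(S) @ v

# Central finite differences, one entry of Sigma at a time
grad_fd = np.zeros((D, D))
for i in range(D):
    for j in range(D):
        E = np.zeros((D, D))
        E[i, j] = eps
        grad_fd[i, j] = (f(Sigma + E) - f(Sigma - E)) / (2 * eps)

# Closed-form gradient: (1/2) Sigma^{-1} - (1/2) Sigma^{-1} v v^t Sigma^{-1}
grad_closed = 0.5 * Sinv - 0.5 * Sinv @ np.outer(v, v) @ Sinv
assert np.allclose(grad_fd, grad_closed, atol=1e-5)
```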

3 Answers

Accepted answer:

$\hat{\Sigma}_{N-1}^{-1}(x - \mu)(x - \mu)^\top\hat{\Sigma}_{N-1}^{-1}$ is the outer product of two vectors, $\hat{\Sigma}_{N-1}^{-1}(x - \mu)$ and $(x - \mu)^\top\hat{\Sigma}_{N-1}^{-1}$. Since $\hat{\Sigma}_{N-1}^{-1}$ is a symmetric matrix, these two vectors are transposes of each other:

$\left(\hat{\Sigma}_{N-1}^{-1}(x - \mu)\right)^\top = (x - \mu)^\top\hat{\Sigma}_{N-1}^{-1}.$

Treating the two factors as commuting (note that the former is a column vector whereas the latter is a row vector, so the dimensions must be kept consistent), the first expression becomes

$\hat{\Sigma}_{N-1}^{-1}(x - \mu)(x - \mu)^\top\hat{\Sigma}_{N-1}^{-1} = \hat{\Sigma}_{N-1}^{-1}\hat{\Sigma}_{N-1}^{-1}(x - \mu)(x - \mu)^\top.$

Using this conversion gives (2).

This explanation is intuitive rather than formal, but I hope it helps.

Another answer:

This exercise needs the additional assumption that $\Sigma$ is a diagonal matrix, so we can ignore all elements besides the diagonal. (The original answer gave the resulting derivation as an image.)

Another answer:

Maybe the difficulty is that the original Robbins-Monro algorithm is formulated in scalar form, while you are trying to derive a matrix form. I'm also working on this exercise. I try to tackle the problem by introducing a vector form and converting the matrix update into that vector form via vectorization.

Let $\theta_N\in\mathbb{R}^D$ and $A_N\in\mathbb{R}^{D\times D}$; then a vector form of the Robbins-Monro algorithm can be written as follows, $$ \theta_N=\theta_{N-1}+A_{N - 1} \frac{\partial \ln p(x_N | \theta_{N - 1})}{\partial \theta_{N - 1}}\tag{1}. $$ For a matrix-valued parameter $\theta_N\in\mathbb{R}^{D\times D}$, the update rule can be written in vector form via vectorization, i.e.,

$$ \text{vec}(\theta_N)=\text{vec}(\theta_{N-1})+A_{N - 1} \text{vec}\left(\frac{\partial \ln p(x_N | \theta_{N - 1})}{\partial \theta_{N - 1}}\right),\quad A_{N-1}\in\mathbb{R}^{D^2\times D^2}\tag{2}. $$

Simplifying the notation of $\hat{\Sigma}_N$ as $\Sigma_N$, the sequential update rule for the MLE of $\Sigma$ can be rewritten as

$$ \text{vec}(\Sigma_N) = \text{vec}(\Sigma_{N - 1}) + \frac{1}{N} \text{vec}\left((x_N - \mu)(x_N - \mu)^T - \Sigma_{N - 1}\right).\tag{3} $$

To derive a vector form of Robbins-Monro algorithm for $\Sigma$, we first notice that

$$ \begin{aligned} \frac{\partial\ln p(x_N|\mu,\Sigma_{N-1})}{\partial \Sigma_{N-1}} &=\frac{1}{2}\Sigma_{N-1}^{-1}(x_N-\mu)(x_N-\mu)^T\Sigma_{N-1}^{-1}-\frac{1}{2}\Sigma_{N-1}^{-1}\\ &=\frac{1}{2}\Sigma_{N-1}^{-1}\left((x_N-\mu)(x_N-\mu)^T-\Sigma_{N-1}\right)\Sigma_{N-1}^{-1}. \end{aligned} $$

Vectorizing the matrices on both sides, we have $$ \begin{aligned} \text{vec}\left(\frac{\partial\ln p(x_N|\mu,\Sigma_{N-1})}{\partial \Sigma_{N-1}}\right) &=\frac{1}{2}\text{vec}\left(\Sigma_{N-1}^{-1}\left((x_N-\mu)(x_N-\mu)^T-\Sigma_{N-1}\right)\Sigma_{N-1}^{-1}\right)\\ &=\frac{1}{2}(\Sigma_{N-1}^{-1}\otimes \Sigma_{N-1}^{-1})\text{vec}\left((x_N-\mu)(x_N-\mu)^T-\Sigma_{N-1}\right). \end{aligned} $$

Substituting the above result into Eq. (2), we find $$ \text{vec}(\Sigma_N) = \text{vec}(\Sigma_{N - 1}) + \frac{A_{N-1}}{2}(\Sigma_{N-1}^{-1}\otimes \Sigma_{N-1}^{-1})\text{vec}\left((x_N-\mu)(x_N-\mu)^T-\Sigma_{N-1}\right). $$

Setting $A_{N-1}=\frac{2}{N}(\Sigma_{N-1}^{-1}\otimes \Sigma_{N-1}^{-1})^{-1}=\frac{2}{N}(\Sigma_{N-1}\otimes \Sigma_{N-1})$, we recover Eq. (3).
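The key step here is the identity $\text{vec}(AXB) = (B^T \otimes A)\,\text{vec}(X)$, which for the symmetric matrix $\Sigma_{N-1}^{-1}$ gives $\text{vec}(\Sigma_{N-1}^{-1} M \Sigma_{N-1}^{-1}) = (\Sigma_{N-1}^{-1}\otimes \Sigma_{N-1}^{-1})\,\text{vec}(M)$. A small NumPy sketch (with arbitrary random data, using column-major flattening to match the usual $\text{vec}(\cdot)$ convention) verifies this and checks that the above choice of $A_{N-1}$ reduces the step to Eq. (3):

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 3, 10
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)    # Sigma_{N-1}, a random SPD matrix
Sinv = np.linalg.inv(Sigma)
v = rng.normal(size=D)             # stands in for (x_N - mu)
M = np.outer(v, v) - Sigma         # (x_N - mu)(x_N - mu)^T - Sigma_{N-1}

vec = lambda X: X.reshape(-1, order="F")   # column-stacking vec(.)

# Identity: vec(P M P) = (P ⊗ P) vec(M) for symmetric P = Sigma^{-1}
assert np.allclose(vec(Sinv @ M @ Sinv), np.kron(Sinv, Sinv) @ vec(M))

# With A_{N-1} = (2/N)(Sigma ⊗ Sigma), the Robbins-Monro step
# (A_{N-1}/2)(Sigma^{-1} ⊗ Sigma^{-1}) vec(M) collapses to (1/N) vec(M),
# i.e. exactly the increment in Eq. (3).
A_rm = (2 / N) * np.kron(Sigma, Sigma)
step = A_rm @ (0.5 * np.kron(Sinv, Sinv) @ vec(M))
assert np.allclose(step, vec(M) / N)
```

The collapse uses the mixed-product property $(\Sigma\otimes\Sigma)(\Sigma^{-1}\otimes\Sigma^{-1}) = I$.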

Comment. Eq. (1) and Eq. (2) are introduced without verifying the constraints on $A_N$. It is expected that $$ \begin{aligned} \lim_{N\to\infty}A_N&=O,\\ \left\|\sum^\infty_{N=1}A_N\right\| &=\infty,\\ \left\|\sum^\infty_{N=1}A_N^2\right\|&<\infty, \end{aligned} $$ where $\|\cdot\|$ is an arbitrary matrix norm, and the limit is taken over the matrix sequence $\{A_N\}^\infty_{N=1}$. If anyone finds a rigorous formulation of the vector/matrix form of the Robbins-Monro algorithm, please let me know.