I'm stuck on a problem that I expected to be simple, but I'm having trouble putting my finger on how to solve it rigorously. In its simplest instance the setting is linear estimation from linear noisy measurements, i.e., we observe
$$\mathbf y = \mathbf H \mathbf \theta + \mathbf w,$$
where $\mathbf H$ is a known $N \times M$ matrix with $M<N$ and full column rank (i.e., we are in the overdetermined case), and $\mathbf w \sim {\mathcal N}(\mathbf 0, \mathbf R_w)$ is zero-mean Gaussian noise. The issue is that ${\rm rank}(\mathbf R_w)=r<N$, and I need to figure out how to deal with this.
The standard approach to such problems would be to look at the log-likelihood, which gives something like ${\rm const} - \frac 12 (\mathbf y - \mathbf H \mathbf \theta)^{\rm T}\mathbf R_w^{-1}(\mathbf y - \mathbf H \mathbf \theta)$. Maximizing this over $\mathbf \theta$ yields the maximum likelihood estimator $$ \hat{\mathbf \theta} = \left(\mathbf H^{\rm T} \mathbf R_w^{-1} \mathbf H\right)^{-1} \mathbf H^{\rm T} \mathbf R_w^{-1} \mathbf y,$$ which coincides with the weighted least squares (WLS) estimator, is unbiased, and has all kinds of nice properties.
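For reference, here is a quick numerical sketch of this full-rank case (hypothetical sizes; the unbiasedness check uses the fact that $\hat{\mathbf \theta}$ is linear in $\mathbf y$, so it is unbiased iff the estimator matrix times $\mathbf H$ equals the identity):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 2  # hypothetical dimensions

H = rng.standard_normal((N, M))
A = rng.standard_normal((N, N))
R_w = A @ A.T  # full-rank SPD noise covariance

# WLS / ML estimator matrix: (H^T R_w^{-1} H)^{-1} H^T R_w^{-1}
Rinv = np.linalg.inv(R_w)
M_est = np.linalg.solve(H.T @ Rinv @ H, H.T @ Rinv)

# Unbiasedness: E[theta_hat] = M_est H theta, so M_est H must be I
print(np.allclose(M_est @ H, np.eye(M)))  # True
```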
The issue here is clear: since my $\mathbf R_w$ is not invertible, none of this works. I cannot even write down the likelihood properly. The density of $\mathbf y$ is singular; a proper expression for it would require Dirac distributions. However, this does not help me get to a maximum likelihood estimator.
My intuition tells me I should be able to get away with projecting $\mathbf y$ onto a lower-dimensional space where the effective noise vector has a full-rank covariance. But I'm not sure how to do this and how to show that the resulting estimate is still optimal. Or is the answer that an ML estimator does not exist in this case, so that it is not even clear what optimality means? That would surprise me, though: situations like this arise easily; all I need to do is repeat one of my observations in $\mathbf y$. I'm expecting a simple answer showing that this does not lead to a better estimate.
I'm thinking the way to go is to define some lower-dimensional $\mathbf z = \mathbf P \mathbf y = \mathbf P \mathbf H \mathbf \theta + \mathbf P \mathbf w$ where the effective noise vector $\mathbf P \mathbf w$ has a full-rank covariance, and then show that $\mathbf z$ is a sufficient statistic for estimating $\mathbf \theta$. But how do I construct $\mathbf P$? I started with the $r$ dominant eigenvectors of $\mathbf R_w$, but this seems wrong: if I project onto this space, I may kill part of $\mathbf H$, since $\mathbf H$ may live partly, or even entirely, in the orthogonal subspace. The latter case is funny, since then I could project $\mathbf y$ onto the column space of $\mathbf H$ and obtain a noise-free observation from which I can estimate $\mathbf \theta$ exactly. This might mean that I need some conditions on the ranges of $\mathbf H$ and $\mathbf R_w$.
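To make the "funny" case concrete, here is a toy example (hypothetical sizes) where the column space of $\mathbf H$ lies entirely in the nullspace of $\mathbf R_w$, so one coordinate of $\mathbf y$ observes $\mathbf \theta$ noise-free:

```python
import numpy as np

rng = np.random.default_rng(1)

# Noise lives only in the last two coordinates: null(R_w) = span{e1}
R_w = np.diag([0.0, 1.0, 1.0])
H = np.array([[1.0], [0.0], [0.0]])  # col(H) lies entirely in null(R_w)

theta = np.array([2.5])
w = np.concatenate([[0.0], rng.standard_normal(2)])  # w ~ N(0, R_w)
y = H @ theta + w

# Projecting y onto null(R_w) (here: reading off y[0]) is noise-free
theta_exact = np.array([y[0]])
print(np.allclose(theta_exact, theta))  # True
```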
Still, intuitively there should be a simple solution that coincides with WLS in the special case $r=N$. Can anyone help me shed some light on this?
After some more thought I can now answer my own question. Most importantly, and possibly helpful to others: in the case $\mathbf R_{\mathbf w}\succeq \mathbf 0$, the likelihood can be written as
$$ f_{\mathbf y}(\mathbf y | \mathbf \theta) = \begin{cases} {\rm const} \cdot {\rm e}^{-\frac 12 (\mathbf y - \mathbf H \mathbf \theta)^{\rm T} \mathbf R_{\mathbf w}^+ (\mathbf y - \mathbf H \mathbf \theta)} & \mathbf R_{\mathbf w}^\perp \cdot(\mathbf y - \mathbf H \mathbf \theta) = \mathbf 0 \\ 0 & \text{otherwise}, \end{cases}$$
where $\mathbf R_{\mathbf w}^+$ is a generalized inverse (e.g., the Moore–Penrose pseudoinverse) and $\mathbf R_{\mathbf w}^\perp$ is the orthogonal projector onto the orthogonal complement of the range of $\mathbf R_{\mathbf w}$. The best way of dealing with the constraint is to decompose $\mathbf \theta$ into a component in the row space of $\mathbf R_{\mathbf w}^\perp \mathbf H$ that satisfies the constraint and a component in its nullspace that can be chosen freely. We can then differentiate the log-likelihood with respect to the free component and set the derivative to zero to find the ML estimator.
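A quick numerical sanity check of the support condition (hypothetical sizes; I build $\mathbf R_{\mathbf w}^\perp$ from an SVD of $\mathbf R_{\mathbf w}$): for data generated from the model, the residual at the true $\mathbf \theta$ is $\mathbf w$, which lies in the range of $\mathbf R_{\mathbf w}$, so the constraint holds with probability one.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, r = 5, 2, 3  # hypothetical dimensions

H = rng.standard_normal((N, M))
A = rng.standard_normal((N, r))
R_w = A @ A.T                  # rank r < N

# Orthogonal projector onto the orthogonal complement of range(R_w)
U, s, _ = np.linalg.svd(R_w)
P = U[:, r:]                   # orthonormal basis of null(R_w)
R_perp = P @ P.T

theta = rng.standard_normal(M)
w = A @ rng.standard_normal(r) # w ~ N(0, R_w), supported on range(R_w)
y = H @ theta + w

# Support condition R_w^perp (y - H theta) = 0 at the true theta
print(np.linalg.norm(R_perp @ (y - H @ theta)))  # ~0 (numerical noise)
```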
As expected, the resulting ML estimator is linear, unbiased, and achieves the Cramér–Rao bound; it is therefore also the BLUE. As an extra bonus, for the case $r < N-M$, the constraints in the complement space of $\mathbf R_{\mathbf w}$ alone already determine $\mathbf \theta$, yielding an estimator with zero variance, as expected.
Edit: Since I was asked for the ML estimator: I haven't found a very compact way of writing it; here is my current best attempt: $$ \begin{align} \mathbf{\hat{\theta}}_{\rm ML} = \left(\mathbf F^+ \mathbf P^{\rm T} + \mathbf F^\perp \left[\mathbf F^{\perp^{\rm T}} \mathbf H^{\rm T} \mathbf \Sigma^+ \mathbf H \mathbf F^\perp\right]^{-1} \mathbf F^{\perp^{\rm T}} \mathbf H^{\rm T} \mathbf \Sigma^+ \left(\mathbf I - \mathbf H \mathbf F^+ \mathbf P^{\rm T} \right)\right)\mathbf y, \end{align} $$ where $\mathbf F = \mathbf P^{\rm T} \mathbf H\in \mathbb{R}^{(N-r) \times M}$, $\mathbf P \in \mathbb{R}^{N \times (N-r)}$ is an orthonormal basis for the null space of $\mathbf R_w$, $\mathbf \Sigma = \mathbf R_w$, and $\mathbf F^\perp$ is an orthonormal basis for the null space of $\mathbf F$.
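To sanity-check this formula numerically, here is a sketch (reading $\mathbf \Sigma$ as $\mathbf R_{\mathbf w}$ and $\mathbf F^\perp$ as an orthonormal basis for the null space of $\mathbf F$, which the dimensions seem to require; sizes are hypothetical and chosen so that $\mathbf F$ has a nontrivial null space). The estimator matrix times $\mathbf H$ should be the identity (unbiasedness), and multiplying by $\mathbf F$ should reproduce the constraint $\mathbf F \hat{\mathbf \theta} = \mathbf P^{\rm T} \mathbf y$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, r = 5, 3, 3          # here N - r = 2 < M, so F has a nontrivial null space

H = rng.standard_normal((N, M))
A = rng.standard_normal((N, r))
R_w = A @ A.T               # rank-r noise covariance (Sigma in the formula)

# P: orthonormal basis of null(R_w); F = P^T H
U, s, _ = np.linalg.svd(R_w)
P = U[:, r:]                           # N x (N-r)
F = P.T @ H                            # (N-r) x M
F_pinv = np.linalg.pinv(F)

# F_perp: orthonormal basis of null(F) (the freely choosable directions)
_, sF, VtF = np.linalg.svd(F)
rank_F = int(np.sum(sF > 1e-10))
F_perp = VtF[rank_F:].T                # M x (M - rank_F)

Sigma_pinv = np.linalg.pinv(R_w)

# Assemble the estimator matrix from the formula above
B = F_perp.T @ H.T @ Sigma_pinv @ H @ F_perp
M_est = (F_pinv @ P.T
         + F_perp @ np.linalg.solve(B, F_perp.T @ H.T @ Sigma_pinv
                                    @ (np.eye(N) - H @ F_pinv @ P.T)))

# Unbiasedness: E[theta_hat] = M_est H theta, so M_est H must be I
print(np.allclose(M_est @ H, np.eye(M)))  # True
# Constraint: F M_est = P^T, i.e., F theta_hat = P^T y for every y
print(np.allclose(F @ M_est, P.T))        # True
```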
Note that for $r < N-M$, this estimator simplifies to $$\mathbf{\hat{\theta}}_{\rm ML} = \mathbf F^+ \mathbf P^{\rm T} \mathbf y.$$
In this case, inserting $\mathbf y = \mathbf H \mathbf \theta + \mathbf w$ gives $\mathbf{\hat{\theta}}_{\rm ML} = \mathbf \theta$ (since $\mathbf P^{\rm T} \mathbf w = \mathbf 0$), i.e., we get an exact estimate with zero variance.
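A numerical illustration of this zero-variance case (hypothetical sizes with $r < N-M$, so $\mathbf F$ is tall with full column rank and $\mathbf F^+ \mathbf F = \mathbf I$):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, r = 6, 2, 3            # r = 3 < N - M = 4, so N - r = 3 > M

H = rng.standard_normal((N, M))
A = rng.standard_normal((N, r))
R_w = A @ A.T

U, _, _ = np.linalg.svd(R_w)
P = U[:, r:]                 # orthonormal basis of null(R_w)
F = P.T @ H                  # (N-r) x M, full column rank generically

theta = rng.standard_normal(M)
w = A @ rng.standard_normal(r)   # noise confined to range(R_w)
y = H @ theta + w

# P^T w = 0, so F^+ P^T y = F^+ F theta = theta exactly
theta_hat = np.linalg.pinv(F) @ (P.T @ y)
print(np.allclose(theta_hat, theta))  # True
```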