I am interested in a Bayesian treatment of (univariate) linear regression in the presence of correlated residuals, but I am somewhat stuck trying to come up with a neat parametrization for a conjugate prior over the regression coefficients.
Specifically, for $i\in\{1,\ldots, N\}$, let $y_i\in\mathbb{R}$, $x_i\in\mathbb{R}^p$, $\epsilon_i\in\mathbb{R}$, and let $\beta\in\mathbb{R}^p$. Consider the model given by
$$ y_i = x_i^T\beta + \epsilon_i\,, $$ where our residuals $\{\epsilon_i\}$ are correlated: $$ (\epsilon_1, \ldots, \epsilon_N) \sim \mathcal{N}(0, A)\,. $$
For some context, this model is fairly natural when the samples are spatially distributed and you expect samples taken close together to deviate from the global linear model in some systematic way, or when you expect some heteroskedasticity.
Back to the model. Let $\mathbf{X}$ be our $N\times p$ design matrix and let $y$ be our $N\times 1$ response vector: $$ \mathbf{X} = \begin{bmatrix} x_1^T\\ \vdots\\ x_N^T \end{bmatrix},\,\quad y = \begin{bmatrix} y_1\\ \vdots\\ y_N \end{bmatrix}\,. $$
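To make the setup concrete, here is a minimal NumPy sketch that simulates from this model. The AR(1)-style structure for $A$ and all specific values are just illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3

# Hypothetical design matrix and true coefficients (illustrative values).
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])

# A simple correlated-residual covariance: AR(1)-style, A[i, j] = rho^|i-j|.
rho = 0.6
A = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))

# Draw the correlated residuals jointly, then the response.
eps = rng.multivariate_normal(np.zeros(N), A)
y = X @ beta_true + eps
```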
For conjugate Bayesian analysis, I am referencing Kevin Murphy's *Conjugate Bayesian analysis of the Gaussian distribution*, where he writes the likelihood of the observations in terms of the parameters and some statistics of the observations.
We can write the likelihood: $$ p(y | \mathbf{X}, \beta, A) \propto |A|^{-\frac{1}{2}} \exp\left[ -\frac{1}{2} (y-\mathbf{X}\beta)^T A^{-1} (y-\mathbf{X}\beta)\right]\,. $$
Let $$ \hat{\beta} \equiv \left(\mathbf{X}^TA^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T A^{-1} y\,. $$
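This $\hat{\beta}$ is the usual generalized-least-squares (GLS) estimator. A small NumPy sketch, using linear solves rather than explicit matrix inverses for numerical stability:

```python
import numpy as np

def gls_estimate(X, y, A):
    """GLS estimator (X^T A^{-1} X)^{-1} X^T A^{-1} y, via linear solves."""
    Ainv_X = np.linalg.solve(A, X)   # A^{-1} X, without forming A^{-1}
    Ainv_y = np.linalg.solve(A, y)   # A^{-1} y
    return np.linalg.solve(X.T @ Ainv_X, X.T @ Ainv_y)

# With A = I the estimator reduces to ordinary least squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = rng.normal(size=30)
beta_ols = gls_estimate(X, y, np.eye(30))
```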
Then $$ \begin{align} (y-\mathbf{X}\beta)^T A^{-1} (y-\mathbf{X}\beta) &= \left[\left(y-\mathbf{X}\hat{\beta}\right) + \mathbf{X}\left(\hat{\beta} - \beta\right)\right]^T A^{-1} \left[\left(y-\mathbf{X}\hat{\beta}\right) + \mathbf{X}\left(\hat{\beta} - \beta\right)\right]\\ &= \left(y-\mathbf{X}\hat{\beta}\right)^T A^{-1} \left(y-\mathbf{X}\hat{\beta}\right) + \left(\hat{\beta} - \beta\right)^T \mathbf{X}^T A^{-1} \mathbf{X}\left(\hat{\beta} - \beta\right)\,, \end{align} $$ where the cross terms vanish because $\mathbf{X}^T A^{-1}\left(y-\mathbf{X}\hat{\beta}\right) = 0$ by the definition of $\hat{\beta}$ (the generalized normal equations).
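As a quick sanity check of this decomposition, here is a numerical verification with arbitrary made-up inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 20, 2
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
M = rng.normal(size=(N, N))
A = M @ M.T + N * np.eye(N)          # a well-conditioned SPD covariance

Ainv = np.linalg.inv(A)
beta_hat = np.linalg.solve(X.T @ Ainv @ X, X.T @ Ainv @ y)
beta = rng.normal(size=p)            # an arbitrary coefficient vector

lhs = (y - X @ beta) @ Ainv @ (y - X @ beta)
rhs = ((y - X @ beta_hat) @ Ainv @ (y - X @ beta_hat)
       + (beta_hat - beta) @ X.T @ Ainv @ X @ (beta_hat - beta))
assert np.allclose(lhs, rhs)         # the cross terms indeed vanish
```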
So $$ \begin{align} p(y | \mathbf{X}, \beta, A) &\propto |A|^{-\frac{1}{2}} \exp\left[ -\frac{1}{2} \left[\left(y-\mathbf{X}\hat{\beta}\right) ^T A^{-1} \left(y-\mathbf{X}\hat{\beta}\right)\right]\right] \exp\left[ -\frac{1}{2} \left[\left(\hat{\beta} - \beta\right)^T \mathbf{X}^T A^{-1} \mathbf{X}\left(\hat{\beta} - \beta\right)\right]\right]\\ &= |A|^{-\frac{1}{2}} \exp\left[ -\frac{1}{2} \text{tr} \left[A^{-1} \left(y-\mathbf{X}\hat{\beta}\right)\left(y-\mathbf{X}\hat{\beta}\right) ^T \right]\right] \exp\left[ -\frac{1}{2} \text{tr} \left[\mathbf{X}^T A^{-1} \mathbf{X}\left(\hat{\beta} - \beta\right) \left(\hat{\beta} - \beta\right)^T \right]\right]\,. \end{align} $$
The first half of this expression suggests an inverse Wishart prior for $A$, $$ A \sim \mathcal{W}^{-1}(\nu, \Sigma)\,, $$ while the second half suggests a conditional prior $$ \beta | A \sim \mathcal{N}\left(\beta_0, (L^T A^{-1} L)^{-1}\right) $$ with $L\in \mathbb{R}^{N\times p}$. But this is pretty unreasonable, since it amounts to specifying a full set of $N$ prior pseudo-observations instead of some low-dimensional sufficient statistic. I know this more or less falls out of the fact that $A$ is $N\times N$, but it feels like there "should" be a way to handle this nicely. I could drop the dependence of $\beta$ on $A$ entirely, but then I would lose the conjugacy of my joint prior, right?
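To make the objection concrete, here is what a draw from that joint prior would look like (a sketch using SciPy's `invwishart`; the hyperparameters are placeholders, and note that `L` really is a full $N\times p$ matrix of prior pseudo-observations):

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(2)
N, p = 10, 2

nu, Sigma = N + 2, np.eye(N)          # inverse-Wishart hyperparameters (placeholders)
L = rng.normal(size=(N, p))           # the awkward "prior design matrix"
beta0 = np.zeros(p)

# A ~ IW(nu, Sigma), then beta | A ~ N(beta0, (L^T A^{-1} L)^{-1}).
A = invwishart(df=nu, scale=Sigma).rvs(random_state=rng)
cov_beta = np.linalg.inv(L.T @ np.linalg.solve(A, L))
beta = rng.multivariate_normal(beta0, cov_beta)
```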
If we can write $\Sigma_{i,j}$ as a kernel function $k(x_i, x_j)$, it becomes very natural to specify the $N\times N$ prior over $A$. Is that structure of any use for the prior on $\beta \mid A$?
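For instance, a squared-exponential kernel over the sample covariates gives a natural $N\times N$ scale matrix (a sketch; the lengthscale, variance, and jitter values are arbitrary):

```python
import numpy as np

def rbf_kernel(xs, lengthscale=1.0, variance=1.0, jitter=1e-6):
    """Squared-exponential covariance with entries k(x_i, x_j)."""
    sq_dists = np.sum((xs[:, None, :] - xs[None, :, :]) ** 2, axis=-1)
    K = variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)
    return K + jitter * np.eye(len(xs))  # jitter keeps K positive definite

# e.g. five samples with 2-D covariates / locations
xs = np.random.default_rng(3).normal(size=(5, 2))
Sigma = rbf_kernel(xs)
```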
Thank you in advance!