Question on Maximum Likelihood Estimation of Linear Regression


I am studying a Tutorial on Maximum Likelihood Estimation in Linear Regression and I have a question.

When we have more than one regressor (a.k.a. multiple linear regression), the model takes the matrix form
$$y = X\beta + \epsilon, \tag{1}$$
where $y$ is the response vector, $X$ is the design matrix, each row of which specifies the design or conditions under which the corresponding response is observed (hence the name), $\beta$ is the vector of regression coefficients, and $\epsilon$ is the residual vector, distributed as a zero-mean multivariate Gaussian with diagonal covariance matrix, $\epsilon \sim \mathcal{N}(0, \sigma^2 I_N)$, where $I_N$ is the $N \times N$ identity matrix. Therefore
$$y \sim \mathcal{N}(X\beta, \sigma^2 I_N), \tag{2}$$
meaning that the linear combination $X\beta$ explains (or predicts) the response $y$ with uncertainty characterized by the variance $\sigma^2$.
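For concreteness, the model in (1) and (2) can be simulated directly. This is a minimal sketch with made-up dimensions, coefficients, and noise variance:

```python
import numpy as np

# Simulate y = X beta + eps with eps ~ N(0, sigma^2 I_N).
# All concrete values here are illustrative, not from the tutorial.
rng = np.random.default_rng(0)
N, p = 100, 3                       # N observations, p regressors
X = rng.normal(size=(N, p))         # design matrix: one row per observation
beta = np.array([2.0, -1.0, 0.5])   # true regression coefficients (assumed)
sigma2 = 0.25                       # true noise variance (assumed)

eps = rng.normal(0.0, np.sqrt(sigma2), size=N)  # eps ~ N(0, sigma^2 I_N)
y = X @ beta + eps                              # hence y ~ N(X beta, sigma^2 I_N)
```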

Assume $y$, $\beta$, and $\epsilon \in \mathbb{R}^N$. Under the model assumptions, we aim to estimate the unknown parameters ($\beta$ and $\sigma^2$) from the available data ($X$ and $y$).

Maximum likelihood (ML) is the most common estimation approach. We maximize the log-likelihood with respect to $\beta$ and $\sigma^2$:
$$\mathcal{L}(\beta,\sigma^2 \mid y,X) = -\frac{N}{2}\log 2\pi - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta).$$
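This log-likelihood can be coded up and checked numerically. Below is a minimal sketch with made-up data: the ML estimate of $\beta$ is the least-squares solution, and the ML estimate of $\sigma^2$ is the residual sum of squares divided by $N$ (note: $N$, not $N-p$, for the ML estimator):

```python
import numpy as np

def log_likelihood(beta, sigma2, y, X):
    """The tutorial's log-likelihood L(beta, sigma^2 | y, X)."""
    N = len(y)
    r = y - X @ beta                      # residual vector
    return (-0.5 * N * np.log(2 * np.pi)
            - 0.5 * N * np.log(sigma2)
            - (r @ r) / (2 * sigma2))

# Made-up data for illustration.
rng = np.random.default_rng(1)
N, p = 200, 2
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -0.5])
y = X @ beta_true + rng.normal(0.0, 0.3, size=N)   # true sigma^2 = 0.09

# Closed-form ML estimates: beta_hat solves least squares; sigma2_hat = RSS / N.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / N
```

By construction, no other choice of $(\beta, \sigma^2)$ attains a higher value of `log_likelihood` on this data, including the true parameters.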

I am trying to understand how the log-likelihood $\mathcal{L}(\beta,\sigma^2 \mid y,X)$ is formed. Normally, I see these problems stated with $\mathbf{x}_i$ as a vector of size $d$ ($d$ being the number of features per data point). Specifically, when $\mathbf{x}_i$ is a vector, I wrote the log-likelihood as
$$\ln \prod_{i=1}^{N} \frac{1}{\sqrt{(2\pi)^d \sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(\mathbf{x}_i - \boldsymbol\mu)^T(\mathbf{x}_i - \boldsymbol\mu)\right) = \sum_{i}\ln \frac{1}{\sqrt{(2\pi)^d \sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(\mathbf{x}_i - \boldsymbol\mu)^T(\mathbf{x}_i - \boldsymbol\mu)\right).$$
But in the case shown in this tutorial, there is no index $i$ over which to apply the summation.

I would appreciate any insights on this problem. Thanks in advance.


On BEST ANSWER

I think it's relatively easy to get mixed up here due to notation. In the case you present from the textbook, they're considering a product of one-dimensional Gaussians that are independent of each other, and then writing the result as a single multi-dimensional Gaussian (the covariance matrix of this multi-dimensional Gaussian is then exactly $\sigma^2 I$). E.g., note that each sample satisfies $y_i - (X\beta)_i \sim N(0, \sigma^2)$. Taking the product of these densities yields the multi-dimensional Gaussian above.
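This equivalence is easy to verify numerically. The sketch below (with made-up numbers) compares the sum of $N$ one-dimensional Gaussian log-densities $N(\mu_i, \sigma^2)$ against the single $N$-dimensional Gaussian log-density $N(\mu, \sigma^2 I_N)$; they agree up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
mu = rng.normal(size=N)        # plays the role of X @ beta (made up)
sigma2 = 0.7                   # made-up noise variance
y = mu + rng.normal(0.0, np.sqrt(sigma2), size=N)

# Sum of N independent one-dimensional Gaussian log-densities N(mu_i, sigma^2):
sum_1d = np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (y - mu) ** 2 / (2 * sigma2))

# Single N-dimensional Gaussian log-density N(mu, sigma^2 I_N):
cov = sigma2 * np.eye(N)
r = y - mu
multi = (-0.5 * N * np.log(2 * np.pi)
         - 0.5 * np.log(np.linalg.det(cov))
         - 0.5 * (r @ np.linalg.solve(cov, r)))

print(np.isclose(sum_1d, multi))  # prints True: the two forms coincide
```

In other words, the per-sample index $i$ has not disappeared in the tutorial's formula; the summation over $i$ is absorbed into the quadratic form $(y - X\beta)^T(y - X\beta) = \sum_i (y_i - (X\beta)_i)^2$.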

In your exposition, on the other hand, you're writing a multi-dimensional Gaussian whose samples are i.i.d. (all sharing the same mean $\boldsymbol\mu$); this is different from what the textbook is referring to, since there the mean changes between distributions (the samples are independent, but not identically distributed: we're observing different data points $(X\beta)_i$ with some additional noise, $\varepsilon_i \sim N(0, \sigma^2)$).