Distributional assumptions in maximum likelihood estimation (MLE) and least squares estimation (LSE)


Many textbooks mention that MLE needs some distributional assumptions, but I could never find which ones they are. The LSE, by contrast, does not need distributional assumptions, but when certain distributional properties are missing, the estimator may not have nice properties. Which properties these are isn't mentioned either. Hence my questions:

1) Which distributional assumptions does the MLE require?

2) Under which distributional assumptions does the LSE have which favourable properties?

Best answer:

In the basic linear model $$Y_t=\beta_0+\beta_1X_{1,t}+\cdots+\beta_kX_{k,t}+u_t,$$ the $X_{i,t}$ values are fixed (that is, not randomly determined but chosen before the $Y_t$ values are measured) and the $u_t$ are random variables with some probability distribution.

That the LSE do not need a distribution to be specified for $u_t$ is evident, since the LSE are defined as the values $$\hat \beta_0,\ldots,\hat \beta_k$$ at which the function $$S(\hat \beta_0,\ldots,\hat \beta_k)=\sum_{t=1}^T(Y_t-\hat\beta_0-\hat\beta_1X_{1,t}-\cdots-\hat\beta_kX_{k,t})^2$$ attains its minimum, and these values do not depend on the distribution of the $u_t$ variables.
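The minimization above can be sketched numerically. This is a minimal illustration with made-up data (one regressor, hypothetical true values $\beta_0=2$, $\beta_1=0.5$); `np.linalg.lstsq` solves exactly the least-squares minimization and never looks at the error distribution:

```python
import numpy as np

# Hypothetical data for Y_t = beta_0 + beta_1 * X_t + u_t
# with illustrative true values beta_0 = 2.0, beta_1 = 0.5.
rng = np.random.default_rng(0)
T = 200
X = np.linspace(0.0, 10.0, T)          # fixed (non-random) regressor values
u = rng.normal(0.0, 1.0, T)            # errors; the LSE itself ignores their law
Y = 2.0 + 0.5 * X + u

# LSE: minimize S(b0, b1) = sum_t (Y_t - b0 - b1 * X_t)^2.
Z = np.column_stack([np.ones(T), X])   # design matrix with columns [1, X_t]
beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
```

The resulting `beta_hat` is close to the true $(2.0,\ 0.5)$; swapping the normal errors for any other zero-mean noise changes nothing in how the estimate is computed.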

On the other hand, the MLE are defined as the values $$\hat \beta_0,\ldots,\hat \beta_k$$ at which the likelihood function $$L(\hat \beta_0,\ldots,\hat \beta_k)=\prod_{t=1}^T f(\hat u_t)=\prod_{t=1}^T f(Y_t-\hat\beta_0-\hat\beta_1X_{1,t}-\cdots-\hat\beta_kX_{k,t})$$ reaches a maximum. But here $f$ is the probability density function of the $u_t$ variables, so assuming a specific distribution for those $u_t$ random variables is an unavoidable first step of maximum likelihood estimation.
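To make the "you must choose $f$ first" point concrete, here is a sketch of the MLE for the same toy model, where we pick (as one possible assumption) $f$ to be the $N(0,\sigma^2)$ density and maximize the log-likelihood numerically:

```python
import numpy as np
from scipy.optimize import minimize

# Same hypothetical data-generating setup as before.
rng = np.random.default_rng(1)
T = 200
X = np.linspace(0.0, 10.0, T)
Y = 2.0 + 0.5 * X + rng.normal(0.0, 1.0, T)

def neg_log_lik(params):
    """Negative Gaussian log-likelihood; the choice f = N(0, sigma^2)
    had to be made before this function could even be written down."""
    b0, b1, log_sigma = params          # optimize log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    resid = Y - b0 - b1 * X             # hat u_t = Y_t - b0 - b1 * X_t
    return (0.5 * T * np.log(2 * np.pi) + T * np.log(sigma)
            + 0.5 * np.sum(resid**2) / sigma**2)

res = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0], method="BFGS")
b0_mle, b1_mle, sigma_mle = res.x[0], res.x[1], np.exp(res.x[2])
```

Replacing the Gaussian density in `neg_log_lik` with, say, a Laplace or Student-$t$ density gives a different likelihood and, in general, different estimates.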

More precisely: there is not one set of MLE for the linear model, but one for each specific distributional assumption. And while the properties of those estimators vary from case to case, basic properties such as consistency and asymptotic efficiency hold in most cases, since they are shared by all MLE under quite general regularity conditions.

On the other hand, there is just one set of LSE... but their properties depend on the actual distribution of the $u_t$ random variables.

To summarize some known results, assume (as stated at the beginning) that the $X_{i,t}$ values are not random. Then, if the $\hat\beta_i$ are the LSE:

  • if $E(u_t)=0,\;\forall t$, then $E(\hat\beta_i)=\beta_i$ (the LSE are unbiased);
  • if —in addition— $$Var(u_t)=\sigma^2_u,\;\forall t$$ (the same for all $t$, also known as the homoskedasticity assumption) and if $$cov(u_t,u_s)=0,\; t\neq s$$ (no autocorrelation assumption) then the LSE have minimum variance among all linear unbiased estimators (BLUE);
  • if also the distribution for $u_t$ is the normal distribution, then the LSE have minimum variance among ALL unbiased estimators, that is, they are efficient estimators (also known as MVUE or UMVUE).
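The first bullet can be checked by simulation: unbiasedness needs only $E(u_t)=0$, not normality. The sketch below (illustrative numbers) uses skewed, centred-exponential errors and averages the LSE over many replications with the same fixed $X$ values:

```python
import numpy as np

# Monte Carlo check of unbiasedness under non-normal, zero-mean errors.
rng = np.random.default_rng(2)
T, reps = 50, 5000
X = np.linspace(0.0, 10.0, T)
Z = np.column_stack([np.ones(T), X])    # fixed design, reused in every replication
true_beta = np.array([2.0, 0.5])        # illustrative true coefficients

estimates = np.empty((reps, 2))
for r in range(reps):
    u = rng.exponential(1.0, T) - 1.0   # E(u_t) = 0, but strongly skewed
    Y = Z @ true_beta + u
    estimates[r], *_ = np.linalg.lstsq(Z, Y, rcond=None)

avg_beta = estimates.mean(axis=0)       # close to (2.0, 0.5)
```

The average of the estimates across replications approximates $E(\hat\beta_i)$, and it lands near the true coefficients even though the errors are far from normal.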

Regarding the usual distributional assumption to perform MLE, actually it depends on the characteristics of the model; but in the simplest case it is quite usual to assume a normal distribution with zero mean, homoskedasticity and no autocorrelation, that is $$u_t\sim N(0,\sigma^2_u),\; \forall t,\quad \wedge \quad cov(u_t,u_s)=0,\; t\neq s, $$ in which case the MLE for the $\beta_i$ parameters coincide with the LSE. Also, $\sigma^2_u$ can be estimated by maximum likelihood if unknown, and it's estimator turns out to be $$\hat\sigma_u^2=\frac1n \sum_{t=1}^T \hat u_t,$$ where $$\hat u_t=Y_t-\hat\beta_0-\hat\beta_1X_{1,t}-\cdots-\hat\beta_kX_{k,t}.$$