So, I am slowly getting introduced to generalized method of moments (GMM), but I am getting confused over some issues, and this is one of them:
I heard that GMM addresses the problem that a single estimator may not be able to satisfy several moment conditions at once, for example $E(x_t\epsilon_t) = 0$ and $E(\epsilon_t) = 0$. But I am having a hard time understanding how the GMM estimator function is created in this case: we can write down two moment functions (the two above) that are zero in expectation, but how are they combined into the single objective function that GMM requires?
In other words, in slide 13 of http://homepage.univie.ac.at/robert.kunst/gmm.pdf, there is OLS as GMM, but I am having a hard time understanding how a function is being created. Can anyone explain this?
Edit: OK, so it seems that what I am really struggling with is this: in OLS, it is often said that we need to satisfy both the moment condition above and a variance condition (that the variance of the error is always the same). But when OLS is cast as GMM, instrumental variables $k_t$ are used so that $E(k_t \epsilon_t) = 0$, and no further conditions are imposed. So what's going on?
Let's consider OLS first, then see how it can be viewed as a special case for GMM. I will not discuss GMM in general, just instrumental variables (IV). IV is an immediate generalization of OLS when the regressors are no longer predetermined.
OLS
No statistical assumptions
If we remove all statistical assumptions (except that the underlying model is linear), then the problem is a linear algebraic one:
$$ Y = X \beta, $$
where $Y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$ (so there are $p$ regressors and $n$ observations). Assuming $X$ is full-rank, then the least-squares solution, from just linear algebra, is
$$ \hat{\beta} = (X^TX)^{-1}X^TY. $$
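As a purely algebraic sanity check, here is a minimal sketch of that formula in NumPy (the data below are made up for illustration): with no noise and a full-rank $X$, solving the normal equations recovers $\beta$ exactly.

```python
import numpy as np

# Illustrative data: n = 100 observations, p = 3 regressors (values are arbitrary).
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta  # exact linear relation, no error term

# Least-squares solution from the normal equations: beta_hat = (X'X)^{-1} X'Y.
# (Solving the system is numerically preferable to forming the inverse.)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

print(beta_hat)  # recovers [1.0, -2.0, 0.5] up to floating point
```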
Small sample properties
Now consider the linear model
$$ y_i = X_i \beta + \epsilon_i, $$
where $(x_i, \epsilon_i)$, $i = 1, \cdots, n$, are drawn from a probability space ("the model" or "DGP") $( \prod_{i=1}^{n} \mathbb{R}^{p + 1}, \mu)$.
Statistical assumptions one needs to put on the model to get good small sample properties for $\hat{\beta}$:
1. Full-rank assumption. The random matrix $X$ must be full-rank $\mu$-a.s. for $\hat{\beta}$ to be well-defined.
2. Strict exogeneity: the conditional expectation $E[\epsilon|X] = 0 \in \mathbb{R}^n$. This implies that $\hat{\beta}$ is unbiased.
3. Conditional homoskedasticity: $Var(\epsilon|X) = \sigma^2 I \in \mathbb{R}^{n \times n}$. This makes $\hat{\beta}$ BLUE: it has the smallest variance among linear estimators that are unbiased for every value of $\beta$.
If assumption 3 is strengthened so that $\epsilon \in \mathbb{R}^n$ is multivariate normal, then the estimation problem is parametric and $\hat{\beta}$ becomes the MLE. Because the only source of error in $\hat{\beta}$ is $\epsilon$, this distributional assumption also pins down the distributions of test statistics such as the t- and F-statistics.
Large sample properties
Now replace the model by $( \prod_{i=1}^{\infty} \mathbb{R}^{p + 1}, \mu)$ and consider the behavior of $\hat{\beta}$ as the sample size $n \rightarrow \infty$. Again start with the linear algebra and rewrite
$$ \hat{\beta} = (\frac{X^TX}{n})^{-1}\frac{X^TY}{n}. $$
The terms $\frac{X^TX}{n}$ and $\frac{X^TY}{n}$ are just sample means. To connect these sample means to the true means, one needs assumptions on the model.
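To make the "just sample means" point concrete, here is a small sketch (with made-up data): $\frac{X^TX}{n}$ is the sample mean of the $p \times p$ outer products $x_i^T x_i$, $\frac{X^TY}{n}$ is the sample mean of the vectors $x_i^T y_i$, and the rescaled formula gives exactly the same estimator as the original one.

```python
import numpy as np

# Illustrative data (not from the original text).
rng = np.random.default_rng(1)
n, p = 500, 2
X = rng.standard_normal((n, p))
Y = X @ np.array([0.3, 1.2]) + rng.standard_normal(n)

# (X'X / n): sample mean of the p x p outer products x_i' x_i.
mean_xx = np.mean([np.outer(X[i], X[i]) for i in range(n)], axis=0)
# (X'Y / n): sample mean of the p-vectors x_i' y_i.
mean_xy = np.mean([X[i] * Y[i] for i in range(n)], axis=0)

# Dividing both factors by n cancels, so the estimator is unchanged.
beta_rescaled = np.linalg.solve(mean_xx, mean_xy)
beta_plain = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(beta_rescaled, beta_plain))  # True
```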
1. Strict stationarity. The sequence of random vectors $(x_i, \epsilon_i)$, $i = 1, 2, \cdots$, needs to form a strictly stationary "time series". This makes the expectations $E(x_i^T x_i) \in \mathbb{R}^{p \times p}$ and $E(x_i \epsilon_i) \in \mathbb{R}^p$ independent of $i$.
2. The second-moment matrix $E(x_i^T x_i) \in \mathbb{R}^{p \times p}$ is positive definite, and $E(x_i \epsilon_i) = 0 \in \mathbb{R}^p$. The second condition, $E(x_i \epsilon_i) = 0$ (predeterminedness), is the one that makes OLS a special case of IV. It is weaker than strict exogeneity.
3. Ergodicity. The sequence of random vectors $(x_i, \epsilon_i)$, $i = 1, 2, \cdots$, needs to form an ergodic "time series".
Assumptions 1. and 3. allow you to use the Law of Large Numbers on the estimation error
$$ \hat{\beta} - \beta = (\frac{X^TX}{n})^{-1}\frac{X^T\epsilon}{n}. $$
Because sample means converge to the true means ($\mu$-a.s.), you have consistency of OLS.
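A quick simulation sketch of this consistency argument (the DGP below is made up, chosen so that $E(x_i \epsilon_i) = 0$ holds by construction): the estimation error $\|\hat{\beta} - \beta\|$ shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = np.array([2.0, -1.0])

def ols_error(n):
    """Draw one sample of size n and return the OLS estimation error."""
    X = rng.standard_normal((n, 2))
    eps = rng.standard_normal(n)  # independent of X, so E(x_i eps_i) = 0
    Y = X @ beta + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    return np.linalg.norm(beta_hat - beta)

errors = {n: ols_error(n) for n in (100, 10_000, 1_000_000)}
print(errors)  # the error shrinks as n grows, roughly like 1/sqrt(n)
```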
IV
So for OLS, the key assumptions on the relationship between the regressors $x_i$ and the error $\epsilon_i$ that give you nice properties are strict exogeneity $E[\epsilon|X] = 0$ (for the small-sample properties) and predeterminedness $E(x_i \epsilon_i) = 0$ (for consistency).
But in econometrics sometimes you may not have these. For example, in your model
$$ y_i = X_i \beta + \epsilon_i, $$
there may be error in measuring the regressor. Instead of observing $X_i$, you observe $W_i = X_i + \eta_i$. Rewriting the model in terms of the observed regressor, $y_i = W_i \beta + (\epsilon_i - \eta_i \beta)$, it's not hard to see that $W_i$ is not orthogonal to the new error term $\epsilon_i - \eta_i \beta$ (both contain $\eta_i$), so predeterminedness fails.
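A simulation sketch of that failure (the DGP is illustrative, with one regressor and $Var(x) = Var(\eta) = 1$): OLS on the mismeasured regressor is biased toward zero, and the bias does not go away with more data.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta = 1.0
x = rng.standard_normal(n)
eps = rng.standard_normal(n)
eta = rng.standard_normal(n)   # measurement noise
y = x * beta + eps
w = x + eta                    # we only observe the noisy regressor

# One-regressor OLS of y on w: (w'w)^{-1} w'y.
beta_ols = (w @ y) / (w @ w)

# With Var(x) = Var(eta) = 1, the probability limit is
# beta * Var(x) / (Var(x) + Var(eta)) = 0.5, not 1.0 (attenuation bias).
print(beta_ols)
```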
This is where IV comes in. Suppose you have another set of variables $z_i \in \mathbb{R}^q$ (the instruments, with $q \geq p$) such that (assuming now $\{ (z_i, x_i, \epsilon_i) \}$ is still ergodic and stationary)

1. Orthogonality: $E(z_i^T \epsilon_i) = 0 \in \mathbb{R}^q$, and
2. Relevance: $E(z_i^T x_i) \in \mathbb{R}^{q \times p}$ has full column rank.

Then the IV estimator
$$ \hat{\beta}_{IV} = (X^T Z Z^T X)^{-1} X^T Z Z^T Y $$
is consistent by exactly the same arguments used for consistency of OLS. If you replace $z_i$ by $x_i$, IV becomes OLS.
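Continuing the measurement-error sketch above, here is the IV estimator in action. The instrument $z$ is a hypothetical second, independently noisy measurement of $x$: it is correlated with $x$ (relevance) but independent of both $\epsilon$ and $\eta$ (orthogonality to the composite error $\epsilon - \eta\beta$).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
beta = 1.0
x = rng.standard_normal(n)
eps = rng.standard_normal(n)
eta = rng.standard_normal(n)
y = x * beta + eps
w = x + eta                     # observed (mismeasured) regressor
z = x + rng.standard_normal(n)  # instrument: a second noisy measurement of x

# Apply the formula beta_IV = (W'Z Z'W)^{-1} W'Z Z'Y from the answer
# (here p = q = 1, so all the matrix products are scalars).
W = w[:, None]
Z = z[:, None]
beta_iv = np.linalg.solve(W.T @ Z @ Z.T @ W, W.T @ Z @ Z.T @ y)

print(beta_iv)  # close to the true beta = 1.0, unlike OLS on w
```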