I have been studying OLS in matrix form, and I understand that when $y=X\beta +\varepsilon$ with $E[\varepsilon|X]=0$ and $Cov[\varepsilon|X]=\sigma^{2}I$, the estimator $\hat{\beta}=(X^TX)^{-1}X^{T}y$ is the best linear unbiased estimator (BLUE), i.e. it has the smallest variance among linear unbiased estimators. I have two questions:
- What is the formal definition of the 'variance' of the random vector $\hat{\beta}$? I have seen several different definitions: (a) every entry on the diagonal of the covariance matrix of $\hat{\beta}$ should be no larger than the corresponding entry for any other linear unbiased estimator; (b) the trace of the covariance matrix of $\hat{\beta}$ should be no larger than that of any other linear unbiased estimator; (c) $Cov[\tilde {\beta}] - Cov[\hat{\beta}]$ should be positive semi-definite, where $\tilde{\beta}$ denotes any other linear unbiased estimator. Which one appears in the original Gauss-Markov theorem?
- How do I actually solve for the BLUE in the first place? In the reference materials I have found, the proof works by making the educated guess $\tilde{\beta}=[(X^TX)^{-1}X^{T}+D]y$ and deriving a contradiction, but what if I did not know what form to start with and simply wanted to derive the BLUE via optimization? This loops back to question 1, since I am not sure how to formulate the objective function.
Edit on question 1: I have come to realize that $Cov[\tilde {\beta}] - Cov[\hat{\beta}]$ being positive semi-definite guarantees that its trace and all of its diagonal entries are non-negative.
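Concretely, the implication is a one-line check: if $M:=Cov[\tilde{\beta}]-Cov[\hat{\beta}]$ is positive semi-definite, then testing it against the standard basis vectors $e_k$ gives

```latex
M_{kk} = e_k^T M\, e_k \ge 0 \quad\text{for each } k,
\qquad\text{and hence}\qquad
\operatorname{tr}(M) = \sum_{k=1}^{p} M_{kk} \ge 0 .
```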
There are multiple versions of the Gauss-Markov theorem, each one using a different concept of variance. Here are a few that I am aware of; I'm not sure which one is the 'original' one.
[Forms (1), (2), (3) correspond to your first, third, and second examples respectively.]
In form (1) the variance is the familiar variance of a random variable, since $v^TCy$ is a scalar.
In (2) the variance is the familiar covariance matrix of a random vector: $\operatorname{Var}(Cy)=E[(Cy-\beta)(Cy-\beta)^T]$ (using unbiasedness, $E[Cy]=\beta$). Note that 'minimize' here refers to the partial ordering on symmetric matrices: $M\ge N$ means $M-N$ is positive semi-definite.
In (3) the quantity $E(\| Cy-\beta\|^2)$ can be regarded as a type of variance: it is the expectation of $(Cy-\beta)^T(Cy-\beta)$, the squared Euclidean ($\ell^2$) distance between the vector $Cy$ and its expectation $\beta$, and it equals the trace of the covariance matrix.
Forms (1) and (2) are equivalent by the definition of a positive semi-definite matrix. Form (2) implies form (3) because of the remark you made about the trace of a positive semi-definite matrix; I don't believe the converse holds in general, since minimizing the trace need not force the matrix ordering.
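To make these orderings concrete, here is a small numerical check (my own illustration, not part of the theorem): build an arbitrary alternative linear unbiased estimator $C=\hat C+D$ with $DX=0$ and compare the covariances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
sigma2 = 2.0  # error variance; any positive value works

# OLS coefficient matrix C_hat = (X^T X)^{-1} X^T, with Cov(C_hat y) = sigma^2 (X^T X)^{-1}
C_hat = np.linalg.solve(X.T @ X, X.T)
cov_hat = sigma2 * (C_hat @ C_hat.T)

# An arbitrary alternative linear unbiased estimator C = C_hat + D with D X = 0,
# obtained by projecting a random matrix onto the orthogonal complement of col(X).
H = X @ C_hat                          # hat matrix: projection onto col(X)
D = rng.normal(size=(p, n)) @ (np.eye(n) - H)
C = C_hat + D
assert np.allclose(C @ X, np.eye(p))   # C y is still unbiased

cov_alt = sigma2 * (C @ C.T)
diff = cov_alt - cov_hat               # equals sigma^2 D D^T, hence PSD

eigs = np.linalg.eigvalsh(diff)
assert eigs.min() >= -1e-8              # form (2): matrix ordering
assert np.trace(diff) >= -1e-10         # form (3): trace ordering
assert np.all(np.diag(diff) >= -1e-10)  # diagonal-entry ordering
```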
If you know least squares, then you are familiar with the fact that $\hat Cy$, with $\hat C:=(X^TX)^{-1}X^T$, is the least squares estimate of $\beta$ and is unbiased, so it is not totally unreasonable to write a generic $C$ as $C=\hat C+D$. Otherwise, the choice of $\hat C$ seems to come out of nowhere. If you want to deduce $\hat C$ from scratch, form (3) of the Gauss-Markov theorem offers a reasonable derivation via optimization.
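For completeness, here is how the guess-based argument concludes once $C=\hat C+D$ is written down (a standard computation, sketched in my own notation). Unbiasedness forces $DX=0$, since $CX=(\hat C+D)X=I$ and $\hat CX=(X^TX)^{-1}X^TX=I$. Then

```latex
\begin{aligned}
\operatorname{Cov}(Cy) &= \sigma^2 C C^T = \sigma^2 (\hat C + D)(\hat C + D)^T \\
&= \sigma^2\!\left[(X^TX)^{-1} + D D^T\right]
   \qquad\text{since } \hat C D^T = (X^TX)^{-1}(DX)^T = 0 \\
&\ge \sigma^2 (X^TX)^{-1} = \operatorname{Cov}(\hat C y),
\end{aligned}
```

where $\ge$ is the positive semi-definite ordering, because $DD^T$ is positive semi-definite.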
The task in form (3) can be viewed as a set of constrained minimum-norm problems. To see this:

- Unbiasedness of $Cy$ for $\beta$ is equivalent to $CX=I_{p\times p}$, and then $Cy=\beta+C\varepsilon$, so the objective simplifies to $E\| Cy-\beta\|^2 = E\|C\varepsilon\|^2 = \sigma^2\sum_{k=1}^p \langle C_k,C_k\rangle$, where $C_k$ denotes the $k$th row of $C$ and $\langle u,v\rangle:=u^Tv$ is the usual inner product on $R^n$. (The constant factor $\sigma^2>0$ does not affect the minimization.)
- The matrix equality $CX=I_{p\times p}$ is equivalent to the $p^2$ scalar constraints $\langle C_k,X_j\rangle =\delta_{jk}$, where $X_j$ denotes the $j$th column of $X$ and $\delta_{jk}$ is the Kronecker delta.
- Since the objective is a sum over rows and each constraint involves only one row, the minimization decomposes into $p$ separate norm-minimization problems, one for each row $C_k$.
- By the Projection Theorem, each row of the minimizing $\hat C$ must be a linear combination of the columns of $X$; equivalently, $\hat C=BX^T$ for some $p\times p$ matrix $B$.
- Recalling the constraint $I=\hat CX$, we have $I=\hat CX=BX^TX$, whence $B=(X^TX)^{-1}$ and finally $\hat C=(X^TX)^{-1}X^T$.
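The row-by-row derivation can be checked numerically (a sketch assuming numpy; for an underdetermined full-rank system, `np.linalg.lstsq` returns exactly the minimum-norm solution, which is what the Projection Theorem picks out):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))

# Solve the p separate minimum-norm problems:
#   minimize ||c||  subject to  X^T c = e_k,
# one for each standard basis vector e_k.
rows = []
for k in range(p):
    e_k = np.eye(p)[k]
    c_k, *_ = np.linalg.lstsq(X.T, e_k, rcond=None)  # minimum-norm solution
    rows.append(c_k)
C_opt = np.vstack(rows)

# The closed form reached at the end of the derivation:
C_hat = np.linalg.solve(X.T @ X, X.T)
assert np.allclose(C_opt, C_hat)   # the optimizer recovers (X^T X)^{-1} X^T
```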