Consider a normal error regression model. In matrix form, the least squares criterion is $$Q=(Y-X\beta)^T(Y-X\beta),$$ where $Q$ is the quantity we want to minimize to obtain the least squares estimates, $Y$ is an ($n \times 1$) vector, $X$ is an ($n \times 2$) matrix, and $\beta$ is a ($2 \times 1$) vector.
To actually obtain the least squares estimates, we differentiate $Q$ with respect to $\beta$ as follows:
- First expand the quantity Q: $$Q=Y^TY-\beta^TX^TY-Y^TX\beta+\beta^TX^TX\beta$$
- Use the fact that $Y^TX\beta = \beta^TX^TY$ (both are scalars, and each is the transpose of the other) to combine the middle terms: $$Q=Y^TY-2\beta^TX^TY+\beta^TX^TX\beta$$
- Then take the derivative of Q with respect to $\beta$ and equate the result to the zero vector, $0$: $${\partial Q\over\partial\beta}=-2X^TY+2X^TX\beta=0$$
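As a numerical sanity check (a sketch with made-up data; the variable names are illustrative), the $\beta$ that solves the normal equations $X^TX\beta = X^TY$ agrees with NumPy's least squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simple linear regression design: a column of ones plus one predictor,
# so X is (n x 2) as in the question.
n = 50
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

# Solve the normal equations X^T X beta = X^T Y directly.
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Compare against NumPy's built-in least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))  # True
```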
However, I can't see how the transpose $\beta^T$ disappears in the equation above when the derivative of $Q$ is taken with respect to $\beta$.
I believe the steps I am missing are elementary matrix calculus, but how do you differentiate an expression that contains both $\beta^T$ and $\beta$ in its terms (for example, $Q=Y^TY-2\beta^TX^TY+\beta^TX^TX\beta$ from our earlier example) with respect to a vector $\beta$?
It is indeed a basic result of matrix calculus. For a vector $x$ and symmetric matrix $A$ you have $$ \frac{\partial }{ \partial x} x' A x = 2x'A. $$
You can derive this result by writing out the quadratic form $x'Ax$ explicitly: $$ x'A x = \sum_i \sum_j x_i x_j a_{ij} = \sum_i x_i^2a_{ii} + 2\sum_{i > j}x_i x_ja_{ij}, $$ where the second equality uses the symmetry of $A$. Now, taking the partial derivative w.r.t. $x_k$, you get $$ \frac{\partial }{ \partial x_k} x'Ax = 2 x_k a_{kk} + 2 \sum_{i \neq k} x_ia_{ik} = 2 x' a_{\cdot k}, $$ where $a_{\cdot k}$ is the $k$th column of $A$. Doing the same for every $k$, you get $$ 2x'A. $$ In the least squares derivation $x = \beta$ and $A = X'X$ (which is symmetric), and the gradient is written as a column vector rather than a row vector, but the procedure is the same.
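The identity can also be checked numerically. Below is a finite-difference sketch (the random $A$ and $x$ are purely illustrative) comparing the analytic gradient $2Ax$, written in column-vector layout, against central differences of the quadratic form:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random symmetric matrix A and evaluation point x.
M = rng.normal(size=(4, 4))
A = (M + M.T) / 2
x = rng.normal(size=4)

def q(v):
    # The quadratic form v^T A v.
    return v @ A @ v

# Analytic gradient of x^T A x for symmetric A: 2 A x.
grad_analytic = 2 * A @ x

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_numeric = np.array([
    (q(x + eps * e) - q(x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```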