From time to time, I come across differentiation operations that are carried out with respect to a vector. For example, the least squares estimation model with more than one explanatory variable is written as: $$y_i = \beta_1 + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \epsilon_i $$
In matrix form, this becomes: $$ y = Xb + e $$
where $y$ is the $N \times 1$ column vector of target variables, $X$ is the $N \times k$ matrix of observations, $b$ is the $k \times 1$ column vector of estimates of the $\beta$ values, and $e$ is the $N \times 1$ column vector of residuals.
Rearranging as $e = y - Xb$, the aim is to minimize the sum of the squared residuals: $\sum_i e_i^2 = \|e\|^2$.
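To make this concrete for myself, here is a small NumPy sketch (with made-up random data; all variable names are mine) confirming that the sum of squared residuals, the squared norm, and the quadratic form $e^T e$ are the same number:

```python
import numpy as np

# Hypothetical small example: N = 4 observations, k = 2 coefficients.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2))
y = rng.standard_normal(4)
b = rng.standard_normal(2)

e = y - X @ b  # residual vector

# Three equivalent expressions for the objective:
sum_of_squares = np.sum(e**2)            # sum_i e_i^2
squared_norm = np.linalg.norm(e) ** 2    # ||e||^2
quadratic_form = e @ e                   # e^T e

assert np.isclose(sum_of_squares, squared_norm)
assert np.isclose(sum_of_squares, quadratic_form)
```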
Now, $\|e\|^2$ is a function of the vector $b$, and according to my naive understanding, we have to compute the partial derivative $\dfrac{\partial(\|e\|^2)}{\partial b_i }$ for each component $b_i$ of $b$, set each of them to zero, and solve the resulting system of equations simultaneously.
But an alternative way of differentiating is shown as $\dfrac{\partial(\|e\|^2)}{\partial b } = -2X^Ty + 2X^TXb$, where the derivative is taken with respect to the vector $b$ itself.
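I also checked this formula numerically (a NumPy sketch with random data; all names are mine): the components of $-2X^Ty + 2X^TXb$ do match finite-difference approximations of the individual partials $\partial(\|e\|^2)/\partial b_i$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 5, 3
X = rng.standard_normal((N, k))
y = rng.standard_normal(N)
b = rng.standard_normal(k)

def f(b):
    """The objective ||e||^2 = (y - Xb)^T (y - Xb)."""
    e = y - X @ b
    return e @ e

# Closed-form gradient from the formula above.
grad = -2 * X.T @ y + 2 * X.T @ X @ b

# Central finite differences, one partial derivative per component b_i.
h = 1e-6
fd = np.array([
    (f(b + h * np.eye(k)[i]) - f(b - h * np.eye(k)[i])) / (2 * h)
    for i in range(k)
])

assert np.allclose(grad, fd, atol=1e-4)
```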
Differentiation with respect to a vector is a new concept for me. Is it a genuinely new operation, or is it just a reorganization of the individual partial derivatives with respect to the separate components of $b$ into a unified matrix form? What exactly is going on here?
Thanks in advance.
You might have seen this written as $\nabla (\|e\|^2)$, the gradient. It is simply the vector whose $i$-th component is the partial derivative $\frac{\partial}{\partial b_i}$. So nothing new is going on: it is exactly the reorganization you describe, with the $k$ partial derivatives collected into a single column vector, so that the $k$ equations $\frac{\partial(\|e\|^2)}{\partial b_i} = 0$ become one vector equation.
$$\left(\frac{\partial f}{\partial b}\right)_i = \frac{\partial f}{\partial b_i}$$
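To see the payoff of the vector form: setting your gradient $-2X^Ty + 2X^TXb$ to zero gives the normal equations $X^TX b = X^Ty$ all at once, and solving them agrees with a standard least-squares routine. A NumPy sketch with random data (names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
N, k = 50, 3
X = rng.standard_normal((N, k))
y = rng.standard_normal(N)

# Gradient = 0  =>  X^T X b = X^T y  (the normal equations).
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# A library least-squares solver finds the same minimizer.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(b_normal, b_lstsq)
```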